Feeling of Presence Maximization: mmWave-Enabled Virtual Reality Meets Deep Reinforcement Learning

Peng Yang, , Tony Q. S. Quek, , Jingxuan Chen, Chaoqun You, , and Xianbin Cao P. Yang, T. Q. S. Quek, and C. You are with the Information Systems Technology and Design, Singapore University of Technology and Design, 487372 Singapore. J. Chen and X. Cao are with the School of Electronic and Information Engineering, Beihang University, Beijing 100083, China.

Abstract

This paper investigates the problem of providing ultra-reliable and energy-efficient virtual reality (VR) experiences for wireless mobile users. To ensure reliable ultra-high-definition (UHD) video frame delivery to mobile users and enhance their immersive visual experiences, a coordinated multipoint (CoMP) transmission technique and millimeter wave (mmWave) communications are exploited. Owing to user movement and time-varying wireless channels, the wireless VR experience enhancement problem is formulated as a sequence-dependent and mixed-integer problem with a goal of maximizing users’ feeling of presence (FoP) in the virtual world, subject to power consumption constraints on access points (APs) and users’ head-mounted displays (HMDs). The problem, however, is hard to be directly solved due to the lack of users’ accurate tracking information and the sequence-dependent and mixed-integer characteristics. To overcome this challenge, we develop a parallel echo state network (ESN) learning method to predict users’ tracking information by training fresh and historical tracking samples separately collected by APs. With the learnt results, we propose a deep reinforcement learning (DRL) based optimization algorithm to solve the formulated problem. In this algorithm, we implement deep neural networks (DNNs) as a scalable solution to produce integer decision variables and solve a continuous power control problem to criticize the integer decision variables. Finally, the performance of the proposed algorithm is compared with various benchmark algorithms, and the impact of different design parameters is also discussed. Simulation results demonstrate that the proposed algorithm is more $4.14\%$ energy-efficient than the benchmark algorithms.

Index Terms:

Virtual reality, coordinated multipoint transmission, feeling of presence, parallel echo state network, deep reinforcement learning

I Introduction

Virtual reality (VR) applications have attracted tremendous interest in various fields, including entertainment, education, manufacturing, transportation, healthcare, and many other consumer-oriented services [1]. These applications exhibit enormous potential in the next generation of multimedia content envisioned by enterprises and consumers through providing richer and more engaging, and immersive experiences. According to market research [2], the VR ecosystem is predicted to be an $80$ billion market by 2025, roughly the size of the desktop PC market today.

However, several major challenges need to be overcome such that businesses and consumers can get fully on board with VR technology [3], one of which is to provide compelling content. To this aim, the resolution of provided content must be guaranteed. In VR applications, VR wearers can either view objects up close or across a wide field of view (FoV) via head-mounted or goggle-type displays (HMDs). As a result, very subtle defects such as poorly rendering pixels at any point on an HMD may be observed by a user up close, which may degrade users’ truly visual experiences. To create visually realistic images across the HMD, it must have more display pixels per eye, which indicates that ultra-high-definition (UHD) video frame transmission must be enabled for VR applications. However, the transmission of UHD video frames typically requires $4-5$ times the system bandwidth occupied for delivering a regular high-definition (HD) video [4, 5]. Further, to achieve good user visual experiences, the motion-to-photon latency should be ultra-low (e.g., $10-25$ ms) [6, 7, 8]. High motion-to-photon values will send conflicting signals to the Vestibulo-ocular reflex (VOR) and then might cause dizziness or motion sickness.

Hence, today’s high-end VR systems such as Oculus Rift [9] and HTC Vive [10] that offer high quality and accurate positional tracking remain tethered to deliver UHD VR video frames while satisfying the stringent low-latency requirement. Nevertheless, wired VR display may degrade users’ seamless visual experiences due to the constraint on the movement of users. Besides, a tethered VR headset presents a potential tripping hazard for users. Therefore, to provide ultimate VR experiences, VR systems or at least the headset component should be untethered [6].

Recently, the investigation on wireless VR has attracted numerous attention from both industry and academe; of particular interest is how to a) develop mobile (wireless and lightweight) HMDs, b) how to enable seamless and immersive VR experiences on mobile HMDs in a bandwidth-efficiency manner, while satisfying ultra-low-latency requirements.

I-A Related work

On the aspect of designing lightweight VR HMDs, considering heavy image processing tasks, which are usually insufficient in the graphics processing unit (GPU) of a local HMD, one might be persuaded to transfer the image processing from the local HMD to a cloud or network edge units (e.g., edge servers, base stations, and access points (APs)). For example, the work in [1] proposed to enable mobile VR with lightweight VR glasses by completing computation-intensive tasks (such as encoding and rendering) on a cloud/edge server and then delivering video streams to users. The framework of fog radio access networks, which could significantly relieve the computation burden by taking full advantages of the edge fog computing, was explored in [11] to facilitate the lightweight HMD design.

In terms of proposing VR solutions with improved bandwidth utilization, current studies can be classified into two groups: tiling and video coding [12] As for tiling, some VR solutions propose to spatially divide VR video frames into small parts called tiles, and only tiles within users’ FoV are delivered to users [13, 14, 15]. The FoV of a user is defined as the extent of the observable environment at any given time. By sending HD tiles in users’ FoV, the bandwidth utilization is improved. On the aspect of video coding, the VR video is encoded into multiple versions of different quality levels. Viewers receive appropriate versions based on their viewing directions [16].

Summarily, to improve bandwidth utilization, the aforementioned works [13, 14, 15, 16] either transmit relatively narrow user FoV or deliver HD video frames. Nevertheless, wider FoV is significantly important for a user to have immersive and presence experiences. Meanwhile, transmitting UHD video frames can enhance users’ visual experiences. To this aim, advanced wireless communication techniques (particularly, millimeter wave (mmWave)), which can significantly improve data rates and reduce propagation latency via providing wide bandwidth transmission, are explored in VR video transmission [4, 17, 18]. For example, the work in [4] utilized a mmWave-enabled communication architecture to support the panoramic and UHD VR video transmission. Aiming to improve users’ immersive VR experiences in a wireless multi-user VR network, a mmWave multicast transmission framework was developed in [17]. Besides, the mmWave communication for ultra-reliable and low latency wireless VR was investigated in [18].

I-B Motivation and contributions

Although mmWave techniques can alleviate the current bottleneck for UHD video delivery, mmWave links are prone to outage as they require line-of-sight (LoS) propagation. Various physical obstacles in the environment (including users’ bodies) may completely break mmWave links [19]. As a result, VR requirements for a perceptible image-quality degradation-free uniform experience cannot be accommodated. However, the mmWave VR-related works in [4, 17, 18] did not effectively investigate the crucial issue of guaranteeing the transmission reliability of VR video frames. To significantly improve the transmission reliability of VR video frames under low-latency constraints, the coordinated multipoint (CoMP) transmission technique, which can improve the reliability via spatial diversity, can be explored [20]. Besides, it is extensively considered that proactive computing (e.g., image processing or frame rendering) enabled by adopting machine learning methods is a crucial ability for a wireless VR network to mandate the stringent low-latency requirement of UHD VR video transmission [1, 21, 19, 22]. Therefore, this paper investigates the issue of maximizing users’ feeling of presence (FoP) in their virtual world in a mmWave-enabled VR network incorporating CoMP transmission and machine learning. The main contributions of this paper are summarized as follows:

•

Owing to the user movement and the time-varying wireless channel conditions, we formulate the issue of maximizing users’ FoP in virtual environments as a mixed-integer and sequential decision problem, subject to power consumption constraints on APs and users’ HMDs. This problem is difficult to be directly solved by exploring conventional numerical optimization methods due to the lack of accurate users’ tracking information (including users’ locations and orientation angles) and mixed-integer and sequence-dependent characteristics.
•

As users’ historical tracking information is separately collected by diverse APs, a parallel echo state network (ESN) learning method is designed to predict users’ tracking information while accelerating the learning process.
•

With the predicted results, we develop a deep reinforcement learning (DRL) based optimization algorithm to tackle the mixed-integer and sequential decision problem. Particularly, to avoid generating infeasible solutions by simultaneously optimizing all variables while alleviating the curse of dimensionality issue, the DRL-based optimization algorithm decomposes the formulated mixed-integer optimization problem into an integer association optimization problem and a continuous power control problem. Next, deep neural networks (DNNs) with continuous action output spaces followed by an action quantization scheme are implemented to solve the integer association problem. Given the association results, the power control problem is solved to criticize them and optimize the transmit power.
•

Finally, the performance of the proposed DRL-based optimization algorithm is compared with various benchmark algorithms, and the impact of different design parameters is also discussed. Simulation results demonstrate the effectiveness of the proposed algorithm.

II System Model and problem formulation

As shown in Fig. 2, we consider a mmWave-enabled VR network incorporating a CoMP transmission technique. This network includes a centralized unit (CU) connecting to $J$ distributed units (DUs) via optical fiber links, a set $\mathcal{J}$ of $J$ access points (APs) connected with the DUs, and a set of $\mathcal{U}$ of $N$ ground mobile users wearing HMDs. To acquire immersive and interactive experiences, users will report their tracking information to their connected APs via reliable uplink communication links. Further, with collected users’ tracking information, the CU will centrally simulate and construct virtual environments and coordinately transmit UHD VR videos to users via all APs in real time. To accomplish the task of enhancing users’ immersive and interactive experiences in virtual environments, joint uplink and downlink communications should be considered. We assume that APs and users can work at both mmWave (exactly, 28 GHz) and sub-6 GHz frequency bands, where the mmWave frequency band is reserved for downlink UHD VR video delivery, and the sub-6 GHz frequency band is allocated for uplink users’ tracking information transmission. This is because an ultra-high data rate can be achieved on the mmWave frequency band, and sub-6 GHz can support reliable communications. Besides, to theoretically model the joint uplink and downlink communications, we suppose that the time domain is discretized into a sequence of time slots in the mmWave-enabled VR network and conduct the system modelling including uplink and downlink transmission models, FoP model, and power consumption model.

Refer to caption — Figure 1: A mmWave-enabled VR network incorporating CoMP transmission.

II-A Uplink and downlink transmission models

II-A1 Uplink transmission model

Denote ${\bm{x}}_{it}^{\rm 3D}=[x_{it},y_{it},h_{i}]^{\rm T}$ as the three dimensional (3D) Cartesian coordinate of the HMD worn by user $i$ for all $i\in{\mathcal{U}}$ at time slot $t$ and $h_{i}\sim{\mathcal{N}}(\bar{h},\sigma_{h}^{2})$ is the user height. $[x_{it},y_{it}]^{\rm T}$ is the two dimensional (2D) location of user $i$ at time slot $t$ . Denote ${{\bm{v}}_{j}^{\rm 3D}}=[x_{j},y_{j},H_{j}]^{\rm T}$ as the 3D coordinate of the antenna of AP $j$ and $H_{j}$ is the antenna height. Owing to the reliability requirement, users’ data information (e.g., users’ tracking information and profiles) is required to be successfully decoded by corresponding APs. We express the condition that an AP can successfully decode the received user data packets as follows

SNR_{ijt}^{\rm ul}=\frac{a_{ijt}^{\rm ul}p_{it}c_{ij}{\hat{h}_{ijt}}}{{N_{0}}W^{\rm ul}/N}\geq{\theta^{\rm th}},\forall i,j,t,

(1)

where $a_{ijt}^{\rm ul}\in\{0,1\}$ is an association variable indicating whether user $i$ ’s uplink data packets can be successfully decoded by AP $j$ at time slot $t$ . The data packets can be decoded if $a_{ijt}^{\rm ul}=1$ ; otherwise, $a_{ijt}^{\rm ul}=0$ . $p_{it}$ is the uplink transmit power of user $i$ ’s HMD, $c_{ij}$ is the Rayleigh channel gain, $\hat{h}_{ijt}={d_{ijt}^{-\alpha}({{\bm{x}}_{it}^{\rm 3D}},{{\bm{v}}_{j}^{\rm 3D}})}$ is the uplink path-loss from user $i$ to AP $j$ with $\alpha$ being the fading exponent, $d_{ijt}(\cdot)$ denotes the Euclidean distance between user $i$ and AP $j$ , $N_{0}$ denotes the single-side noise spectral density, $W^{\rm ul}$ represents the uplink bandwidth. $\theta^{\rm th}$ is the target signal-to-noise ratio (SNR) experienced at AP $j$ for successfully decoding data packets from user $i$ . Besides, considering the reliability requirement of uplink transmission and the stringent power constraint on HMDs, frequency division multiplexing (FDM) technique is adopted in this paper. The adoption of FDM technique can avoid the decoding failure resulting from uplink signal interferences and significantly reduce power consumption without compensating the signal-to-interference-plus-noise ratio (SINR) loss caused by uplink interferences.

Additionally, we assume that each user $i$ can connect to at most one AP $j$ via the uplink channel at each time slot $t$ , i.e., $\sum\nolimits_{j\in{\mathcal{J}}}{a_{ijt}^{\rm ul}}\leq 1$ , $\forall i$ . This is reasonable because it is unnecessary for each AP to decode all users’ data successfully at each time slot $t$ . A user merely connects to an AP (e.g., the nearest AP if possible) will greatly reduce power consumption. Meanwhile, considering the stringent low-latency requirements of VR applications and the time consumption of processing (e.g., decoding and checking) received user data packets, we assume that an AP can serve up to $\tilde{M}$ users during a time slot, i.e., $\sum\nolimits_{i\in{\mathcal{U}}}{a_{ijt}^{\rm ul}}\leq\tilde{M}$ , $\forall j$ .

II-A2 Downlink transmission model

In the downlink transmission configuration, antenna arrays are deployed to perform directional beamforming. For analysis facilitation, a sectored antenna model [23], which consists of four components, i.e., the half-power beamwidth $\phi$ , the antenna downtilt angle $\theta_{j}$ $\forall j$ , the antenna gain of the mainlobe $G$ , and the antenna gain of the sidelobe $g$ , shown in Fig. 2, is exploited to approximate actual array beam patterns. The antenna gain of the transmission link from AP $j$ to user $i$ is

{f_{ijt}}=\left\{{\begin{array}[]{*{20}{c}}G&{\angle B_{j}C_{j}D_{it}\leq\frac{\phi}{2},}\\ g&{\rm otherwise,}\end{array}}\right.\forall i,j,t,

(2)

where $\angle B_{j}C_{j}D_{it}$ represents user $i$ ’s tilt angle towards AP $j$ , the location of the point ‘ $B_{j}$ ’ can be determined by AP $j$ ’s 2D coordinate $\bm{v}_{j}^{\rm 2D}=[x_{j},y_{j}]^{\rm T}$ and $\theta_{j}$ , the point ‘ $D_{it}$ ’ represent user $i$ ’s position, the point ‘ $C_{j}$ ’ denotes the position of AP $j$ ’s antenna.

For any AP $j$ , the 2D coordinate $\bm{x}_{bj}^{\rm 2D}=[x_{bj},y_{bj}]^{\rm T}$ of point ‘ $B_{j}$ ’ can be given by

{x_{bj}}={d_{j}}({x_{o}}-{x_{j}})/{r_{j}}+{x_{j}},\forall j,

(3)

{y_{bj}}={d_{j}}({y_{o}}-{y_{j}})/{r_{j}}+{y_{j}},\forall j,

(4)

where $d_{j}=H_{j}/\tan(\theta_{j})$ , $r_{j}=||\bm{x}_{o}-\bm{v}_{j}^{\rm 2D}||_{2}$ , and $\bm{x}_{o}=[x_{o},y_{o}]^{\rm T}$ is 2D coordinate of the center point of the considered communication area.

Then, user $i$ ’s tilt angle towards AP $j$ can be written as

\angle B_{j}C_{j}D_{it}=\arccos\left(\frac{{\overrightarrow{{C_{j}}{B_{j}}}\cdot\overrightarrow{{C_{j}}{D_{it}}}}}{{{{\left\|{{C_{j}}{B_{j}}}\right\|}_{2}}{{\left\|{{C_{j}}{D_{it}}}\right\|}_{2}}}}\right),\forall i,j,t,

(5)

where direction vectors ${\overrightarrow{{C_{j}}{B_{j}}}}=(x_{bj}-x_{j},y_{bj}-y_{j},-H_{j})$ and ${\overrightarrow{{C_{j}}{D_{it}}}}=(x_{it}-x_{j},y_{it}-y_{j},h_{i}-H_{j})$ .

A mmWave link may be blocked if a user turns around; this is because the user wears an HMD in front of his/her forehead. Denote $\vartheta$ as the maximum angle within which an AP can experience LoS transmission towards its downlink associated users. For user $i$ at time slot $t$ , an indicator variable $b_{ijt}$ introduced to indicate the blockage effect of user $i$ ’s body is given by

{b_{ijt}}=\left\{{\begin{array}[]{*{20}{c}}1&\angle({\vec{A}_{jit}},{\vec{x}_{it}})>\vartheta,\\ 0&{\rm otherwise,}\end{array}}\right.\forall i,j,t,

(6)

where $\angle({\vec{A}_{jit}},{\vec{x}_{it}})$ represents the orientation angle of user $i$ at time slot $t$ , which can be determined by locations of both user $i$ and AP $j$ ,¹¹1In this paper, we consider the case of determining users’ orientation angles via the locations of both APs and users. Certainly, our proposed learning method is also applicable to scenarios where users’ orientation angles need to be predicted. ${\vec{x}}_{it}=(x_{it}-x_{it-1},y_{it}-y_{it-1})$ is a direction vector. When $t=1$ , the direction vector ${\vec{x}}_{i1}=(x_{i1},y_{i1})$ . ${\vec{A}_{jit}}=(x_{j}-x_{it},y_{j}-y_{it})$ is a direction vector between the AP $j$ and user $i$ .

Given ${\vec{A}_{jit}}$ and ${\vec{x}_{it}}$ , we can calculate the orientation angle of user $i$ that is also the angle between ${\vec{A}_{jit}}$ and ${\vec{x}}_{it}$ by

\angle({\vec{A}_{jit}},{\vec{x}_{it}})=\arccos\left(\frac{{\vec{A}_{jit}}\cdot{\vec{x}_{it}}}{||{\vec{A}_{jit}}||_{2}||{\vec{x}_{it}}||_{2}}\right),\forall i,j,t.

(7)

The channel gain coefficient $h_{ijkt}$ of an LoS link and a non line-of-sight (NLoS) link between the $k$ -th antenna element of AP $j$ and user $i$ at time slot $t$ can take the form [23]

\begin{array}[]{l}10{\log_{10}}({h_{ijkt}}h_{ijkt}^{\rm{H}})=\left\{{\begin{array}[]{*{20}{l}}{\begin{array}[]{*{20}{l}}{10{\eta_{{\rm{LoS}}}}{{\log}_{10}}({d_{ijt}}(x_{it}^{{\rm{3D}}},v_{j}^{{\rm{3D}}}))}{+20{{\log}_{10}}\left({\frac{{4\pi{f_{c}}}}{c}}\right)+}\\ \qquad\qquad{10{{\log}_{10}}{f_{ijt}}+\mu_{k}^{{\rm{LoS}}},}\end{array}}&{{b_{ijt}}=0}\\ {\begin{array}[]{*{20}{l}}{10{\eta_{{\rm{NLoS}}}}{{\log}_{10}}({d_{ijt}}(x_{it}^{{\rm{3D}}},v_{j}^{{\rm{3D}}}))}{+20{{\log}_{10}}\left({\frac{{4\pi{f_{c}}}}{c}}\right)+}\\ \qquad\qquad{10{{\log}_{10}}{f_{ijt}}+\mu_{k}^{{\rm{NLoS}}},}\end{array}}&{{b_{ijt}}=1}\end{array}}\right.\forall i,j,k,t,\end{array}

(8)

where $f_{c}$ (in Hz) is the carrier frequency, $c$ (in m/s) the light speed, $\eta_{\rm LoS}$ (in dB) and $\eta_{\rm NLoS}$ (in dB) the path-loss exponents of LoS and NLoS links, respectively, $\mu_{{\rm LoS}}\sim{\mathcal{CN}}(0,\sigma_{\rm LoS}^{2})$ (in dB) and $\mu_{{\rm NLoS}}\sim{\mathcal{CN}}(0,\sigma_{\rm NLoS}^{2})$ (in dB).

For any user $i$ , to satisfy its immersive experience requirement, its downlink achievable data rate (denoted by ${r_{it}^{{\rm{dl}}}}$ ) from cooperative APs should be no less than a data rate threshold $\gamma^{\rm th}$ , i.e.,

{r_{it}^{\rm dl}\geq{\gamma^{\rm th}}},\text{ }\forall i,t.

(9)

Define $a_{it}^{\rm dl}\in\{0,1\}$ as an association variable indicating whether the user $i$ ’s data rate requirement can be satisfied at time slot $t$ . $a_{it}^{\rm dl}=1$ indicates that its data rate requirement can be satisfied; otherwise, $a_{it}^{\rm dl}=0$ . Then, for any user $i$ at time slot $t$ , according to Shannon capacity formula and the principle of CoMP transmission, we can calculate ${r_{it}^{\rm dl}}$ by

{r_{it}^{{\rm{dl}}}={W^{{\rm{dl}}}}{{\log}_{2}}\left({1+\frac{{{a_{it}^{\rm dl}}{{|\sum\nolimits_{j\in{\mathcal{J}}}{{\bm{h}_{ijt}^{\rm H}}{\bm{g}_{ijt}}}|}^{2}}}}{{{N_{0}}{W^{{\rm{dl}}}}+I_{it}^{\rm dl}}}}\right)},\text{ }\forall i,t,

(10)

where $\bm{h}_{ijt}=[h_{ij1t},\ldots,h_{ijKt}]^{\rm T}\in{\mathbb{C}}^{K}$ is a channel gain coefficient vector with $K$ denoting the number of antenna elements, $\bm{g}_{ijt}\in{\mathbb{C}}^{K}$ is the transmit beamformer pointed at user $i$ from AP $j$ , $W^{\rm dl}$ represents the downlink system bandwidth. Owing to the directional propagation, for user $i$ , not all users will be its interfering users. It is regarded that users whose distances from user $i$ are small than $D^{\rm th}$ will be user $i$ ’s interfering users, where $D^{\rm th}$ is determined by antenna configuration of APs (e.g., antenna height and downtilt angle). Denote the set of interfering users of user $i$ at time slot $t$ by ${\mathcal{M}}_{it}$ , then, we have $I_{it}^{\rm dl}=\sum\nolimits_{m\in{\mathcal{M}}_{it}}{a_{mt}^{\rm dl}{{{|\sum\nolimits_{j\in{\mathcal{J}}}{{\bm{h}_{mjt}^{\rm H}}{\bm{g}_{mjt}}}|}^{2}}}}$ .

II-B Feeling of presence model

In VR applications, FoP represents an event that does not drag users back from engaging and immersive fictitious environments [24]. For wireless VR, the degrading FoP can be caused by the collection of inaccurate users’ tracking information via APs and the reception of low-quality VR video frames. Therefore, we consider the uplink user tracking information transmission and downlink VR video delivery when modelling the FoP experienced by users. Mathematically, over a period of time slots, we model the FoP experienced by users as the following

\bar{B}(T)=\frac{1}{T}\sum\nolimits_{t=1}^{T}{\left({B_{t}^{\rm ul}\left({\bm{a}_{t}^{\rm ul}}\right)+B_{t}^{\rm dl}\left({\bm{a}_{t}^{\rm dl}}\right)}\right)},

(11)

where ${{B}}_{t}^{{\rm{ul}}}\left({\bm{a}_{t}^{{\rm{ul}}}}\right)=\frac{1}{{{N}}}\sum\limits_{i\in{\mathcal{U}}}{\sum\limits_{j\in{\mathcal{J}}}{a_{ijt}^{{\rm{ul}}}}}$ with $\bm{a}_{t}^{\rm ul}=[a_{11t}^{\rm ul},\ldots,a_{ijt}^{\rm ul},\ldots,a_{NJt}^{\rm ul}]^{\rm T}$ , $B_{t}^{{\rm{dl}}}\left({\bm{a}_{t}^{{\rm{dl}}}}\right)=\frac{1}{N}{\sum\limits_{i\in{{\mathcal{U}}}}{a_{it}^{{\rm{dl}}}}}$ with $\bm{a}_{t}^{\rm dl}=[a_{1t}^{\rm dl},\ldots,a_{it}^{\rm dl},\ldots,a_{Nt}^{\rm dl}]^{\rm T}$ .

II-C Power consumption model

HMDs are generally battery-driven and constrained by the maximum instantaneous power. For any user $i$ ’s HMD, define $p_{it}^{\rm tot}$ as its instantaneous power consumption including the transmit power and circuit power consumption (e.g., power consumption of mixers, frequency synthesizers, and digital-to-analog converters) at time slot $t$ , we then have

{p_{it}^{\rm tot}}\leq{\tilde{p}_{i}},\forall i,t,

(12)

where $p_{it}^{\rm tot}={p_{it}}+p_{i}^{c}$ , $p_{i}^{c}$ denotes the HMD’s circuit power consumption during a time slot, and $\tilde{p}_{i}$ is a constant. Without loss of generality, we assume that all users’ HMDs are homogenous.

The instantaneous power consumption of each AP is also constrained. As CoMP transmission technique is explored, for any AP $j$ , we can model its instantaneous power consumption at time slot $t$ as the following

{\sum\nolimits_{i\in{\mathcal{U}}}{a_{it}^{\rm dl}\bm{g}_{ijt}^{\rm H}\bm{g}_{ijt}}}+E_{j}^{c}\leq\tilde{E}_{j},\forall j,t,

(13)

where $E_{j}^{c}$ is a constant representing the circuit power consumption, $\tilde{E}_{j}$ is the maximum instantaneous power of AP $j$ .

II-D Objective function and problem formulation

To guarantee immersive and interactive VR experiences of users over a period of time slots, uplink user data packets should be successfully decoded, and downlink data rate requirements of users should be satisfied at each time slot; that is, users’ FoP should be maximized. According to (1) and (11), one might believe that increasing the transmit power of users’ HMDs would be an appropriate way of enhancing users’ FoP. However, as users’ HMDs are usually powered by batteries, they are encouraged to work in an energy-efficient mode to prolong their working duration. Further, reducing HMDs’ power consumption indicates less heat generation, which can enhance users’ VR experiences. Therefore, our goal is to maximize users’ FoP while minimizing the power consumption of HMDs over a period of time slots. Combining with the above analysis, we can formulate the problem of enhancing users’ immersive experiences as below


	$\displaystyle\mathop{\rm maximize}\limits_{\{\bm{a}_{t}^{\rm ul},\bm{a}_{t}^{\rm dl},\bm{p}_{t},\bm{g}_{ijt}\}}\text{ }\mathop{\lim\inf}\limits_{T\to\infty}\frac{1}{T}\sum\limits_{t=1}^{T}{\left(B_{t}^{{\rm{ul}}}\left({\bm{a}_{t}^{{\rm{ul}}}}\right)+B_{t}^{{\rm{dl}}}\left({\bm{a}_{t}^{{\rm{dl}}}}\right)\right)}-\frac{1}{T}\sum\limits_{t=1}^{T}{\sum\limits_{i\in{\mathcal{U}}}{\sum\limits_{j\in{\cal J}}{a_{ijt}^{{\rm{ul}}}p_{it}^{{\rm{tot}}}/{{\tilde{p}}_{i}}}}}\allowdisplaybreaks[4]$		(14a)
	$\displaystyle{\rm s.t.}\quad\sum\nolimits_{j\in{\mathcal{J}}}{a_{ijt}^{\rm ul}}\leq 1,\forall i,t\allowdisplaybreaks[4]$		(14b)
	$\displaystyle\qquad\sum\nolimits_{i\in{\mathcal{U}}}{a_{ijt}^{\rm ul}}\leq\tilde{M},\forall j,t$		(14c)
	$\displaystyle\qquad{a_{ijt}^{\rm ul}}\in\{0,1\},\forall i,j,t$		(14d)
	$\displaystyle\qquad{a_{it}^{\rm dl}}\in\{0,1\},\forall i,t$		(14e)
	$\displaystyle\qquad 0\leq p_{it}\leq\tilde{p}_{i}-p_{i}^{c},\forall i,t$		(14f)
	$\displaystyle\qquad\rm{(\ref{eq:SINR_condition}),(\ref{eq:data_rate_condition}),(\ref{eq:power_consumption}),}$		(14g)

where $\bm{p}_{t}=[p_{1t},p_{2t},\ldots,p_{Nt}]^{\rm T}$ .

However, the solution to (14) is highly challenging due to the unknown users’ tracking information at each time slot. Given users’ tracking information, the solution to (14) is still NP-hard or even non-detectable. It can be confirmed that (14) is a mixed-integer non-linear programming (MINLP) problem as it simultaneously contains zero-one variables, continuous variables, and non-linear constraints. Further, we can know that (9) and (13) are non-convex with respect to (w.r.t) $a_{it}^{\rm dl}$ and ${\bm{g}}_{ijt}$ , $\forall i$ , $j$ , by evaluating the Hessian matrix. To tackle the tricky problem, we develop a novel solution framework as depicted in Fig. 4. In this framework, we first propose to predict users’ tracking information using a machine learning method. With the predicted results, we then develop a DRL-based optimization algorithm to solve the MINLP problem. The procedure of solving (14) is elaborated in the following sections.

III Users’ Location Prediction

As analyzed above, the efficient user-AP association and transmit power of both HMDs and APs are configured on the basis of the accurate perception of users’ tracking information. If the association and transmit power are identified without knowledge of users’ tracking information, users may have degrading VR experiences, and the working duration of users’ HMDs may be dramatically shortened. Meanwhile, owing to the stringent low latency requirement, the user-AP association and transmit power should be proactively determined to enhance users’ immersive and interactive VR experiences. Hence, APs must collect fresh and historical tracking information for users’ tracking information prediction in future time slots. With predicted tracking information, the user-AP association and transmit power can be configured in advance. Certainly, from (7), we observe that users’ orientation angles can be obtained by their and APs’ locations; thus, we only predict users’ locations in this section. Machine learning is convinced as a promising proposal to predict users’ locations. In machine learning methods, the accuracy and completeness of sample collection are crucial for accurate model training. However, the user-AP association may vary with user movement, which indicates that location information of each user may scatter in multiple APs, and each AP may only collect partial location information of its associated users after a period of time. To tackle this issue, we develop a parallel machine learning method, which exploits $J$ slave virtual machines (VMs) created in the CU to train learning models for each user, as shown in Fig. 4. Besides, for each AP, it will feed its locally collected location information to a slave VM for training. In this way, the prediction process can also be accelerated. With the predicted results, the CU can then proactively allocate system resources by solving (14).

III-A Echo state network

In this section, the principle of echo state network (ESN) is exploited to train users’ location prediction model as the ESN method can efficiently analyze the correlation of users’ location information and quickly converge to obtain users’ predicted locations [25]. It is noteworthy that there are some differences between the traditional ESN method and the developed parallel ESN learning method. The traditional ESN method is a centralized learning method with the requirement of the aggregation of all users’ locations scattered in all APs, which is not required for the parallel ESN learning method. What’s more, the traditional ESN method can only be used to conduct data prediction in a time slot while the parallel ESN learning method can predict users’ locations in $M>1$ time slots. An ESN is a recurrent neural network that can be partitioned into three components: input, ESN model, and output, as shown in Fig. 4. For any user $i\in{\mathcal{U}}$ , the $N_{i}$ -dimensional input vector $\bm{x}_{it}\in{\mathbb{R}}^{N_{i}}$ is fed to an $N_{r}$ -dimensional reservoir whose internal state $\bm{s}_{i(t-1)}\in{\mathbb{R}}^{N_{r}}$ is updated according to the state equation

\bm{s}_{it}=\tanh\left(\bm{W}_{\rm in}^{r}\bm{x}_{it}+\bm{W}_{\rm r}^{r}\bm{s}_{i(t-1)}\right),

(15)

where $\bm{W}_{\rm in}^{r}\in{\mathbb{R}}^{N_{r}\times N_{i}}$ and $\bm{W}_{\rm r}^{r}\in{\mathbb{R}}^{N_{r}\times N_{r}}$ are randomly generated matrices with each matrix element locating in the interval $(0,1)$ .

The evaluated output of the ESN at time slot $t$ is given by

\hat{\bm{y}}_{i(t+1)}=\bm{W}_{\rm in}^{o}\bm{x}_{it}+\bm{W}_{\rm r}^{o}\bm{s}_{it},

(16)

where $\bm{W}_{\rm in}^{o}\in{\mathbb{R}}^{N_{o}\times N_{i}}$ , $\bm{W}_{\rm r}^{o}\in{\mathbb{R}}^{N_{o}\times N_{r}}$ are trained based on collected training data samples.

To train the ESN model, suppose we are provided with a sequence of $Q$ desired input-outputs pairs $\{(\bm{x}_{i1},{\bm{y}}_{i1}),\ldots,(\bm{x}_{iQ},{\bm{y}}_{iQ})\}$ of user $i$ , where $\bm{y}_{it}\in{\mathbb{R}}^{N_{o}}$ is the target location of user $i$ at time slot $t$ . Define the hidden matrix $\bm{X}_{it}$ as

{\bm{X}_{it}}=\left[{\begin{array}[]{*{20}{c}}\begin{array}[]{l}{\bm{x}_{i1}}\\ {\bm{s}_{i1}}\end{array}&\cdots&\begin{array}[]{l}{\bm{x}_{iQ}}\\ {\bm{s}_{iQ}}\end{array}\end{array}}\right].

(17)

The optimal output weight matrix is then achieved by solving the following regularized least-square problem

\bm{W}_{t}^{\star}=\mathop{\arg\min}\limits_{\bm{W}_{t}\in{{\mathbb{R}}^{({N_{i}}+{N_{r}})\times{N_{o}}}}}\text{ }\frac{1}{Q}l\left(\bm{X}_{it}^{\rm T}\bm{W}_{t}\right)+\xi r(\bm{W}_{t})

(18)

where $\bm{W}_{t}=[\bm{W}_{\rm in}^{o}\bm{W}_{\rm r}^{o}]^{\rm T}$ , $\xi\in{\mathbb{R}}_{+}$ is a positive scalar known as regularization factor, the loss function $l(\bm{X}_{it}^{\rm T}\bm{W}_{t})=\frac{1}{2}||{\bm{X}_{it}^{\rm T}\bm{W}_{t}-\bm{Y}_{it}}||_{F}^{2}$ , the regulator $r(\bm{W}_{t})=||\bm{W}_{t}||_{F}^{2}$ , and the target location matrix $\bm{Y}_{it}=[\bm{y}_{i1}^{\rm T};\ldots;\bm{y}_{iQ}^{\rm T}]\in{\mathbb{R}}^{Q\times N_{o}}$ .

III-B Parallel ESN learning method for users’ location prediction

Based on the principle of the ESN method, we next elaborate on the procedure of the parallel ESN learning method for users’ location prediction. To facilitate the analysis, we make the following assumptions on the regulator and the loss function.

Assumption 1.

The function $r:{\mathbb{R}}^{m\times n}\to{\mathbb{R}}$ is $\zeta$ -strongly convex, i.e., $\forall i\in\{1,2,\ldots,n\}$ , $\forall\bm{X}$ , and $\Delta\bm{X}\in{\mathbb{R}}^{m\times n}$ , we have [26]

r(\bm{X}+\Delta\bm{X})\geq r(\bm{X})+\nabla r(\bm{X})\odot\Delta\bm{X}+{\zeta}||\Delta\bm{X}||_{F}^{2}/{2},

(19)

where $\nabla r(\cdot)$ denotes the gradient of $r(\cdot)$ .

Assumption 2.

The function $l:{\mathbb{R}}\to{\mathbb{R}}$ are $\frac{1}{\mu}$ -smooth, i.e., $\forall i\in\{1,2,\ldots,n\}$ , $\forall x$ , and $\Delta x\in{\mathbb{R}}$ , we have

l(x+\Delta x)\leq l(x)+\nabla l(x)\Delta x+(\Delta x)^{2}/{2\mu},

(20)

where $\nabla l(\cdot)$ represents the gradient of $l(\cdot)$ .

According to Fenchel-Rockafeller duality, we can formulate the local dual optimization problem of (18) in the following way.

Lemma 1.

For a set of $J$ slave VMs and a typical user $i$ , the dual problem of (18) can be written as follows

\mathop{\rm maximize}\limits_{\bm{A}\in{{\mathbb{R}}^{Q\times{N_{o}}}}}\text{ }\left\{{-\xi r^{\star}\left({\frac{1}{{\xi Q}}{\bm{A}^{\rm T}}{\bm{X}^{\rm T}}}\right)-\frac{1}{Q}\sum\limits_{m=1}^{Q}{\sum\limits_{n=1}^{{N_{o}}}{l^{\star}(-{a_{mn}})}}}\right\}

(21)

where

{r^{\star}}(\bm{C})=\frac{1}{4}{\sum\nolimits_{n=1}^{N_{o}}{\bm{z}_{n}^{\rm T}\bm{C}\bm{C}^{\rm T}\bm{z}_{n}}},

(22)

l^{\star}(-a_{mn})={-a_{mn}y_{mn}+a_{mn}^{2}/{2}},

(23)

$\bm{A}\in{\mathbb{R}}^{Q\times N_{o}}$ is a Lagrangian multiplier matrix, $\bm{z}_{n}\in{\mathbb{R}}^{N_{o}}$ is a column vector with the $n$ -th element being one and all other elements being zero, $\bm{X}$ is a lightened notation of ${\bm{X}_{it}}=$ $\left[{\begin{array}[]{*{20}{c}}\begin{array}[]{l}{\bm{x}_{i(t-1)}}\\ {\bm{s}_{i(t-1)}}\end{array}&\cdots&\begin{array}[]{l}{\bm{x}_{i(t-Q)}}\\ {\bm{s}_{i(t-Q)}}\end{array}\end{array}}\right]$ , and $y_{mn}$ is an element of matrix ${\bm{Y}}=[\bm{y}_{it}^{\rm T};\ldots;\bm{y}_{i(t-Q+1)}^{\rm T}]$ at the location of the $m$ -th row and the $n$ -th column.

Proof.

Please refer to Appendix A. ∎

Denote the objective function of (21) as $D(\bm{A})$ , and define $\bm{V}(\bm{A}):=\frac{1}{\xi Q}{(\bm{X}\bm{A})^{\rm T}}\in{\mathbb{R}}^{N_{o}\times(N_{i}+N_{r})}$ , we can then rewrite $D(\bm{A})$ as

D(\bm{A})=-\xi r^{\star}(\bm{V}(\bm{A}))-\sum\nolimits_{j=1}^{J}{R_{j}(\bm{A}_{[j]})},

(24)

where $R_{j}(\bm{A}_{[j]})=\frac{1}{Q}\sum\limits_{m\in{\mathcal{Q}}_{j}}{\sum\limits_{{n}=1}^{N_{o}}{l^{\star}(-{a_{mn}})}}$ , $\bm{A}_{[j]}=\hat{\bm{Z}}_{j}\bm{A}$ , and $\hat{\bm{Z}}_{j}\in{\mathbb{R}}^{Q\times Q}$ is a square matrix with $J\times J$ blocks. In $\hat{\bm{Z}}_{j}$ , the block in the $j$ -th row and $j$ -th column is a $q_{j}\times q_{j}$ identity matrix with $q_{j}$ being the cardinality of a set ${\mathcal{Q}}_{j}$ and all other blocks are zero matrices, ${\mathcal{Q}}_{j}$ is an index set including the indices of $Q$ data samples fed to slave VM $j$ .

Then, for a given matrix ${\bm{A}}^{t}$ , varying its value by $\Delta{\bm{A}}^{t}$ will change (24) as below

\begin{array}[]{l}D({\bm{A}}^{t}+\Delta{\bm{A}}^{t})=-\xi r^{\star}(\bm{V}({\bm{A}}^{t}+\Delta{\bm{A}}^{t}))-\sum\nolimits_{j=1}^{J}{R_{j}({\bm{A}}_{[j]}^{t}+\Delta{\bm{A}}_{[j]}^{t})},\end{array}

(25)

where $\Delta{\bm{A}}_{[j]}^{t}=\hat{\bm{Z}}_{j}\Delta{\bm{A}}^{t}$ .

Note that the second term of the right-hand side (RHS) of (25) includes the local changes of each VM $j$ , while the first term involves the global variations.

As $r(\cdot)$ is $\zeta$ -strongly convex, $r^{\star}(\cdot)$ is then $\frac{1}{\zeta}$ -smooth [26]. Thus, we can calculate the upper bound of $r^{\star}(V({\bm{A}}^{t}+\Delta{\bm{A}}^{t}))$ as follows

\begin{array}[]{l}r^{\star}(V({\bm{A}}^{t}+\Delta{\bm{A}}^{t}))\leq r^{\star}\left({\bm{V}({\bm{A}}^{t})}\right)+\frac{1}{\xi Q}\sum\limits_{n=1}^{{N_{o}}}\bm{z}_{n}^{\rm T}{\nabla r^{\star}(\bm{V}({\bm{A}}^{t}))\bm{X}\Delta{\bm{A}}^{t}{\bm{z}_{n}}}+\frac{\kappa}{{2{{\left({\xi Q}\right)}^{2}}}}\sum\limits_{n=1}^{{N_{o}}}{{{\left\|{\bm{X}\Delta{\bm{A}}^{t}{\bm{z}_{n}}}\right\|}^{2}}}\\ =r^{\star}\left({\bm{V}({\bm{A}}^{t})}\right)+\frac{1}{\xi Q}\sum\limits_{j=1}^{J}{\sum\limits_{n=1}^{{N_{o}}}\bm{z}_{n}^{\rm T}{\nabla r^{\star}(\bm{V}({\bm{A}}^{t}))\bm{X}_{[j]}\Delta{\bm{A}}_{[j]}^{t}{\bm{z}_{n}}}}+\frac{\kappa}{{2{{\left({\xi Q}\right)}^{2}}}}\sum\limits_{j=1}^{J}{\sum\limits_{n=1}^{{N_{o}}}{{{\left\|{\bm{X}_{[j]}\Delta{\bm{A}}_{[j]}^{t}{\bm{z}_{n}}}\right\|}^{2}}}},\end{array}

(26)

where $\bm{X}_{[j]}=\bm{X}\hat{\bm{Z}}_{j}$ , $\kappa>\frac{1}{\zeta}$ is a data dependent constant measuring the difficulty of the partition to the whole samples.

By substituting (26) into (25), we obtain

\begin{array}[]{l}D({\bm{A}}^{t}+\Delta{\bm{A}}^{t})\geq-{\xi}r^{\star}\left({\bm{V}({\bm{A}}^{t})}\right)-\frac{1}{Q}\sum\limits_{j=1}^{J}{\sum\limits_{n=1}^{{N_{o}}}\bm{z}_{n}^{\rm T}{\nabla r^{\star}(\bm{V}({\bm{A}}^{t}))\bm{X}_{[j]}\Delta{\bm{A}}_{[j]}^{t}{\bm{z}_{n}}}}\\ \qquad\qquad\qquad\quad-\frac{\kappa}{{2\xi{{{Q}}^{2}}}}\sum\limits_{j=1}^{J}{\sum\limits_{n=1}^{{N_{o}}}{{{\left\|{\bm{X}_{[j]}\Delta{\bm{A}}_{[j]}^{t}{\bm{z}_{n}}}\right\|}^{2}}}}-\sum\limits_{j=1}^{J}{R_{j}({\bm{A}}_{[j]}^{t}+\Delta{\bm{A}}_{[j]}^{t})}.\end{array}

(27)

From (27), we observe that the problem of maximizing $D({\bm{A}}^{t}+\Delta{\bm{A}}^{t})$ can be decomposed into $J$ subproblems, and $J$ slave VMs can then be exploited to optimize these subproblems separately. If slave VM $j$ can optimize $\Delta{\bm{A}}^{t}$ using its collected data samples by maximizing the RHS of (27), the resultant improvements can be aggregated to drive $\bm{D}({\bm{A}}^{t})$ toward the optimum. The detailed procedure is described below.

As shown in Fig. 4, during any communication round $t$ , a master VM produces $\bm{V}(\bm{A}^{t})$ using updates received at the last round and shares it with all slave VMs. The task at any slave VM $j$ is to obtain $\Delta\bm{A}_{[j]}^{t}$ by maximizing the following problem

\begin{array}[]{l}\Delta{\bm{A}_{[j]}^{t\star}}=\mathop{\arg\max}\limits_{\Delta{\bm{A}_{[j]}^{t}}\in{{\mathbb{R}}^{Q\times{N_{o}}}}}\text{ }\Delta{D_{j}}\left({\Delta{\bm{A}_{[j]}^{t}};\bm{V}({\bm{A}^{t}}),{\bm{A}_{[j]}^{t}}}\right)\\ =\mathop{\arg\max}\limits_{\Delta{\bm{A}_{[j]}^{t}}\in{{\mathbb{R}}^{Q\times{N_{o}}}}}\left\{{-{R_{j}}\left({{\bm{A}_{[j]}^{t}}+\Delta{\bm{A}_{[j]}^{t}}}\right)-\frac{\xi}{J}r^{\star}(\bm{V}({\bm{A}^{t}}))}\right.\left.-\frac{1}{Q}\sum\limits_{n=1}^{{N_{o}}}\bm{z}_{n}^{\rm T}{\nabla r^{\star}(\bm{V}({\bm{A}^{t}}))\bm{X}_{[j]}\Delta{\bm{A}_{[j]}^{t}}{\bm{z}_{n}}}-\right.\\ \quad\left.\frac{\kappa}{{2\xi{Q^{2}}}}\sum\limits_{n=1}^{{N_{o}}}{{{\left\|{\bm{X}_{[j]}\Delta{\bm{A}_{[j]}^{t}}{\bm{z}_{n}}}\right\|}^{2}}}\right\}.\end{array}

(28)

Calculate the derivative of $\Delta{D_{j}}\left({\Delta{\bm{A}_{[j]}^{t}};\bm{V}({\bm{A}^{t}}),{\bm{A}_{[j]}^{t}}}\right)$ over $\Delta{\bm{A}_{[j]}^{t}}$ , and force the derivative result to be zero, we have

\begin{array}[]{l}\Delta{\bm{A}}_{[j]}^{t\star}={\left({{{{{\hat{\bm{Z}}_{j}}}}}+\frac{\kappa}{{\xi Q}}\bm{X}_{[j]}^{\rm T}{\bm{X}_{[j]}}}\right)^{-1}}\left({{\bm{Y}_{[j]}}-\bm{A}_{[j]}^{t}-\frac{1}{2}\bm{X}_{[j]}^{\rm T}{\bm{V}^{\rm T}}({\bm{A}_{t}})}\right),\end{array}

(29)

where $\bm{Y}_{[j]}=\hat{\bm{Z}}_{j}\bm{Y}$ .

Next, slave VM $j$ , $\forall j$ , sends $\Delta\bm{V}_{[j]}^{t}=\frac{1}{{\xi Q}}{{{({\bm{X}_{[j]}\Delta{\bm{A}_{[j]}^{t\star}}})}^{\rm T}}}$ to the master VM. The master VM updates the global model as $\bm{V}(\bm{A}^{t}+\Delta\bm{A}^{t})=\bm{V}({\bm{A}^{t}})+\sum\nolimits_{j=1}^{J}{\Delta\bm{V}_{[j]}^{t}}$ . Finally, alteratively update $\bm{V}(\bm{A}^{t})$ and $\{\Delta\bm{A}_{[j]}^{t\star}\}_{j=1}^{J}$ on the global and local sides, respectively. It is expected that the solution to the dual problem can be enhanced at every step and will converge after several iterations.

At time slot $t$ , based on the above derivation, the parallel ESN learning method for predicting locations of user $i$ , $\forall i$ , in $M$ time slots can be summarized in Algorithm 1.

Algorithm 1 Parallel ESN learning for user location prediction

1: Initialization: Data samples of all slave VMs. For any slave VM

j

, it randomly initiates a starting point

\bm{A}_{[j]}^{0}\in{\mathbb{R}}^{Q\times N_{o}}

. The master VM collects

\frac{1}{\xi Q}{(\bm{X}_{[j]}\bm{A}_{[j]}^{0})^{\rm T}}

from all slave VMs, generates

\bm{V}(\bm{A}^{0})=\sum_{j=1}^{J}{\frac{1}{\xi Q}{(\bm{X}_{[j]}\bm{A}_{[j]}^{0})^{\rm T}}}

, and then share the model

\bm{V}(\bm{A}^{0})

with all slave VMs. Let

\kappa=J/{\zeta}

2: for

r=0:\bar{r}_{\rm max}-1

3: for each slave VM

j\in\{1,2,\ldots,J\}

in parallel do

4: Calculate

\Delta{\bm{A}}_{[j]}^{r\star}

using (29), update and store the local Lagrangian multiplier

\bm{A}_{[j]}^{r+1}=\bm{A}_{[j]}^{r}+\Delta\bm{A}_{[j]}^{r\star}/({r+1}).

(30)

5: Compute the following local model and send it to the master VM

\Delta\bm{V}_{[j]}^{r}={{\left({\bm{X}_{[j]}\Delta{\bm{A}_{[j]}^{r\star}}}\right)^{\rm T}}}/{{\xi Q}}.

(31)

6: end for

7: Given local models, the master VM updates the global model as

\bm{V}(\bm{A}^{r+1})=\bm{V}(\bm{A}^{r})+\sum\nolimits_{j=1}^{J}{\Delta\bm{V}_{[j]}^{r}},

(32)

and then share the updated global model

\bm{V}(\bm{A}^{r+1})

with all slave VMs.

8: end for

9: Let

\bm{W}^{\rm T}={\nabla r^{\star}(\bm{V}({\bm{A}^{r}}))}

and predict user

i

’s location

\hat{\bm{y}}_{it}

by (16). Meanwhile, by iteratively assigning

\bm{x}_{i(t+1)}=\hat{\bm{y}}_{it}

, each user

i

’s locations in

M

time slots can be obtained.

10: Output: The predicted locations of user

i

\hat{\bm{Y}}_{it}=[\hat{\bm{y}}_{i(t+1)}^{\rm T};\ldots;\hat{\bm{y}}_{i(t+M)}^{\rm T}]

\forall i

IV DRL-based Optimization Algorithm

Given the predicted locations of all users, it is still challenging to solve the original problem owing to its non-linear and mixed-integer characteristics. Alternative optimization is extensively considered as an effective scheme of solving MINLP problems. Unfortunately, the popular alternative optimization scheme cannot be adopted in this paper. This is because the alternative optimization scheme is of often high computational complexity, and the original problem is also a sequential decision problem requiring an MINLP problem to be solved at each time slot. Remarkably, calling an optimization scheme with a high computational complexity at each time slot is unacceptable for latency-sensitive VR applications.

Reinforcement learning methods can be explored to solve sequential decision problems. For example, the works in [27, 28] proposed reinforcement learning methods to solve sequential decision problems with a discrete decision space and a continuous decision space, respectively. However, how to solve sequential decision problems simultaneously involving discrete and continuous decision variables (e.g., the problem (14)) is a significant and understudied problem.

In this paper, we propose a deep reinforcement learning (DRL)-based optimization algorithm to solve (14). Specifically, we design a DNN joint with an action quantization scheme to produce a set of association actions of high diversity. Given the association actions, a continuous optimization problem is solved to criticize them and optimize the continuous variables. The detailed procedure is presented in the following subsections.

IV-A Vertical decomposition

Define a vector $\bm{g}_{it}=[\bm{g}_{i1t};\ldots;\bm{g}_{ijt};\ldots;\bm{g}_{iJt}]\in{\mathbb{C}}^{JK}$ and a vector $\bm{h}_{it}=[f_{i1t}\bm{h}_{i1t};\ldots;f_{ijt}\bm{h}_{ijt};$ $\ldots;f_{iJt}\bm{h}_{iJt}]\in{\mathbb{C}}^{JK}$ , $\forall i$ , $t$ . Let matrix $\bm{G}_{it}=\bm{g}_{it}\bm{g}_{it}^{\rm T}$ and matrix $\bm{H}_{it}=\bm{h}_{it}\bm{h}_{it}^{\rm T}$ . As ${\rm tr}({\bm{AB}})={\rm tr}({\bm{BA}})$ for matrices $\bm{A}$ and $\bm{B}$ of compatible dimensions, the signal power received by user $i\in{\mathcal{U}}$ can be expressed as ${{{{|\sum\nolimits_{j\in{\mathcal{J}}}{{f_{it}}{\bm{h}_{it}^{\rm T}}{\bm{g}_{ijt}}}|}^{2}}}}={|{{\bm{h}}_{it}^{\rm T}{{\bm{g}}_{it}}}|^{2}}={\left({{\bm{h}}_{it}^{\rm T}{{\bm{g}}_{it}}}\right)^{\rm T}}{\bm{h}}_{it}^{\rm T}{{\bm{g}}_{it}}={\rm tr}({\bm{g}}_{it}^{\rm T}{{\bm{h}}_{it}}{\bm{h}}_{it}^{\rm T}{{\bm{g}}_{it}})={\rm tr}({{\bm{h}}_{it}}{\bm{h}}_{it}^{\rm T}{{\bm{g}}_{it}}{\bm{g}}_{it}^{\rm T})={\rm tr}({{\bm{H}}_{it}}{{\bm{G}}_{it}})$ . Likewise, by introducing a square matrix $\bm{Z}_{j}\in{\mathbb{R}}^{JK\times JK}$ with $J\times J$ blocks, the transmit power for serving users can be written as ${{\bm{g}}_{ijt}^{\rm T}{{\bm{g}}_{ijt}}}={\rm tr}({\bm{Z}}_{j}{{\bm{G}}_{it}})$ . Besides, each block in $\bm{Z}_{j}$ is a $K\times K$ matrix. In $\bm{Z}_{j}$ , the block in the $j$ -th row and $j$ -th column is a $K\times K$ identity matrix, and all other blocks are zero matrices. Then, by applying ${\bm{G}_{it}}={\bm{g}_{it}}{\bm{g}}_{it}^{\rm T}$ $\Leftrightarrow{\bm{G}_{it}}\succeq 0$ and ${{\rm rank}({\bm{G}_{it}})\leq 1}$ , we can convert (14) to the following problem


	$\displaystyle\mathop{{\rm{maximize}}}\limits_{\{\bm{a}_{t}^{\rm ul},\bm{a}_{t}^{\rm dl},\bm{p}_{t},\bm{G}_{it}\}}\text{ }\bar{B}(T)-\frac{1}{T}\sum\limits_{t=1}^{T}{\sum\limits_{i\in{\mathcal{U}}}{\sum\limits_{j\in{\mathcal{J}}}{a_{ijt}^{{\rm{ul}}}p_{it}^{{\rm{tot}}}/{{\tilde{p}}_{i}}}}}$		(33a)
	$\displaystyle{\rm s.t.}\quad{\log_{2}}\left({1+\frac{{a_{it}^{{\rm{dl}}}{\rm{tr}}({\bm{H}_{it}}{\bm{G}_{it}})}}{{{N_{0}}{W^{{\rm{dl}}}}+\sum\nolimits_{m\in{{\mathcal{M}}_{it}}}{a_{mt}^{{\rm{dl}}}{\rm{tr}}({\bm{H}_{mt}}{\bm{G}_{mt}})}}}}\right)\geq{{\gamma^{{\rm{th}}}}}/{{W^{{\rm{dl}}}}},\forall i,t$		(33b)
	$\displaystyle\qquad\sum\nolimits_{i\in{\mathcal{U}}}{a_{it}^{{\rm{dl}}}{\rm{tr}}({\bm{Z}_{j}}{\bm{G}_{it}})}+{{\tilde{E}}_{j}}\leq{E_{j}},\forall j,t$		(33c)
	$\displaystyle\qquad{\bm{G}_{it}}\succeq 0,\forall i,t$		(33d)
	$\displaystyle\qquad{\rm{rank}}({\bm{G}_{it}})\leq 1,\forall i,t$		(33e)
	$\displaystyle\qquad\rm{(\ref{eq:SINR_condition}),(\ref{eq:original_problem_p0}b)-(\ref{eq:original_problem_p0}f).}$		(33f)

Like (14), (33) is difficult to be directly solved; thus, we first vertically decompose it into the following two subproblems.

•

Uplink optimization subproblem: The uplink optimization subproblem is formulated as


	$\displaystyle\mathop{{\rm{maximize}}}\limits_{\{\bm{a}_{t}^{\rm ul},\bm{p}_{t}\}}\quad\frac{1}{T}\sum\limits_{t=1}^{T}\left({B_{t}^{\rm ul}\left({\bm{a}_{t}^{\rm ul}}\right)-\sum\limits_{i\in{\mathcal{U}}}{\sum\limits_{j\in{\mathcal{J}}}{a_{ijt}^{{\rm{ul}}}p_{it}^{{\rm{tot}}}/{{\tilde{p}}_{i}}}}}\right)$		(34a)
	$\displaystyle{\rm s.t.}\text{ }\quad\rm{(\ref{eq:SINR_condition}),(\ref{eq:original_problem_p0}b)-(\ref{eq:original_problem_p0}d),(\ref{eq:original_problem_p0}f).}$		(34b)

•

Downlink optimization subproblem: The downlink optimization subproblem can be formulated as follows


	$\displaystyle\mathop{{\rm{maximize}}}\limits_{\{\bm{a}_{t}^{\rm dl},\bm{G}_{it}\}}\quad\frac{1}{T}\sum\limits_{t=1}^{T}{B_{t}^{\rm dl}\left({\bm{a}_{t}^{\rm dl}}\right)}$		(35a)
	$\displaystyle{\rm s.t.}\text{ }\quad\rm{(\ref{eq:original_problem_p0}e),(\ref{eq:transformed_problem}b)-(\ref{eq:transformed_problem}e).}$		(35b)

Next, we propose to solve the two subproblems separately by exploring DRL approaches.

IV-B Solution to the uplink optimization subproblem

(34) is confirmed to be a mixed-integer and sequence-dependent optimization subproblem. Fig. 5 shows a DRL approach of solving (34).

In this figure, a DNN is trained to produce continuous actions. The continuous actions are then quantized into a group of association (or discrete) actions. Given the association actions, we solve an optimization problem to select an association action maximizing the reward. Next, we describe the designing process of solving (34) using a DRL-based uplink optimization method in detail.

IV-B1 Action, state, and reward design

First, we elaborate on the design of the state space, action space, and reward function of the DRL-based method. The HMDs’ transmit power and the varying channel gains caused by users’ movement and/or time-varying wireless channel environments have a significant impact on whether uplink transmission signals can be successfully decoded by APs. In addition, each AP has a limited ability to decode uplink transmission signals simultaneously. Therefore, we design the state space, action space, and reward function of the DRL-based method as the following.

•

state space $\bm{s}_{t}^{\rm ul}$ : $\bm{s}_{t}^{\rm ul}=[{\bm{m}}_{t};\hat{\bm{h}}_{t}^{\rm ul};{\bm{p}}_{t}]$ is a column vector, where $m_{jt}\in{\bm{m}}_{t}\in{\mathbb{R}}^{J}$ , $\forall j$ , denotes the number of users successfully access to AP $j$ at time slot $t$ . Besides, the state space involves the path-loss from user $i$ to AP $j$ , $\hat{h}_{ijt}\in\hat{\bm{h}}_{t}^{\rm ul}\in{\mathbb{R}}^{NJ}$ , $\forall i$ , $j$ , $t$ , and the transmit power of user $i$ ’s HMD at time slot $t$ , $p_{it}\in{\bm{p}}_{t}\in{\mathbb{R}}^{N}$ , $\forall i$ , $t$ .
•

action space $\bm{a}_{t}^{\rm ul}$ : $\bm{a}_{t}^{\rm ul}=[a_{11t}^{\rm ul},\ldots,a_{1Jt}^{\rm ul},\ldots,a_{NJt}^{\rm ul}]^{\rm T}\in{\mathbb{R}}^{NJ}$ with $a_{ijt}^{\rm ul}\in\{0,1\}$ . The action of the DRL-based method is to deliver users’ data information to associated APs.

•

reward $r_{t}^{\rm ul}$ : given $\bm{a}_{t}^{\rm ul}$ , the reward $r_{t}^{\rm ul}$ is the objective function value of the following power control subproblem.


	$\displaystyle r_{t}^{\rm ul}=\mathop{{\rm{maximize}}}\limits_{{\bm{p}_{t}}}\text{ }B_{t}^{\rm ul}({\bm{a}}_{t}^{\rm ul})-\sum\nolimits_{i\in{\mathcal{U}}}{\sum\nolimits_{j\in{\mathcal{J}}}{a_{ijt}^{{\rm{ul}}}p_{it}^{{\rm{tot}}}/{{\tilde{p}}_{i}}}}\allowdisplaybreaks[4]$		(36a)
	$\displaystyle{\rm s.t.}\text{ }\quad\rm{(\ref{eq:SINR_condition}),(\ref{eq:original_problem_p0}f).}$		(36b)

IV-B2 Training process of the DNN

For the DNN module $\bar{\bm{a}}_{t}^{\rm ul}=\mu({\bm{s}}_{t}^{\rm ul}|\theta_{t}^{\mu})$ shown in Fig. 5, where $\bar{\bm{a}}_{t}^{\rm ul}=[\bar{\bm{a}}_{1t}^{\rm ul};\ldots;\bar{\bm{a}}_{Nt}^{\rm ul}]$ and $\theta_{t}^{\mu}$ represents network parameters, we explore a two-layer fully-connected feedforward neural network with network parameters being initialized by a Xavier initialization scheme. There are $N_{1}^{\mu}$ and $N_{2}^{\mu}$ neurons in the $1^{\rm st}$ and $2^{\rm nd}$ hidden layers of the constructed DNN, respectively. Here, we adopt the ReLU function as the activation function in these hidden layers. For the output layer, a sigmoid activation function is leveraged such that relaxed association variables satisfy $\bar{a}_{ijt}^{\rm ul}\in(0,1)$ . In the action-exploration phase, the exploration noise $\epsilon{N_{f}}$ is added to the output layer of the DNN, where $\epsilon\in(0,1)$ decays over time and ${N_{f}}\sim{\mathcal{N}}(0,\sigma^{2})$ .

To train the DNN effectively, the experience replay technique is exploited. This is because there are two special characteristics in the process of enhancing users’ fictitious experiences: 1) the collected input state values $\bm{s}_{t}^{\rm ul}$ incrementally arrive as users move to new positions, instead of all made available at the beginning of the training; 2) APs consecutively collect state values indicating that the collected state values may be closely correlated. The DNN may oscillate or diverge without breaking the correlation among the input state values. Specifically, at each training epoch $t$ , a new training sample $(\bm{s}_{t}^{\rm ul},{\bm{a}}_{t}^{\rm ul},\bm{s}_{t+1}^{\rm ul})$ is added to the replay memory. When the memory is filled, the newly generated sample replaces the oldest one. We randomly choose a minibatch of training samples $\{(\bm{s}_{\tau}^{\rm ul},{\bm{a}}_{\tau}^{\rm ul},\bm{s}_{\tau+1}^{\rm ul})|\tau\in{\mathcal{T}}_{t}\}$ from the replay memory, where ${\mathcal{T}}_{t}$ is a set of training epoch indices. The network parameters $\theta_{t}^{\mu}$ are trained using the ADAM method [29] to reduce the averaged cross-entropy loss

\begin{array}[]{l}L(\theta_{t}^{\mu})=-\frac{1}{{|{{\mathcal{T}}_{t}}|}}\sum\nolimits_{\tau\in{{\mathcal{T}}_{t}}}({({\bm{a}}_{\tau}^{\rm ul})^{\rm T}}\log{\bar{\bm{a}}_{\tau}^{\rm ul}}+(1-{{\bm{a}}_{\tau}^{\rm ul}})^{\rm T}\log(1-{\bar{\bm{a}}_{\tau}^{\rm ul}})).\end{array}

(37)

As evaluated in the simulation, we can train the DNN every $T_{ti}$ epochs after collecting a sufficient number of new data samples.

IV-B3 Action quantization and selection method

In the previous subsection, we design a continuous policy function and generate a continuous action space. However, a discrete action space is required in this paper. To this aim, the generated continuous action should be quantized, as shown in Fig. 5. A quantized action will directly determine the feasibility of the optimization subproblem and then the convergence performance of the DRL-based optimization method. To improve the convergence performance, we should increase the diversity of the quantized action set, which including all quantized actions. Specifically, we quantize the continuous action $\bar{\bm{a}}_{t}^{\rm ul}$ to obtain $\tilde{V}\in\{1,2,\ldots,2^{N}\}$ groups of association actions and denote by ${\bar{\bm{a}}_{t;v}^{\rm ul}}$ the $v$ -th group of actions. Given ${\bar{\bm{a}}_{it;v}^{\rm ul}}$ , (36) is reduced to a linear programming problem, and we can derive its closed-form solution as below

{p_{it}}=\left\{{\begin{array}[]{*{20}{l}}{\sum\limits_{j}{\frac{{a_{ijt}^{{\rm{ul}}}{\theta^{{\rm{th}}}}{N_{0}}{W^{{\rm{ul}}}}}}{{N{f_{i}}{{\hat{h}}_{ijt}}}},}}&{\sum\limits_{j}{\frac{{a_{ijt}^{{\rm{ul}}}{\theta^{{\rm{th}}}}{N_{0}}{W^{{\rm{ul}}}}}}{{N{f_{i}}{{\hat{h}}_{ijt}}}}}\leq{{\tilde{p}}_{i}}-p_{i}^{c},}\\ {0,}&{{\rm otherwise.}}\end{array}}\right.

(38)

Besides, a great $\tilde{V}$ will result in higher diversity in the quantized action set but a higher computational complexity, and vice versa. To balance the performance and complexity, we set $\tilde{V}=N$ and propose a lightweight action quantization and selection method. The detailed steps of quantizing and selecting association actions are given in Algorithm 2.

Algorithm 2 Action quantization and selection

1: Input: The output action of the uplink DNN,

{\bar{\bm{a}}_{t}^{\rm ul}}

2: Arrange

{\bar{\bm{a}}_{t}^{\rm ul}}

as a matrix of size

N\times J

and generate a vector

\hat{\bm{a}}_{t}^{\rm ul}=\left\{{\max[\bar{a}_{i1t}^{\rm ul},\ldots,\bar{a}_{iJt}^{\rm ul}],\forall i}\right\}

3: Generate the reference action vector

\bar{\bm{b}}_{t}=[\bar{b}_{1t},\ldots,\bar{b}_{vt},\ldots,\bar{b}_{\tilde{V}t}]^{\rm T}

by sorting the absolute value of all elements of

\hat{\bm{a}}_{t}^{\rm ul}

in ascending order.

4: For any user

i

, generate the

1^{\rm st}

group of association actions by

{\hat{a}_{it;1}^{\rm ul}}=\left\{{\begin{array}[]{*{20}{c}}{1,}&{{\hat{a}_{it}^{\rm ul}}>0.5,}\\ {0,}&{{\hat{a}_{it}^{\rm ul}}\leq 0.5.}\end{array}}\right.

(39)

5: For any user

i

, generate the remaining

\tilde{V}-1

groups of association actions by

{\hat{a}_{it;v}^{\rm ul}}=\left\{{\begin{array}[]{*{20}{c}}{1,}&{{\hat{a}_{it}^{\rm ul}}>{{\bar{b}}_{(v-1)t}},\ v=2,\ldots,\tilde{V},}\\ {0,}&{{\hat{a}_{it}^{\rm ul}}\leq{{\bar{b}}_{(v-1)t}},\ v=2,\ldots,\tilde{V}.}\end{array}}\right.

(40)

6: For each group of association actions

v\in\{1,2,\ldots,\tilde{V}\}

, user

i

, and AP

j

, set

\tilde{a}_{ijt;v}^{{\rm{ul}}}=\left\{{\begin{array}[]{*{20}{c}}{1,}&{\hat{a}_{it;v}^{{\rm{ul}}}=1,j=j^{\star},}\\ {0,}&{\rm otherwise.}\end{array}}\right.

(41)

where,

j^{\star}=\mathop{\arg\max}\limits_{j}[\bar{a}_{i1t}^{\rm ul},\ldots,\bar{a}_{iJt}^{\rm ul}]

7: For each group of association actions

v\in\{1,2,\ldots,\tilde{V}\}

, given the vector

{\tilde{\bm{a}}_{t;v}^{\rm ul}}=[\tilde{a}_{i1t;v}^{{\rm{ul}}},\ldots,\tilde{a}_{iJt;v}^{{\rm{ul}}}]_{i}^{\rm T}

\forall i

, solve (36) to obtain

r_{vt}^{\rm ul}

8: Select the association action

{\bm{a}}_{t}^{\rm ul}=\arg\mathop{\max}\nolimits_{\{\tilde{a}_{ijt;v}^{{\rm{ul}}}\}}r_{vt}^{{\rm{ul}}}

9: Output: The association action

{\bm{a}}_{t}^{\rm ul}

Summarily, the proposed DRL-based uplink optimization method can be presented in Algorithm 3.

Algorithm 3 DRL-based uplink optimization

1: Initialize: The maximum number of episodes

N_{epi}

, the maximum number of epochs per episode

N_{epo}

, initial exploration decaying rate

\epsilon

, DNN

\mu(\bm{s}_{t}^{\rm ul}|\theta_{t}^{\mu})

with network parameters

\theta_{t}^{\mu}

, initial reward

r_{0}^{\rm ul}=1

, and users’ randomly initialized transmit power.

2: Initialize: Replay memory with capacity

C

, minibatch size

|{{\mathcal{T}}_{t}}|

, and DNN training interval

T_{\rm ti}

3: for each episode in

\{1,\ldots,N_{epi}\}

4: Calculate the state space according to locations of APs and users and users’ randomly initialized transmit power.

5: for each epoch

\bar{t}=1,\ldots,N_{epo}

6: Select a relaxed action vector

\bar{\bm{a}}_{\bar{t}}^{\rm ul}=\mu({\bm{s}}_{\bar{t}}^{\rm ul}|\theta_{\bar{t}}^{\mu})+\epsilon N_{f}

, where

\epsilon

decays over time.

7: Call Algorithm 2 to choose the association action

\bm{a}_{\bar{t}}^{\rm ul}

8: if

\bm{a}_{\bar{t}}^{\rm ul}

results in the violation of constraints in (34) then

9: Cancel the action and update the reward by

r_{\bar{t}}^{\rm ul}=r_{\bar{t}}^{\rm ul}-\varpi|r_{{\bar{t}}-1}^{\rm ul}|

10: else

11: Execute the action and observe the subsequent state

{\bm{s}}_{{\bar{t}}+1}^{\rm ul}

12: end if

13: Store the transition

(\bm{s}_{\bar{t}}^{\rm ul},\bm{a}_{\bar{t}}^{\rm ul},\bm{s}_{{\bar{t}}+1}^{\rm ul})

in the memory.

14: If

{\bar{t}}\geq|{{\mathcal{T}}_{t}}|

, sample a random minibatch of

|{{\mathcal{T}}_{t}}|

transitions

(\bm{s}_{m}^{\rm ul},\bm{a}_{m}^{\rm ul},\bm{s}_{m+1}^{\rm ul})

from the memory.

15: If

\bar{t}\mod T_{\rm ti}==0

, update the network parameters

{\theta_{\bar{t}}^{{\mu}}}

by minimizing the loss function

L({\theta_{\bar{t}}^{{\mu}}})

using the ADAM method.

16: end for

17: end for

IV-C Solution to the downlink optimization subproblem

Like (34), (35) is also a mixed-integer and sequence-dependent optimization problem. Therefore, the procedure of solving (35) is similar to that of solving (34), and we do not present the detailed steps of the DRL-based downlink optimization method in this subsection for brevity. However, there are differences in some aspects, for example, the design of action and state space and the reward function. For the DRL-based downlink optimization method, we design its action space, state space, and the reward function as the following.

•

state space $\bm{s}_{t}^{\rm dl}$ : $\bm{s}_{t}^{\rm dl}=[\bm{o}_{t};\bm{h}_{t};{\bm{I}}_{t}^{\rm dl};\bm{g}_{t}]$ is a column vector, where $o_{jt}\in{\bm{o}}_{t}\in{\mathbb{R}}^{J}$ indicates the number of users to which AP $j$ transmits VR video frames, $h_{ijkt}\in{\bm{h}}_{t}\in{\mathbb{C}}^{NJK}$ , $I_{imt}\in{\mathbb{R}}^{N\times N}\in{\bm{I}}_{t}^{\rm dl}$ denotes whether user $m$ is the interfering user of user $i$ , and $g_{ijkt}\in{\bm{g}}_{t}\in{\mathbb{C}}^{NJK}$ .
•

action space $\bm{a}_{t}^{\rm dl}$ : $\bm{a}_{t}^{\rm dl}=[a_{1t}^{\rm dl},\ldots,a_{it}^{\rm dl},\ldots,a_{Nt}^{\rm dl}]^{\rm T}$ with $a_{it}^{\rm dl}\in\{0,1\}$ . The action of the DRL-based method at time slot $t$ is to transmit VR video frames to corresponding users.

•

reward $\bm{r}_{t}^{\rm dl}$ : given $\bm{a}_{t}^{\rm dl}$ , the reward $r_{t}^{\rm dl}$ is the objective function value of the following power control subproblem.


	$\displaystyle r_{t}^{\rm dl}=\mathop{{\rm{maximize}}}\limits_{{{\bm{G}_{it}}}}{\text{ }}B_{t}^{{\rm{dl}}}\left({\bm{a}_{t}^{{\rm{dl}}}}\right)$		(42a)
	$\displaystyle{\rm s.t.}\text{ }\quad{\rm(\ref{eq:transformed_problem}b)-(\ref{eq:transformed_problem}e)}.$		(42b)

To solve (42), Algorithm 2 can be adopted to obtain the downlink association action $\bm{a}_{t}^{\rm dl}$ . However, given $\bm{a}_{t}^{\rm dl}$ , it is still hard to solve (42) as (42) is a non-convex programming problem with the existence of the non-convex low-rank constraint (33e). To handle the non-convexity, a semidefinite relaxation (SDR) scheme is exploited. The idea of the SDR scheme is to directly drop out the non-convex low-rank constraint. After dropping the constraint (33e), it can confirm that (42) becomes a standard convex semidefinite programming (SDP) problem. This is because (33b) are (33c) are linear constraints w.r.t ${\bm{G}}_{it}$ and (42a) is a constant objective function. We can then explore some optimization tools such as MOSEK to solve the standard convex SDP problem effectively. However, owing to the relaxation, power matrices $\{\bm{G}_{it}\}$ obtained by mitigating (42) without low-rank constraints will not satisfy the low-rank constraint in general. This is due to the fact that the (convex) feasible set of the relaxed (42) is a superset of the (non-convex) feasible set of (42). The following lemma reveals the tightness of exploring the SDR scheme.

Lemma 2.

For any user $i$ at time slot $t$ , denote by $\bm{G}_{it}^{\star}$ the solution to (42). If ${\mathcal{M}}_{it}=\emptyset$ , then the SDR for $\bm{G}_{it}$ in (42) is tight, that is, ${\rm rank}(\bm{G}_{it}^{\star})\leq 1$ ; otherwise, we can not claim ${\rm rank}(\bm{G}_{it}^{\star})\leq 1$ .

Proof.

The Karush-Kuhn-Tucker (KKT) conditions can be explored to prove the tightness of resorting to the SDR scheme. Nevertheless, we omit the detailed proof for brevity as a similar proof can be found in Appendix of the work [30]. ∎

With the conclusion in Lemma 2, we can recover beamformers from the obtained power matrices. If ${\rm rank}(\bm{G}_{it}^{\star})\leq 1$ , $\forall i$ , then execute eigenvalue decomposition on $\bm{G}_{it}^{\star}$ , and the principal component is the optimal beamformer $\bm{g}_{it}^{\star}$ ; otherwise, some manipulations such as a randomization/scale scheme [31] should be performed on $\bm{G}_{it}^{\star}$ to impose the low-rank constraint.

Note that (42) should be solved for $\tilde{V}$ times at each time slot. To speed up the computation, they can be optimized in parallel. Moreover, it is tolerable to complete the computation within the interval $[t,t+M]$ as users’ locations in $M$ time slots are obtained.

Finally, we can summarize the DRL-based optimization algorithm of mitigating the problem of enhancing users’ VR experiences in Algorithm 4.

Algorithm 4 DRL-based optimization algorithm

1: Initialization: Run initialization steps of Algorithms 1, 2, and 3, and initialize the ESN training interval

T_{\rm pr}

2: Call Algorithm 3 to pre-train the uplink DNN

\mu({\bm{s}}_{t}^{\rm ul}|\theta_{t}^{\mu})

. Likewise, pre-train the downlink DNN

\mu({\bm{s}}_{t}^{\rm dl}|\theta_{t}^{Q})

3: Run steps 2-8 of Algorithm 1 to pre-train ESN models.

4: for each time slot

t=1,2,\ldots,T

5: Run step 9 of Algorithm 1 to obtain predicted location

\hat{\bm{y}}_{i(t+M)}

of each user

i

6: Run steps 6-12 of Algorithm 3 to obtain uplink association action

{\bm{a}}_{t+M}^{\rm ul}

and transmit power

{\bm{p}}_{t+M}

. Likewise, optimize the downlink association action

{\bm{a}}_{t+M}^{\rm dl}

and transmit beamformer

{\bm{g}}_{i(t+M)}

for each user

i

7: if

t\mod T_{\rm pr}==0

then

8: Steps 2-8 of Algorithm 1.

9: end if

10: end for

V Simulation and Performance Evaluation

V-A Comparison algorithms and parameter setting

To verify the effectiveness of the proposed algorithm, we compare it with three benchmark algorithms: 1) $k$ -nearest neighbors (KNN) based action quantization algorithm: The unique difference between the KNN-based algorithm and the proposed algorithm lies in the scheme of quantizing uplink and downlink action spaces. For the KNN-based algorithm, it adopts the KNN method [32] to quantize both uplink and downlink action spaces; 2) DROO algorithm: Different from the proposed algorithm, DROO leverages the order-preserving quantization method [32] to quantize both uplink and downlink action spaces; 3) Heuristic algorithm: The heuristic algorithm leverages the greedy admission algorithm in [33] to determine $\bm{a}_{t}^{\rm ul}$ and $\bm{a}_{t}^{\rm dl}$ at each time slot $t$ . Besides, the user consuming less power in this algorithm will establish the connection with an AP(s) on priority.

To test the practicality of the developed parallel ESN learning method, realistic user movement datasets are generated via Google Map. Particularly, for a user, we randomly select its starting position and ending position on the campus of Singapore University of Technology and Design (SUTD). Given two endpoints, we use Google Map to generate the user’s 2D trajectory. Next, we linearly zoom all $N$ users’ trajectories into the communication area of size $0.5\times 0.5$ km².

Additionally, the parameters related to APs and downlink transmission channels are listed as follows: the number of APs $J=3$ , the number of antenna elements $K=2$ , the antenna gain $G=5$ dB, $g=1$ dB, $\phi=\pi/3$ , $\vartheta=\pi/2$ , $W^{\rm dl}=800$ MHz, $\gamma^{\rm th}=1$ Gb/s, $\eta_{\rm LoS}=2.0$ , $\eta_{\rm NLoS}=2.4$ , $\sigma_{\rm LoS}^{2}=5.3$ , $\sigma_{\rm NLoS}^{2}=5.27$ , $D^{\rm th}=50$ m, $x_{o}=y_{o}=250$ m, $\theta_{j}=\pi/3$ , $\tilde{E}_{j}=40$ dBm, $E_{j}^{c}=30$ dBm, $H_{j}=5.5$ m, $\forall j$ [19]. User and uplink transmission channel-related parameters are shown as below: uplink system bandwidth $W^{\rm ul}=200$ MHz, $\theta^{\rm th}=200$ , $\bar{h}=1.8$ m, $\sigma_{h}^{2}=0.05$ m, $\alpha=5$ , $c_{ij}=0.3$ , $p_{i}^{c}=23$ dBm, $\tilde{p}_{i}=27$ dBm, $\forall i$ , $j$ .

Set other learning-correlated parameters as below: ${\zeta}=1$ , $\xi=0.25$ , $\bar{r}_{\rm max}=1000$ , the sample number $Q=6$ , the number of future time slots $M=8$ , $N_{i}=2$ , $\forall i$ , $N_{o}=2$ , $N_{r}=300$ , and $T_{\rm pr}=5$ . For both uplink DNN and downlink DNN, the first hidden layer has $120$ neurons, and the second hidden layer has $80$ neurons. The replay memory capacity $C=1e$ +6, $N_{epi}=10$ , $N_{epo}=1000$ , $\varpi=10$ , $\sigma^{2}=0.36$ , $\epsilon=0.99$ . More system parameters are listed as follows: carrier frequency $f_{c}=28$ GHz, light of speed $c=3.0e$ +8 m/s, noise power spectral density $N_{0}=-167$ dBm/Hz, and $T=5000$ time slots.

V-B Performance evaluation

To comprehensively understand the accuracy and the availability of the developed learning and optimization methods, we illustrate their performance results. In this simulation, we first let the AP number $J=3$ and the mobile user number $N=16$ .

To validate the accuracy of the parallel ESN learning method on predicting mobile users’ locations, we plot the actual trajectory of a randomly selected mobile user and its correspondingly predicted trajectory in Fig. 6(a). In Fig. 6(b), the accuracy, which is measured by the normalized root mean-squared error (NRMSE) [25], of predicted trajectories of $16$ mobile users is plotted. From Fig. 6, we can observe that: i) when the orientation angles of users will not change fast, the learning method can exactly predict users’ locations. When users change their moving directions quickly, the method loses their true trajectories. However, the method will re-capture users’ tracks after training ESN models based on newly collected users’ location samples; ii) the obtained NRMSE of the predicted trajectories of all mobile users will not be greater than $0.03$ . Therefore, we may conclude that the developed parallel ESN learning method can be utilized to predict mobile users’ locations.

Next, to evaluate the performance of the proposed DRL-based optimization algorithm comprehensively, we illustrate the impact of some DRL-related crucial parameters such as minibatch size, training interval, and learning rate on the convergence performance of the proposed algorithm. DNN training loss and moving average reward, which is the average of the achieved rewards over the last $50$ epochs, are leveraged as the evaluation indicators.

Fig. 7 plots the tendency of the DNN training loss and the achieved moving average reward of the proposed algorithm under diverse minibatch sizes. This figure illustrates that: i) a great minibatch size value will cause the DNN to converge slowly or even not. As shown in Fig. 7(a), $L(\theta_{465}^{\mu})=0.1885$ when we set $|{\mathcal{T}}_{t}|=512$ . Yet, $L(\theta_{465}^{\mu})=0.1023$ when we let $|{\mathcal{T}}_{t}|=64$ . The result in Fig. 7(b) shows that DNN does not converge after 10000 epochs when $|{\mathcal{T}}_{t}|=2048$ . This is because a great $|{\mathcal{T}}_{t}|$ indicates overtraining, resulting in the local minima and degraded convergence performance. Further, a large minibatch size value consumes more training time at each training epoch. Therefore, we set the training minibatch size $|{\mathcal{T}}_{t}|=64$ in the simulation; ii) when $|{\mathcal{T}}_{t}|=64$ , $r_{\bar{t}}^{\rm ul}$ and $r_{\bar{t}}^{\rm dl}$ gradually increase and stabilize at around $0.7141$ and $0.9375$ , respectively. The fluctuation is mainly caused by the random sampling of training data and user movement.

Fig. 8 illustrates the tendency of obtained uplink and downlink DNN training losses and moving average rewards under diverse training interval values. From this figure, we can observe that a small training interval value indicates faster convergence speed. For example, if we set the training interval ${T_{\rm ti}}=5$ , the obtained $r_{\bar{t}}^{\rm ul}$ converges to $0.7156$ when epoch $\bar{t}>439$ . If we let the training interval ${T_{\rm ti}}=100$ , $r_{\bar{t}}^{\rm ul}$ converges to $0.7149$ when epoch $\bar{t}>4975$ , as shown in Fig. 8(b). However, it is unnecessary to train and update the DNN frequently, which will bring more frequent policy updates, if the DNN can converge. Thus, to achieve the trade-off between the convergence speed and the policy update speed, we set ${T_{\rm ti}}=20$ in the simulation.

Fig. 9 depicts the tendency of achieved DNN training loss and moving average reward of the proposed algorithm under different learning rate configurations. From this figure, we have the following observations: i) for the uplink DNN, when given a small learning rate value, it may converge to the local optimum or even not; ii) for the downlink DNN, both a small and a great learning rate value will degrade convergence performance. Therefore, when training the uplink DNN, we set the learning rate $l_{r}^{\rm ul}=0.1$ , which can lead to good convergence performance. For instance, $r_{\bar{t}}^{\rm ul}$ converges to $0.7141$ when epoch $\bar{t}\geq 1300$ and the variance of $r_{\bar{t}}^{\rm ul}$ gradually decreases to zero with an increasing epoch $\bar{t}$ . We set the learning rate $l_{r}^{\rm dl}=0.01$ when training the downlink DNN. Given this parameter setting, the obtained $L(\theta_{\bar{t}}^{Q})$ is smaller than $0.2$ after training for $200$ epochs.

At last, we verify the superiority of the proposed algorithm by comparing it with other comparison algorithms. Particularly, we plot the achieved objective function values of all comparison algorithms under varying number of mobile users $N\in\{8,12,16,20\}$ in Fig. 10. Before the evaluation, the proposed algorithm and the other two action quantization algorithms have been trained with $10000$ independent wireless channel realizations, and their downlink and uplink action quantization policies have converged. This is reasonable because we are more interested in the long-term operation performance for field deployment. Besides, we let the service ability of an AP $\tilde{M}$ vary with $N$ with the $(N,\tilde{M})$ pair being $(8,3)$ , $(12,5)$ , $(16,6)$ , and $(20,7)$ .

We have the following observations from this figure: i) the proposed algorithm achieves the greatest objective function value. For the DROO algorithm, it gains a smaller objective function value than the proposed algorithm; for example, the achieved objective function value of DROO is $4.14\%$ less than that of the proposed algorithm. For the KNN-based algorithm, it obtains the smallest objective function value because it offers the smallest diversity in the produced uplink and downlink association action set; ii) except for heuristic algorithm, the achieved objective function values of the other comparison algorithms decrease with the number of users owing to the increasing total power consumption. For the heuristic algorithm, its obtained objective function value increases with $N$ mainly because more users can successfully access to APs.

VI Conclusion

This paper investigated the problem of enhancing VR visual experiences for mobile users and formulated the problem as a sequence-dependent problem aiming at maximizing users’ feeling of presence in VR environments while minimizing the total power consumption of users’ HMDs. This problem was confirmed to be a mixed-integer and non-convex optimization problem, the solution of which also needed accurate users’ tracking information. To solve this problem effectively, we developed a parallel ESN learning method to predict users’ tracking information, with which a DRL-based optimization algorithm was proposed. Specifically, this algorithm first decomposed the formulated problem into an association subproblem and a power control subproblem. Then, a DNN joint with an action quantization scheme was implemented as a scalable solution that learnt association variables from experience. Next, the power control subproblem with an SDR scheme being explored to tackle its non-convexity was leveraged to criticize the association variables. Finally, simulation results were provided to verify the accuracy of the learning method and showed that the proposed algorithm could improve the energy efficiency by at least $4.14\%$ compared with various benchmark algorithms.

-A Proof of Lemma 1

For any user $i\in{\mathcal{U}}$ , suppose we are provided with a sequence of $Q$ desired input-output pairs $\{(\bm{x}_{i(t-Q)},{\bm{y}}_{i(t-Q+1)}),\ldots,(\bm{x}_{i(t-1)},{\bm{y}}_{it})\}$ . With the input-output pairs, generate the hidden matrix ${\bm{X}_{it}}=\left[{\begin{array}[]{*{20}{c}}\begin{array}[]{l}{\bm{x}_{i(t-1)}}\\ {\bm{s}_{i(t-1)}}\end{array}&\cdots&\begin{array}[]{l}{\bm{x}_{i(t-Q)}}\\ {\bm{s}_{i(t-Q)}}\end{array}\end{array}}\right]$ and the corresponding target location matrix ${\bm{Y}}_{it}=[\bm{y}_{it}^{\rm T};\ldots;\bm{y}_{i(t-Q+1)}^{\rm T}]$ at time slot $t$ . We next introduce an auxiliary matrix $\bm{U}=\bm{X}^{\rm T}\bm{W}\in{\mathbb{R}}^{Q\times N_{o}}$ , wherein we lighten the notation $\bm{X}_{it}$ for $\bm{X}$ . According to the Lagrange dual decomposition method, we can rewrite (18) as follows

\begin{array}[]{l}\frac{1}{Q}\mathop{\rm minimize}\limits_{\bm{W},\bm{U}}\quad\left\{{{l(\bm{X}^{\rm T}\bm{W})}+\xi Qr(\bm{W})}\right.+\left.\bm{A}\odot\bm{U}-\bm{A}\odot\bm{X^{\rm T}W}\right\}\\ =\frac{1}{Q}\mathop{\inf}\limits_{\bm{W}}\left\{{-\sum\limits_{n=1}^{N_{o}}{[\bm{A}\bm{z}_{n}]^{\rm T}{\bm{X}^{\rm T}\bm{W}\bm{z}_{n}}}+\xi Qr(\bm{W})}\right\}+\\ \qquad\qquad\frac{1}{Q}\mathop{\inf}\limits_{\bm{U}}\left\{{{l(\bm{U})}+\sum\limits_{n=1}^{N_{o}}{[\bm{A}\bm{z}_{n}]^{\rm T}\bm{U}\bm{z}_{n}}}\right\}\\ =-{\xi}\mathop{\sup}\limits_{\bm{W}}\left\{{\frac{1}{{\xi Q}}\sum\limits_{n=1}^{N_{o}}{[\bm{A}\bm{z}_{n}]^{\rm T}\bm{X}^{\rm T}\bm{W}\bm{z}_{n}}-{r(\bm{W})}}\right\}-\\ \qquad\qquad\frac{1}{Q}\sum\limits_{j=1}^{J}\sum\limits_{m\in{\mathcal{Q}}_{j}}{\sum\limits_{n=1}^{N_{o}}{\mathop{\sup}\limits_{u_{mn}}\left\{{-a_{mn}u_{mn}-{l(u_{mn})}}\right\}}}\\ =-\xi{r^{\star}}\left(\frac{1}{\xi Q}{\bm{A}^{\rm T}{\bm{X}^{\rm T}}}\right)-\frac{1}{Q}\sum\limits_{j=1}^{J}\sum\limits_{m\in{\mathcal{Q}}_{j}}{\sum\limits_{n=1}^{N_{o}}{l^{\star}(-a_{mn})}}\\ \buildrel\Delta\over{=}D(\bm{A})\end{array}

(43)

where $\bm{z}_{n}\in{\mathbb{R}}^{N_{o}}$ is a column vector with the $n$ -th element being one and all other elements being zero, ${\mathcal{Q}}_{j}$ is an index set including the indices of $Q$ data samples fed to slave VM $j$ .

Let $\bar{r}(\bm{C})={\frac{1}{\xi Q}\sum\limits_{n=1}^{N_{o}}{\bm{z}_{n}^{\rm T}\bm{C}{\bm{Wz_{n}}}}-{r(\bm{W})}}$ , where ${\bm{C}}=\frac{1}{{\xi Q}}\bm{A}^{\rm T}\bm{X}^{\rm T}$ , and denote ${\bm{W}}^{\star}$ as the optimal solution to $\mathop{\sup}\limits_{\bm{W}}{\bar{r}(\bm{C})}$ . Then, calculate the derivative of ${\bar{r}(\bm{C})}$ w.r.t ${\bm{W}}$ ,

\frac{{d\bar{r}(\bm{C})}}{{d{\bm{W}}}}=\sum_{n=1}^{N_{o}}{{\bm{C}}_{n}{\bm{z}}_{n}^{\rm T}}-2{\bm{W}}

(44)

where ${\bm{C}}_{n}={\bm{C}}^{\rm T}{\bm{z}}_{n}$ .

As ${\bm{W}\in{\mathbb{R}}^{{(N_{i}+N_{r})}\times N_{o}}}$ , the necessary and sufficient condition for obtaining ${\bm{W}}^{\star}$ is to enforce $\frac{{d\bar{r}(\bm{C})}}{{d{\bm{W}}^{\star}}}=0$ . Then, we have

{\bm{W}}^{\star}=\frac{1}{2}\sum_{n=1}^{N_{o}}{{\bm{C}}_{n}{\bm{z}}_{n}^{\rm T}}

(45)

By substituting (45) into $r^{\star}({\bm{C}})$ , we can obtain (22).

Similarly, denote $u_{mn}^{\star}$ for any $m\in\{1,2,\ldots,Q\}$ and $n\in\{1,2,\ldots,N_{o}\}$ as the optimal solution to $l^{\star}(-a_{mn})$ . As ${\bm{U}}\in{\mathbb{R}}^{Q\times N_{o}}$ , the necessary and sufficient condition for $u_{mn}^{\star}$ is to execute $\frac{{dl^{\star}(-a_{mn})}}{{d{u_{mn}^{\star}}}}=-a_{mn}-u_{mn}^{\star}+y_{mn}=0$ . By substituting $u_{mn}^{\star}$ into $l^{\star}(-a_{mn})$ , we can obtain (23). This completes the proof.

References

[1] X. Hou, S. Dey, J. Zhang, and M. Budagavi, “Predictive adaptive streaming to enable mobile 360-degree and VR experiences,” IEEE Trans. Multimedia, vol. 23, pp. 716–731, 2021.
[2] H. Bellini, “The real deal with virtual and augmented reality,” https://www.goldmansachs.com/insights/pages/virtual-and-augmented-reality.html, Feb. 2016.
[3] C. Wiltz, “5 major challenges for VR to overcome,” https://www.designnews.com/electronics-test/5-major-challenges-vr-overcome, Apr, 2017.
[4] Y. Liu, J. Liu, A. Argyriou, and S. Ci, “MEC-assisted panoramic VR video streaming over millimeter wave mobile networks,” IEEE Trans. Multimedia, vol. 21, no. 5, pp. 1302–1316, 2019.
[5] J. Dai, Z. Zhang, S. Mao, and D. Liu, “A view synthesis-based $360^{\circ}$ VR caching system over MEC-enabled C-RAN,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 10, pp. 3843–3855, 2020.
[6] Z. Lai, Y. C. Hu, Y. Cui, L. Sun, N. Dai, and H. Lee, “Furion: Engineering high-quality immersive virtual reality on today’s mobile devices,” IEEE Trans. Mob. Comput., vol. 19, no. 7, pp. 1586–1602, 2020.
[7] X. Hou, Y. Lu, and S. Dey, “Wireless VR/AR with edge/cloud computing,” in ICCCN. IEEE, 2017, pp. 1–8.
[8] Qualcomm, “Whitepaper: Making immersive virtual reality possible in mobile,” https://www.qualcomm.com/media/documents/files/whitepaper-making-immersive-virtual-reality-possible-in-mobile.pdf, Mar. 2016.
[9] Oculus, “Mobile VR media overview. Accessed: Sep. 2018,” https://www.oculus.com/, 2018.
[10] HTC, “HTC Vive. Accessed: Sep. 2018,” https://www.vive.com/us/, 2018.
[11] T. Dang and M. Peng, “Joint radio communication, caching, and computing design for mobile virtual reality delivery in fog radio access networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 7, pp. 1594–1607, 2019.
[12] X. Liu, Q. Xiao, V. Gopalakrishnan, B. Han, F. Qian, and M. Varvello, “360^∘ innovations for panoramic video streaming,” in HotNets. ACM, 2017, pp. 50–56.
[13] R. Ju, J. He, F. Sun, J. Li, F. Li, J. Zhu, and L. Han, “Ultra wide view based panoramic VR streaming,” in VR/AR NetworkSIGCOMM. ACM, 2017, pp. 19–23.
[14] S. Mangiante, G. Klas, A. Navon, G. Zhuang, R. Ju, and M. D. Silva, “VR is on the edge: How to deliver $360^{\circ}$ videos in mobile networks,” in VR/AR NetworkSIGCOMM. ACM, 2017, pp. 30–35.
[15] V. R. Gaddam, M. Riegler, R. Eg, C. Griwodz, and P. Halvorsen, “Tiling in interactive panoramic video: Approaches and evaluation,” IEEE Trans. Multimedia, vol. 18, no. 9, pp. 1819–1831, 2016.
[16] X. Corbillon, G. Simon, A. Devlic, and J. Chakareski, “Viewport-adaptive navigable 360-degree video delivery,” in ICC. IEEE, 2017, pp. 1–7.
[17] C. Perfecto, M. S. ElBamby, J. D. Ser, and M. Bennis, “Taming the latency in multi-user VR $360^{\circ}$ : A QoE-aware deep learning-aided multicast framework,” IEEE Trans. Commun., vol. 68, no. 4, pp. 2491–2508, 2020.
[18] M. S. ElBamby, C. Perfecto, M. Bennis, and K. Doppler, “Edge computing meets millimeter-wave enabled VR: paving the way to cutting the cord,” in WCNC. IEEE, 2018, pp. 1–6.
[19] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, “Federated echo state learning for minimizing breaks in presence in wireless virtual reality networks,” IEEE Trans. Wirel. Commun., vol. 19, no. 1, pp. 177–191, 2020.
[20] P. Yang, X. Xi, Y. Fu, T. Q. S. Quek, X. Cao, and D. O. Wu, “Multicast eMBB and bursty URLLC service multiplexing in a CoMP-enabled RAN,” IEEE Trans. Wirel. Commun., vol. 20, no. 5, pp. 3061–3077, 2021.
[21] Q. Cheng, H. Shan, W. Zhuang, L. Yu, Z. Zhang, and T. Q. S. Quek, “Design and analysis of MEC-and proactive caching-based $360^{\circ}$ mobile VR video streaming,” IEEE Trans. Multimedia, 2021, in press. DOI: 10.1109/TMM.2021.3067205.
[22] Y. Sun, Z. Chen, M. Tao, and H. Liu, “Communications, caching, and computing for mobile virtual reality: Modeling and tradeoff,” IEEE Trans. Commun., vol. 67, no. 11, pp. 7573–7586, 2019.
[23] O. Semiari, W. Saad, M. Bennis, and Z. Dawy, “Inter-operator resource management for millimeter wave multi-hop backhaul networks,” IEEE Trans. Wirel. Commun., vol. 16, no. 8, pp. 5258–5272, 2017.
[24] S. Bouchard, J. St-Jacques, G. Robillard, and P. Renaud, “Anxiety increases the feeling of presence in virtual reality,” Presence Teleoperators Virtual Environ., vol. 17, no. 4, pp. 376–391, 2008.
[25] S. Scardapane, D. Wang, and M. Panella, “A decentralized training algorithm for echo state networks in distributed big data applications,” Neural Networks, vol. 78, pp. 65–74, 2016.
[26] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2020.
[27] M. Bennis, S. M. Perlaza, P. Blasco, Z. Han, and H. V. Poor, “Self-organization in small cell networks: A reinforcement learning approach,” IEEE Trans. Wirel. Commun., vol. 12, no. 7, pp. 3202–3212, 2013.
[28] P. Yang, X. Cao, X. Xi, W. Du, Z. Xiao, and D. O. Wu, “Three-dimensional continuous movement control of drone cells for energy-efficient communication coverage,” IEEE Trans. Veh. Technol., vol. 68, no. 7, pp. 6535–6546, 2019.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015.
[30] P. Yang, X. Xi, T. Q. S. Quek, J. Chen, X. Cao, and D. Wu, “How should I orchestrate resources of my slices for bursty URLLC service provision?” IEEE Trans. Commun., vol. 69, no. 2, pp. 1134–1146, 2020.
[31] Z. Luo, W. Ma, A. M. C. So, Y. Ye, and S. Zhang, “Semidefinite relaxation of quadratic optimization problems,” IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 20–34, 2010.
[32] L. Huang, S. Bi, and Y. A. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks,” IEEE Trans. Mob. Comput., vol. 19, no. 11, pp. 2581–2593, 2020.
[33] J. Tang, B. Shim, and T. Q. S. Quek, “Service multiplexing and revenue maximization in sliced C-RAN incorporated with URLLC and multicast eMBB,” IEEE J. Sel. Areas Commun., vol. 37, no. 4, pp. 881–895, 2019.