
Virtual Reality in Metaverse over Wireless Networks with User-centered Deep Reinforcement Learning

Wenhan Yu Interdisciplinary Graduate Programme
Nanyang Technological University
wenhan002@e.ntu.edu.sg
   Terence Jie Chua Interdisciplinary Graduate Programme
Nanyang Technological University
terencej001@e.ntu.edu.sg
   Jun Zhao School of Computer Science & Engineering
Nanyang Technological University
junzhao@ntu.edu.sg
Abstract

The Metaverse and its promises are fast becoming reality as maturing technologies empower its different facets. One of the highlights of the Metaverse is that it offers the possibility of highly immersive and interactive socialization. Virtual reality (VR) technologies are the backbone of the virtual universe within the Metaverse, as they enable a hyper-realistic and immersive experience, especially in the context of socialization. Since the 3D scenes of the virtual world must be rendered at high resolution and frame rate, these scenes are offloaded to an edge server for computation. Moreover, the Metaverse is user-centered by design, and human users are always at its core. In this work, we introduce a multi-user VR computation offloading scenario over wireless communication. In addition, we devise a novel user-centered deep reinforcement learning approach to find a near-optimal solution. Extensive experiments demonstrate that our approach leads to remarkable results under various requirements and constraints.

Index Terms:
Metaverse, computation offloading, reinforcement learning, wireless networks

I Introduction

Background. Maturing technologies in areas such as 6G wireless networks [1] and high-performance extended reality (XR) technology [2] have empowered the development of the Metaverse [3]. One of the key developments of the Metaverse is highly interactive and immersive socialization. Users can interact with one another via full-body avatars, improving the overall socialization experience.

Motivation. Virtual reality is a key feature of an immersive Metaverse socialization experience. Compared to traditional two-dimensional images, generating $360^{\circ}$ panoramic images for the VR experience is computationally intensive. However, the rendering and computation of scenes of high resolution and frame rate are still not feasible on existing VR devices, due to the lack of local device computing power. A feasible solution to powering an immersive socialization experience on VR devices is through computation offloading [4]. In addition, the Metaverse is a user-centric application by design, and we need to place the user experience at the core of the network design [5]. Therefore, we have to consider a multi-user socialization scenario in which each user has a different purpose of use and different requirements. This propels us to seek a more user-centered and user-oriented solution.

Related work. In recent years, VR services over wireless communication have been thoroughly studied. Chen et al. studied the quality of service of a VR service over wireless communication using an echo state network [6]. However, none of the previous works considered the varying purposes of use and requirements across users. Although MEC-based VR services are thoroughly studied, few works considered a sequential scenario over wireless communication. Machine-learning-based approaches have been widely adopted to tackle wireless communication challenges [7, 8], and DRL has been proven to achieve excellent performance. Meng et al. [9] addressed the synchronization between physical objects and their digital models in the Metaverse with deep reinforcement learning (DRL). This is due to the ability of DRL agents to explore and exploit in self-defined environments [10]. However, no existing work has designed a user-centered and user-oriented DRL method.

Approach. This paper proposes a novel multi-user VR model in a downlink Non-Orthogonal Multiple Access (NOMA) system [11]. We design a novel DRL algorithm that considers the varying purposes of use and requirements of the users, re-designing the Proximal Policy Optimization (PPO) algorithm [12] with a reward decomposition structure.

Contributions. Our contributions are as follows:

  • User-centered Computation Offloading VR Formulation: We study user-centered Metaverse computation offloading over wireless networks, designing a multi-user scenario in which a Virtual Service Provider (VSP) assists users in generating reality-assisted virtual environments.

  • HRPPO: We craft a novel DRL algorithm, Hybrid Reward PPO (HRPPO), to tackle the proposed channel allocation problem. HRPPO is imbued with the hybrid reward architecture (HRA), giving it a more user-centered perspective.

  • DRL Scenario Design: The designs of the three core DRL elements (state, action, and reward) are explained in detail. Extensive experiments demonstrate the effectiveness of our method.

The rest of the paper is organized as follows. Section II introduces our system model. Sections III and IV then present our deep reinforcement learning settings and approach. In Section V, extensive experiments are performed, and various methods are compared to show the strength of our strategy. Section VI concludes the paper.

Figure 1: Virtual reality in the Metaverse over wireless networks.

II System model

Consider a multi-user wireless downlink transmission in an indoor environment, in which one second is divided into $T$ time steps. To ensure a smooth experience, we consider a slotted structure with a clock signal sent by the server for synchronization; each slot carries one high-resolution frame transmission, and the duration of each time slot is $\iota$ ($\iota=\frac{1}{T}$). In each time step, a sequence of varying-resolution 3D scenes is generated by the VSP and sent to $N$ VR device users (VUs) $\mathcal{N}=\{1,2,\dots,N\}$ with distinct characteristics (e.g., computation capability) and different requirements (e.g., tolerable delay). Each user is selected for either (1) computation offloading, by offloading the tracking vectors $\chi_{n}$ [13] to the virtual service provider (VSP) for scene rendering, or (2) local computing, by receiving tracking vectors from others and rendering the scenes locally with a lower computation capability and at the expense of energy consumption. If a user is selected for computation offloading, the VSP generates the frame and sends it back via one of a set of channels $\mathcal{M}=\{1,2,\dots,M\}$.

Each user can accept virtual scene frame rates as low as a minimum tolerable frames per second (FPS) $\tau_{n,F}$, which is the number of successfully received frames in a second. Since the tracking vectors are relatively small [13], we assume that they are transmitted over dedicated VSP-VU and VU-VU channels and neglect their overhead.

We use $\Gamma^{t}=\{\Gamma_{1}^{t},\Gamma_{2}^{t},\dots,\Gamma_{N}^{t}\}$ to denote the downlink channel arrangement and, inherently, the computing method (VSP or local computation). $\Gamma_{n}^{t}=m$ indicates that VU $n$ is assigned to channel $m$ at time step $t$, and $\Gamma_{n}^{t}=0$ means that VU $n$ renders the scene locally.

Thus, it is imperative to devise a comprehensive algorithm that takes into account the varying satisfaction thresholds and requirements of the users. In the next sections, we explain the computation offloading and local computing models in detail. The system model is shown in Fig. 1.

II-A Computation offloading model

We first introduce the computation offloading model, which is based on the wireless cellular network. The VSP server manages the downlink channels $\mathcal{M}$ of all VUs $\mathcal{N}$. Furthermore, we denote by $D_{n}^{t}$ ($n\in\mathcal{N}$) the size of the virtual scene frame that needs to be transmitted to user $n$ at time step $t$.

We adopt the Non-Orthogonal Multiple Access (NOMA) system as this work's propagation model. In a NOMA system, several users can be multiplexed on one channel through successive interference cancellation (SIC) and superposition coding, and the received signals of the VUs in channel $m$ are sorted in descending order: $p_{1}|h_{1,m}^{t}|^{2}>p_{2}|h_{2,m}^{t}|^{2}>\dots>p_{N}|h_{N,m}^{t}|^{2}$ [11]. In this paper, we assume the decoders of the VUs can recover the signals from each channel through SIC. We denote by $h_{n,m}^{t}$ the channel gain between the VSP and the $n$-th user allocated to channel $m$ at time step (iteration) $t$. The downlink rate can be expressed as [11]:

$r_{n}^{t}=W\log\left(1+\frac{p_{n}|h_{n,m}^{t}|^{2}}{\sum_{i=n+1}^{N}p_{i}|h_{n,m}^{t}|^{2}+W\sigma^{2}}\right).$ (1)

$P_{d}=\{p_{1},p_{2},\dots,p_{N}\}$ denotes the transmission power of each VU's device. Note that the transmission power is not time-dependent in our scenario. $h_{n,m}^{t}=g_{n,m}^{t}l_{n}^{-\alpha}$ denotes the channel gain between VU $n$ and the VSP in channel $m$, with $g_{n,m}^{t}$, $l_{n}$, and $\alpha$ being the Rayleigh fading parameter, the distance between VU $n$ and the VSP, and the path loss exponent, respectively. $W$ is the bandwidth of each channel, and $W\sigma^{2}$ denotes the background noise power. Accordingly, the total delay $d_{n,o}^{t}$ of each frame in time step $t$ is divided into (1) execution time and (2) downlink transmission time:

$d_{n,o}^{t}=\frac{D_{n}^{t}\times C_{n}^{t}}{f_{v}}+\frac{D_{n}^{t}}{r_{n}^{t}},$ (2)

where $f_{v}$ is the computation capability of the VSP, and $C_{n}^{t}$ is the required number of CPU cycles per bit for this frame [14].
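To make equations (1) and (2) concrete, the following is a minimal numerical sketch of the per-VU NOMA rate and offloading delay on a single channel. Variable names are ours; the base-2 logarithm and the assumption that the inputs are already sorted in SIC decoding order are illustrative choices, not specified by the model above.

```python
import numpy as np

def downlink_rates(p, h, W, sigma2):
    """Per-VU NOMA downlink rate on one channel, following eq. (1).

    p, h: arrays of transmit powers and channel gains of the VUs multiplexed on
    this channel, assumed pre-sorted so that p[0]*|h[0]|^2 > p[1]*|h[1]|^2 > ...
    """
    rates = np.zeros(len(p))
    for n in range(len(p)):
        # signals of later-ordered VUs are treated as interference at VU n
        interference = np.sum(p[n + 1:]) * np.abs(h[n]) ** 2
        rates[n] = W * np.log2(1 + p[n] * np.abs(h[n]) ** 2 / (interference + W * sigma2))
    return rates  # bit/s (assuming log base 2)

def offload_delay(D, C, f_v, r):
    """Total offloading delay of one frame, eq. (2): VSP execution + downlink transmission."""
    return D * C / f_v + D / r
```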

II-B Local computing model

When a VU is not allocated a channel, it needs to generate the virtual world frames locally at the expense of energy consumption. Let $f_{n}$ be the computation capability of VU $n$, which varies across VUs. Adopting the model from [15], the energy per CPU cycle can be expressed as $e_{n,cyc}=\eta f_{n}^{2}$. Therefore, the overhead of local computing in terms of execution delay and energy can be derived as:

$d_{n,l}^{t}=\frac{D_{n}^{t}\times C_{n}^{t}}{f_{n}},$ (3)
$e_{n,l}^{t}=\mu_{n}\times D_{n}^{t}\times C_{n}^{t}\times e_{n,cyc},$ (4)

where $\mu_{n}$ is the energy weighting parameter of VU $n$. Since the battery states of the VUs differ, we assume that $\mu_{n}$ is closer to 0 when the battery level is higher.
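A correspondingly small sketch of the local computing overhead in equations (3) and (4), with the same illustrative naming conventions as above:

```python
def local_overhead(D, C, f_n, eta, mu_n):
    """Local execution delay and weighted energy for one frame, eqs. (3)-(4).

    D: frame size in bits, C: CPU cycles per bit, f_n: VU CPU frequency (cycles/s),
    eta: effective switched-capacitance constant, mu_n: battery-dependent energy weight.
    """
    delay = D * C / f_n            # eq. (3)
    e_cyc = eta * f_n ** 2         # energy per CPU cycle
    energy = mu_n * D * C * e_cyc  # eq. (4), weighted by battery state
    return delay, energy
```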

II-C Problem formulation

With the slotted structure, we set the VUs' maximum tolerable delay for every frame to $\iota$ as the problem constraint. Different users have different purposes of use (video games, group chat, etc.), and thus also have varying expectations of a satisfactory number of frames per second $\tau_{n,F}$. We denote the tolerable frame transmission failure count of VU $n$ as $\tau_{n,f}$, initialized as $\tau_{n,f}^{0}=T-\tau_{n,F}$. For each frame whose delay exceeds the tolerable threshold, the VU's remaining tolerable count decreases: $\tau_{n,f}^{t+1}=\tau_{n,f}^{t}-I_{n}^{t}$, where

$I_{n}^{t}=\begin{cases}1,&\text{if}~d_{n,o}^{t}>\iota~\text{or}~d_{n,l}^{t}>\iota,\\0,&\text{otherwise}.\end{cases}$ (5)

Our goal is to find the near-optimal channel arrangement for the transmission of $T$ frames, so as to minimize the total frame transmission failure count and VU device energy consumption:

$\min\limits_{\Gamma^{1},\dots,\Gamma^{T}}~\sum_{n\in\mathcal{N}}\sum_{t=0}^{T}\left[\omega_{1}I_{n}^{t}+\omega_{2}e_{n,l}^{t}\right].$ (6)
$\text{s.t.}~C1:~\tau_{n,f}^{t}\geq 0,~\forall n\in\mathcal{N},\forall t\in[0,T].$ (7)
$\quad~~~C2:~\Gamma_{n}^{t}\in\{0,1,\dots,M\},~\forall n\in\mathcal{N},\forall t\in[0,T].$ (8)

$\omega_{1}$ and $\omega_{2}$ are the weighting parameters for delay and energy. Constraint $C1$ ensures that the frame transmission failure count of each user stays within its tolerable limit. Constraint $C2$ defines our integer optimization variable, which denotes the computing method and channel assignment for each user at every time step.

This formulated problem is sequential: the remaining tolerable frame transmission failure count $\tau_{n,f}^{t}$ of each user changes over time and influences the subsequent states. Convex optimization methods are unsuitable for our proposed problem due to the huge space of integer variables and the daunting computational complexity. Moreover, as the problem contains many random variables, model-based RL approaches, which require transition probabilities, are also infeasible. We next introduce our deep RL environment settings according to the formulated problem.
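To illustrate the sequential structure, the sketch below shows how the failure indicator of equation (5), the tolerance update, and constraint $C1$ interact within one time step. The episode-termination check is an assumption consistent with $C1$ and the reward design described later; names are ours.

```python
def step_tolerance(d_n, tau_f, iota):
    """Update one VU's remaining tolerable failure count after one frame.

    d_n: realized frame delay (offloading or local), iota: slot duration 1/T,
    tau_f: remaining tolerable failure count before this frame.
    """
    failed = 1 if d_n > iota else 0   # indicator I_n^t, eq. (5)
    tau_f -= failed                   # tolerance update tau_{n,f}^{t+1}
    episode_ends = tau_f <= 0         # tolerance exhausted: constraint C1 can no longer hold
    return failed, tau_f, episode_ends
```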

III Deep reinforcement learning setting

For a reinforcement learning environment (problem), the most important components are (1) State: the key factors for the agent to make a decision; (2) Action: the operation decided by the agent to interact with the environment; and (3) Reward: the feedback for the agent to evaluate the action taken in a given state. We expound on these three components next.

III-A State

We include the following attributes in the state: (1) each VU's virtual world frame size $D_{n}^{t}$; (2) each VU's remaining tolerable frame transmission failure count $\tau_{n,f}^{t}$; (3) the channel gain of each VU, $h_{n,m}^{t}$; and (4) the number of frames remaining to be transmitted, $(T-t)$.

Figure 2: UL action encoding method.

III-B Action

The discrete action, i.e., the channel assignment for each VU, is:

$a_{u}^{t}=\Gamma^{t}=\{\Gamma_{1}^{t},\Gamma_{2}^{t},\dots,\Gamma_{N}^{t}\},$ (9)
$\text{s.t.}~~\Gamma_{n}^{t}\in\{0,1,\dots,M\}.$ (10)

In practice, we use a tuple of $N$ elements corresponding to the $N$ users, where each element can take $M+1$ values: one for each of the $M$ channels, plus one for a user being assigned to local computing. However, we need to encode these tuples as discrete numbers so that they can be evaluated by the neural network. The encoding method is shown in Fig. 2; a plausible sketch is given below.
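The following is a minimal sketch of one possible encoding, assuming a base-$(M+1)$ positional mapping between assignment tuples and action indices. The paper's exact mapping is the one depicted in Fig. 2 and may differ; function names are ours.

```python
def encode_action(channels, M):
    """Map an assignment tuple (Gamma_1, ..., Gamma_N), each in {0, ..., M},
    to a single discrete index in {0, ..., (M+1)^N - 1}.
    Base-(M+1) positional encoding is one plausible scheme (an assumption)."""
    index = 0
    for g in channels:
        index = index * (M + 1) + g
    return index

def decode_action(index, N, M):
    """Inverse mapping: recover the assignment tuple from the discrete index."""
    channels = []
    for _ in range(N):
        channels.append(index % (M + 1))
        index //= M + 1
    return list(reversed(channels))
```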

III-C Reward

As the main objective is to minimize the frame transmission failure count and energy consumption, the overall reward $R_{n}^{t}$ for each VU contains (1) a penalty $R_{n,f}^{t}$ for every frame transmission failure and (2) a weighted reward $R_{n,e}^{t}$ for energy consumption corresponding to the VU's battery life. To implement the tolerance constraint $C1$, we give (3) a huge penalty $R_{n,end}^{t}$, proportional to the number of frames left to be transmitted, when any VU's remaining tolerable frame transmission failure count reaches 0. In case (3), the episode ends immediately.
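A sketch of how such a per-VU reward could be assembled is given below. The weights and the exact scaling of the terminal penalty are illustrative assumptions, not the paper's values.

```python
def per_vu_reward(failed, energy, tau_f, frames_left,
                  w_fail=1.0, w_energy=0.1, w_end=10.0):
    """Per-VU reward R_n^t as described above.

    failed: indicator I_n^t for this frame,
    energy: local energy e_{n,l}^t (0 if the VU was offloaded),
    tau_f: remaining tolerable failure count, frames_left: T - t.
    The weights are placeholders for illustration only.
    """
    reward = -w_fail * failed - w_energy * energy  # (1) failure penalty + (2) energy term
    if tau_f <= 0:                                 # (3) tolerance exhausted: terminal penalty
        reward -= w_end * frames_left              # scaled by the frames left to transmit
    return reward
```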

IV Deep Reinforcement learning Approach

Our proposed Hybrid Reward Proximal Policy Optimization (HRPPO) is based on the Proximal Policy Optimization (PPO) algorithm, which is considered a state-of-the-art RL algorithm [12]. HRPPO is inspired by the Hybrid Reward Architecture (HRA) [16]. Thus, PPO and HRA preliminaries are introduced first, after which we explain HRPPO.

IV-A Preliminary

IV-A1 Proximal Policy Optimization (PPO)

As we emphasize developing a user-centered model that considers the VUs' varying purposes of use and requirements, policy stability is essential. Proximal Policy Optimization (PPO) by OpenAI [12] is an enhancement of the traditional policy gradient algorithm. PPO has better sample efficiency by using a separate policy for sampling and is more stable thanks to an embedded policy constraint.

In summary, PPO has two main characteristics in its policy network (Actor). (1) Increased sample efficiency: PPO samples trajectories with a policy that is separate from the one being optimized. Here we use $\pi_{\theta}$ as the policy being optimized and $\pi_{\theta'}$ as the data sampling policy. As we use $\pi_{\theta'}$ to sample data for training, the expectation can be rewritten as:

$\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta}}[\pi_{\theta}(a^{t}|s^{t})A^{t}]=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta'}}\left[\frac{\pi_{\theta}(a^{t}|s^{t})}{\pi_{\theta'}(a^{t}|s^{t})}A^{t}\right].$ (11)

(2) Policy constraint: after switching the data sampling policy from $\pi_{\theta}$ to $\pi_{\theta'}$, an issue remains. Although the two sides of equation (11) have similar expected values, their variances are starkly distinct. A KL-divergence penalty could therefore be added to the objective to constrain the distance between the two policies; however, the KL divergence is impractical to compute, as the constraint is imposed on every observation. Thus, we rewrite the objective function as $\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta'}}[f^{t}(\theta)A^{t}]$ [12], where

$f^{t}(\theta)=\min\{r^{t}(\theta),\text{clip}(r^{t}(\theta),1-\epsilon,1+\epsilon)\}.$ (12)

Here $r^{t}(\theta)=\frac{\pi_{\theta}(a^{t}|s^{t})}{\pi_{\theta'}(a^{t}|s^{t})}$. The problem is solved by gradient ascent; therefore, the gradient can be written as:

$\Delta\theta=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta'}}[\nabla f^{t}(\theta)A^{t}].$ (13)

In terms of the value network (Critic), PPO uses the same critic as other actor-critic algorithms, and the loss function can be formulated as in [12]:

$L(\phi)=[V_{\phi}(s^{t})-(A^{t}+V_{\phi'}(s^{t}))]^{2}.$ (14)

$V(s)$ is the widely used state-value function [17], estimated by a learned critic network with parameters $\phi$. We update $\phi$ by minimizing $L(\phi)$, and periodically synchronize the parameters $\phi'$ of the target state-value function with $\phi$. Using a target value network is a prevailing technique in RL that has been used in many algorithms [17].
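For concreteness, the following is a minimal PyTorch-style sketch of the clipped surrogate of equations (12)-(13) and the critic loss of equation (14), where the advantage multiplies both the raw and the clipped ratio as in the standard clipped objective of [12]. Tensor names are ours.

```python
import torch

def ppo_losses(logp_new, logp_old, adv, values, values_target, eps=0.2):
    """Clipped PPO surrogate (eqs. (11)-(13)) and critic loss (eq. (14)).

    logp_new: log pi_theta(a|s) under the current policy,
    logp_old: log pi_theta'(a|s) under the sampling policy,
    adv: advantage estimates A^t, values: V_phi(s^t),
    values_target: A^t + V_phi'(s^t), computed with the target critic.
    """
    ratio = torch.exp(logp_new - logp_old.detach())              # r^t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # maximize the surrogate via gradient ascent == minimize its negative
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean()
    critic_loss = ((values - values_target.detach()) ** 2).mean()
    return actor_loss, critic_loss
```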

IV-A2 Hybrid Reward Architecture (HRA)

High-dimensional objective functions are common in communication problems, especially for multi-user scenarios, since we usually need to consider multiple factors and distinct requirements of different users. This issue of using RL to solve a high dimensional objective function was first studied in [16]. In their work, they proposed the HRA structure for Deep Q-learning (DQN) which aims to decompose high-dimensional objective functions into several simpler objective functions. HRA has remarkable performance in handling high-dimensional objectives, which serves as the inspiration for our work.

Figure 3: Hybrid Reward PPO.

IV-B HRPPO

In contrast to decomposing the overall reward into separate sub-goal rewards as done in [16], we build a user-centered reward decomposition architecture as an extension to PPO, Hybrid Reward PPO (HRPPO), which takes in the rewards of the different users and calculates their state-values separately. In other words, we give the network a view of the state-value of each user, instead of merely evaluating an action based on a single overall state-value.

Function process: In each episode, when the current transmission is accomplished with the selected action $a^{t}$, the environment issues the rewards $R_{1}^{t},R_{2}^{t},\dots,R_{N}^{t}$ as feedback to the different VUs. These rewards, along with the current state and the next state, are sent to the Critic to generate the state-values $V_{1}^{t},V_{2}^{t},\dots,V_{N}^{t}$, representing the state-value of each VU. The state-values are then used to calculate the advantages and losses for each VU. This process is illustrated in Fig. 3.

Update function: In equation (13), we established the policy gradient for the PPO Actor; in HRPPO, the gradient $\Delta\theta$ is:

$\Delta\theta=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta'}}\Big[\nabla f^{t}(\theta)\sum_{n=1}^{N}A_{n}^{t}\Big],$ (15)

where $A_{n}^{t}$ denotes the advantage of VU $n$. The generalized advantage estimation (GAE) [18] is chosen as the advantage function:

$A_{n}^{t}=\delta_{n}^{t}+(\gamma\lambda)\delta_{n}^{t+1}+\dots+(\gamma\lambda)^{\bar{T}-1}\delta_{n}^{t+\bar{T}-1},$ (16)
$\text{where}~~\delta_{n}^{t}=R_{n}^{t}+\gamma V_{\phi'}(s^{t+1})-V_{\phi'}(s^{t}).$ (17)

$\bar{T}$ specifies the length of the given trajectory segment, $\gamma$ is the discount factor, and $\lambda$ denotes the GAE parameter. In terms of the Critic loss, equation (14) is reformulated as:

$L(\phi)=\sum_{n=1}^{N}\left(V_{\phi,n}(s^{t})-(A_{n}^{t}+V_{\phi',n}(s^{t}))\right)^{2}.$ (18)

Similar to the renowned centralized training decentralized execution (CTDE) framework [19], the per-user critics $V_{\phi,n}$ are also trained centrally with equation (18). Therefore, the training time does not scale with the number of users.
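The following is a minimal sketch of how the hybrid critic and the HRPPO losses of equations (15) and (18) could be implemented, assuming a shared trunk with one value head per VU. Layer sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridCritic(nn.Module):
    """Critic with one value head per VU, giving a per-user view of the state value."""
    def __init__(self, state_dim, n_users, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.heads = nn.Linear(hidden, n_users)   # V_{phi,1}(s), ..., V_{phi,N}(s)

    def forward(self, state):
        return self.heads(self.trunk(state))      # shape: (batch, N)

def hrppo_losses(logp_new, logp_old, adv_per_user, values, values_target, eps=0.2):
    """HRPPO losses: clipped surrogate weighted by the summed per-VU advantages (eq. (15))
    and the summed per-head critic loss (eq. (18))."""
    ratio = torch.exp(logp_new - logp_old.detach())
    adv = adv_per_user.sum(dim=-1)                 # sum_n A_n^t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean()
    critic_loss = ((values - values_target.detach()) ** 2).sum(dim=-1).mean()  # eq. (18)
    return actor_loss, critic_loss
```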

IV-B1 Baselines

We also implement some of the most renowned RL algorithms that are capable of tackling problems with a discrete action space.

  • HRDQN. We implemented the hybrid reward DQN following the structure of HRA [16].

  • PPO. The traditional PPO is used as a baseline. The sum of all users’ rewards is selected as the global reward.

  • Random. The random agent selects actions randomly, representing the system performance when no channel resource allocation strategy is applied.

IV-C Metrics

We introduce a set of metrics (apart from RL rewards) to evaluate the effectiveness of our proposed methods.

  • Successful frames. The number of successful frames among the total $T$ frames determines the frames per second (FPS) of the virtual world scenes, and hence the fluidity of the Metaverse VR experience.

  • Energy consumption. We illustrate the total energy consumption in each episode. Lower energy consumption signifies a more effective use of channel resources.

  • Average rate. The average downlink transmission rate of all VUs and frames in each episode is shown to evaluate the trained policy. A higher average rate indicates better allocation of channel resources.

Algorithm 1 HRPPO
Require: Actor $\theta$, Critic $\phi$, and target network $\phi'$
1:  for iteration $=1,2,\dots$ do
2:     The agent executes an action according to $\pi_{\theta'}(a^{t}|s^{t})$
3:     Get rewards $R_{1}^{t},R_{2}^{t},\dots,R_{N}^{t}$ and next state $s^{t+1}$
4:     $s^{t}\leftarrow s^{t+1}$
5:     Sample $(s^{t},a^{t},(R_{1}^{t},R_{2}^{t},\dots,R_{N}^{t}),s^{t+1})$ until the episode ends
6:     Compute advantages $\{A_{1}^{t},\dots,A_{N}^{t}\}$ and target values $\{V_{1,targ}^{t},\dots,V_{N,targ}^{t}\}$ using the current Hybrid Critic
7:     for $k=1,2,\dots,K$ do
8:        Shuffle the data order, set batch size $bs$
9:        for $j=0,1,\dots,\frac{T}{bs}-1$ do
10:           Compute the gradient for the Actor by eq. (15)
11:           Update the Actor by gradient ascent
12:           Update the Critic with the MSE loss in eq. (18)
13:        end for
14:        Assign the target network $\phi'\leftarrow\phi$ every $C$ steps
15:     end for
16:  end for

V Experiment results

V-A Numerical Setting

Consider a $30\times 30$ m$^{2}$ indoor space where multiple VUs are distributed uniformly across the space. We set the number of channels to 3 in each experiment configuration, and the number of VUs ranges from 5 to 8 across the different experiment configurations. The maximum resolution of one frame is 2K ($2048\times 1080$) and the minimum is 1080p ($1920\times 1080$). Each pixel is stored in 16 bits [20] and the compression factor is 150 [13]. We randomize the data size of one frame by drawing from a uniform distribution with $D_{n}^{t}\in[\frac{1920\times 1080\times 16}{150},\frac{2048\times 1080\times 16}{150}]$ bits. The refresh rate, i.e., $T$ frames per second, is taken to be 90, which is considered the best rate for VR applications [13]. The bandwidth of each channel is set to $10\times 180$ kHz. The required successful frame transmission count $\tau_{n,F}$ is uniformly selected from $[75,80]$, which is higher than the acceptable value of 60 [13]. In terms of channel gain, the small-scale fading follows the Rayleigh distribution and $\alpha=2$ is the path loss exponent. For all experiments, we use $2\times 10^{5}$ steps for training, and the evaluation interval is set to 50 training steps. As there are several random variables in our environment, all experiments are conducted under global random seeds from 0 to 10, and error bands are drawn to better illustrate model performance.
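For reference, a quick computation of the per-frame payload range and the slot duration implied by these settings (the rounding is ours):

```python
bits_per_pixel = 16
compression = 150
T = 90                                               # frames per second

d_min = 1920 * 1080 * bits_per_pixel / compression   # ~221 kbit per 1080p frame
d_max = 2048 * 1080 * bits_per_pixel / compression   # ~236 kbit per 2K frame
iota = 1 / T                                         # ~11.1 ms per-frame deadline

print(f"frame size: {d_min/1e3:.0f}-{d_max/1e3:.0f} kbit, slot: {iota*1e3:.1f} ms")
```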

Figure 4: Metrics in environments with 6 VUs and 8 VUs: (a) training reward, (b) successful frames, (c) energy consumption, and (d) average rate with 6 VUs; (e) training reward, (f) successful frames, (g) energy consumption, and (h) average rate with 8 VUs. Considering the randomly evolving environment, all experiments are conducted with global random seeds from 0 to 10, and error bands are drawn.

V-B Result analysis

We first illustrate the performance of the different models against different metrics in two experimental configurations (shown in Fig. 4): one with 6 VUs and the other with 8 VUs. We then show the overall results for each experimental configuration in Table I. Results in Table I are averaged over the final 200 steps.

The training reward, successful frame transmission counts, and average downlink transmission rate show an overall upward trend as training progresses. Against these metrics, HRPPO performs the best among the tested algorithms. In the experimental setting with 6 VUs, although PPO and HRPPO attain similar peak rewards in the later training stages, HRPPO converges in half the number of training steps that PPO needs. In the experimental setting with 8 VUs, HRPPO obtains a much higher final reward than PPO. Both HRPPO and PPO achieve higher rewards than HRDQN and perform better on every metric. The superior performance of HRPPO and PPO can be attributed to PPO's policy constraint and higher sample efficiency. However, HRDQN and PPO fail to find good solutions in the more complicated scenarios. In the 6 VU setting, both HRPPO and PPO allocate every VU to a VSP channel for computation offloading in each round, which is reflected in zero energy spent on local device computation. In the setting with 8 VUs, however, there are insufficient channel resources, and all three algorithms learn strategies that increase the transmission rate and avoid frame rate decrements by allocating some VUs to local computation, which increases energy consumption.

The complete results in Table I show that HRPPO obtains the best performance on almost every metric under every scenario. This demonstrates that decomposing the reward and using summed losses, which provides a user-centered view to the RL agent, is a good approach to tackling the multi-user computation offloading problem.

TABLE I: Overall results
Method   VU number   Reward   Successful frames   Energy consumption (J)   Average rate (Mbps)
HRPPO    5           4.66     88.44               0                        28.33
HRPPO    6           3.26     88.35               0                        27.56
HRPPO    7           -2.21    83.67               7.05                     19.90
HRPPO    8           -5.15    65.69               11.57                    15.02
PPO      5           4.41     88.02               0                        27.43
PPO      6           3.18     86.27               0                        26.62
PPO      7           -5.97    61.42               7.08                     15.01
PPO      8           -9.02    29.38               9.12                     9.92
HRDQN    5           1.13     79.38               1.05                     23.84
HRDQN    6           -4.27    63.92               4.45                     19.91
HRDQN    7           -7.03    44.19               7.88                     13.61
HRDQN    8           -9.22    25.85               7.03                     8.03

VI Conclusion

In this paper, we study multi-user VR for the Metaverse with mobile edge computing over wireless networks. Multiple users with varying requirements are considered, and a novel user-centered RL algorithm, HRPPO, is designed to tackle the problem. Extensive experimental results show that HRPPO has the quickest convergence and achieves the highest reward, which is 45% higher than that of traditional PPO. In the future, we will further optimize the power allocation to seek better solutions to the proposed problems.

Acknowledgement

This research is partly supported by Singapore Ministry of Education Academic Research Fund under Grant Tier 1 RG90/22, RG97/20, Grant Tier 1 RG24/20 and Grant Tier 2 MOE2019-T2-1-176; partly by the NTU-Wallenberg AI, Autonomous Systems and Software Program (WASP) Project.

References

  • [1] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Communications Magazine, vol. 57, no. 8, pp. 84–90, 2019.
  • [2] I. F. Akyildiz and H. Guo, “Wireless extended reality (xr): Challenges and new research directions,” ITU J. Future Evol. Technol, 2022.
  • [3] Y. Wang, Z. Su, N. Zhang, R. Xing, D. Liu, T. H. Luan, and X. Shen, “A survey on metaverse: Fundamentals, security, and privacy,” IEEE Communications Surveys & Tutorials, 2022.
  • [4] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Communications Surveys & Tutorials, 2017.
  • [5] L.-H. Lee, T. Braud, P. Zhou, L. Wang, D. Xu, Z. Lin, A. Kumar, C. Bermejo, and P. Hui, “All one needs to know about Metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda,” arXiv preprint arXiv:2110.05352, 2021.
  • [6] M. Chen, W. Saad, and C. Yin, “Virtual reality over wireless networks: Quality-of-service model and learning-based resource management,” IEEE Transactions on Communications, 2018.
  • [7] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges,” IEEE Transactions on Network and Service Management, 2019.
  • [8] Y. Lu, X. Huang, K. Zhang, S. Maharjan, and Y. Zhang, “Blockchain empowered asynchronous federated learning for secure data sharing in internet of vehicles,” IEEE Transactions on Vehicular Technology, 2020.
  • [9] Z. Meng, C. She, G. Zhao, and D. De Martini, “Sampling, communication, and prediction co-design for synchronizing the real-world device and digital model in metaverse,” IEEE Journal on Selected Areas in Communications, 2022.
  • [10] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
  • [11] L. Dai, B. Wang, Z. Ding, Z. Wang, S. Chen, and L. Hanzo, “A survey of non-orthogonal multiple access for 5G,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2294–2323, 2018.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint:1707.06347, 2017.
  • [13] E. Bastug, M. Bennis, M. Médard, and M. Debbah, “Toward interconnected virtual reality: Opportunities, challenges, and enablers,” IEEE Communications Magazine, vol. 55, no. 6, pp. 110–117, 2017.
  • [14] C. You, Y. Zeng, R. Zhang, and K. Huang, “Asynchronous mobile-edge computation offloading: Energy-efficient resource management,” IEEE Transactions on Wireless Communications, 2018.
  • [15] Y. Wen, W. Zhang, and H. Luo, “Energy-optimal mobile application execution: Taming resource-poor mobile devices with cloud clones,” in 2012 proceedings IEEE Infocom.   IEEE, 2012, pp. 2716–2720.
  • [16] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [17] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [18] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [19] P. K. Sharma, R. Fernandez, E. Zaroukian, M. Dorothy, A. Basak, and D. E. Asher, “Survey of recent multi-agent reinforcement learning algorithms utilizing centralized training,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III.   SPIE.
  • [20] M. Quest, “Mobile virtual reality media overview.” [Online]. Available: https://developer.oculus.com/documentation/mobilesdk/latest/concepts/mobile-media-overview/