
Joint Power Allocation and Rate Control for Rate Splitting Multiple Access Networks with Covert Communications

Nguyen Quang Hieu, Dinh Thai Hoang, Dusit Niyato, Diep N. Nguyen,
Dong In Kim, and Abbas Jamalipour
N. Q. Hieu, D. T. Hoang, and D. N. Nguyen are with the School of Electrical and Data Engineering, University of Technology Sydney, Sydney, NSW 2007, Australia (email: hieu.nguyen-1@student.uts.edu.au, {hoang.dinh, diep.nguyen}@uts.edu.au). D. Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: dniyato@ntu.edu.sg). D. I. Kim is with the Department of Electrical and Computer Engineering, Sungkyunkwan University (SKKU), Suwon 16419, South Korea (e-mail: dikim@skku.ac.kr). A. Jamalipour is with the School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW 2006, Australia (e-mail: a.jamalipour@ieee.org).
Abstract

Rate Splitting Multiple Access (RSMA) has recently emerged as a promising technique to enhance the transmission rate of multiple access networks. Unlike conventional multiple access schemes, RSMA requires splitting and transmitting messages at different rates. The joint optimization of power allocation and rate control at the transmitter is challenging given the uncertainty and dynamics of the environment. Furthermore, securing transmissions in RSMA networks is a crucial problem because the transmitted messages can be easily exposed to adversaries. This work first proposes a stochastic optimization framework that allows the transmitter to adaptively adjust its power and the transmission rates allocated to users, thereby maximizing the sum-rate and fairness of the system in the presence of an adversary. We then develop a highly effective learning algorithm that can help the transmitter find the optimal policy without requiring complete information about the environment in advance. Extensive simulations show that our proposed scheme can achieve positive covert transmission rates in the finite blocklength regime and non-saturating rates at high SNR values. More significantly, our achievable covert rate can be increased at high SNR values (i.e., 20 dB to 40 dB), compared with the saturating rates of a conventional multiple access scheme.

Index Terms:
Rate splitting multiple access, covert communications, deep reinforcement learning, power allocation, rate control.

I Introduction

Recent years have witnessed growing interest in sixth-generation (6G) networks from both academia and industry. It is envisioned that 6G will enable the Internet-of-Things (IoT), in which a massive number of devices can communicate via wireless environments. To accommodate such a growing demand for connections in future wireless networks, a modern multiple access scheme with high efficiency and flexibility is urgently needed. Rate Splitting Multiple Access (RSMA) has recently emerged as a novel communication technique that can flexibly and efficiently manage interference and thus increase the overall performance of downlink wireless systems [2, 1]. In RSMA, each message at the transmitter is first split into two parts, i.e., a common part and a private part. The common parts of all the messages are combined into a single common message. The common message is then encoded using a shared codebook. The private messages are independently encoded for the respective users. After receiving these messages, each user decodes the common and private messages with Successive Interference Cancellation (SIC) to obtain its original message. By partially decoding interference and partially treating interference as noise, RSMA can enhance the spectral efficiency, energy efficiency, and security of multiple access systems [4, 3, 6, 7, 5, 8]. Thanks to these outstanding features, RSMA can tackle many emerging problems in 6G and has gained enormous attention from both industry and academia [1].

I-A Challenges of RSMA

Despite these advantages, RSMA faces two main challenges. First, unlike conventional multiple access schemes, e.g., Spatial Division Multiple Access (SDMA), which treats interference as noise, or Non-Orthogonal Multiple Access (NOMA), which successively removes multi-user interference during the decoding process, RSMA partially decodes the messages and partially treats multi-user interference as noise. The RSMA transmitter hence needs to jointly optimize the power allocation and rate control for different messages to maximize the energy and spectral efficiency of the whole system. Aiming to address such problems, several research works have been proposed in the literature. In [9], the authors consider a rate splitting approach with heterogeneous Channel State Information at the Transmitter (CSIT). Two groups of CSIT qualities are considered, and the transmitter is assumed to have either partial CSIT or no CSIT. For the no-CSIT scenario, the users decode their messages by treating interference as noise without using a rate splitting strategy. In contrast, for the partial-CSIT scenario, a rate splitting strategy is applied and a fraction of the total power is allocated equally among private symbols and common symbols. Simulation results show that, with the proposed power allocation scheme, the sum-rate of users in the group with the rate splitting strategy gains significant improvement compared to those in the group without any rate splitting strategy. Similarly, in [7], the authors study a rate splitting strategy for multiple groups of users in a large-scale RSMA system. A more complex precoding design for the power and rate allocated to users is proposed to find the maximum sum-rate of the system. The simulation results reveal that a precoding design with rate splitting can benefit the sum-rate of the system. However, the authors in [7] only consider a perfect CSIT scenario in which the BS is assumed to know the exact channel information and the channel is assumed to be fixed. To relax these assumptions, the authors in [6] and [10] consider a similar system, but under imperfect CSIT. In this case, a stochastic optimization formulation is proposed to deal with the uncertainty of the channel under imperfect CSIT. In [10], the authors show that, with rate splitting under imperfect CSIT, the sum-rate of the system achieves non-saturating rates at high SNR values compared to the saturating rates of a conventional scheme, i.e., SDMA. In [6], the authors further investigate the impacts of different error models on the system performance. Numerical results reveal that, in addition to the expected sum-rate gains, the benefits of rate splitting also include relaxed CSIT quality requirements and enhanced achievable rate regions compared with a conventional transmission scheme. A comprehensive analysis of RSMA performance is presented in [4]. Through extensive simulations and performance analysis, the authors show that rate splitting techniques are able to softly bridge the two extremes of fully treating interference as noise and fully decoding interference. Thus, in comparison with conventional multiple access approaches, e.g., SDMA and NOMA, RSMA can gain significant rate enhancement. Although the aforementioned works propose solutions to improve the performance of RSMA networks, the channel state distribution (or channel state matrix) is always assumed to be known by the transmitter.
In addition, the optimization methods in these works also introduce additional variables, e.g., equalizers and weights, that are highly correlated with the channel state. Thus, the unavailability of, or drastic changes in, the channel state information, e.g., due to the mobility of users or link failures [12], can result in significant performance degradation of these algorithms. Therefore, a more flexible framework that not only deals with the dynamics of the environment but also efficiently manages power allocation and rate control for RSMA, without requiring complete or partial information, e.g., posterior distributions, of the channel state, is urgently needed.

The second challenge that RSMA is facing is security. Although a new data rate region can be achieved with RSMA, the investigation of RSMA's security is still in its early stage. Several works, such as [3, 14] and [13], have been proposed to address eavesdropping issues in RSMA networks. In particular, the authors in [3] propose a cooperative RSMA scheme to enhance the secrecy sum-rate of the system, in which the common messages can be used not only as desired messages but also as artificial noise. In this case, it is shown that the proposed cooperative secure rate splitting scheme outperforms conventional SDMA and NOMA schemes in terms of secrecy sum-rate. However, [3] only considers the perfect CSIT scenario in which the transmitter has complete information of the channel state. To address the challenges caused by imperfect CSIT, the authors in [14] investigate the impacts of imperfect CSIT on the secrecy rate of the system. Specifically, to deal with imperfect CSIT, a worst-case uncertainty channel model is taken into consideration with the goal of simultaneously mitigating inter-user interference and maximizing the secrecy sum-rate. The simulation results show the robustness of the proposed solution against the imperfect CSIT of RSMA, and secure transmission is also guaranteed. Furthermore, in comparison with NOMA, RSMA shows significant secrecy rate enhancement. In [13], a secure beamforming design is also proposed to maximize the weighted sum-rate under users' secrecy rate requirements. Unlike [3] and [14], the authors in [13] consider the presence of an internal eavesdropper, i.e., an illegitimate user, that not only receives its own messages but also wiretaps messages intended for other legitimate users. To deal with this internal eavesdropper, all users' secrecy rate constraints are taken into consideration. The simulation results suggest that RSMA can outperform the baseline scheme in terms of weighted sum-rate.

All of the above works, and others in the literature, only focus on dealing with passive eavesdroppers, i.e., eavesdroppers that passively listen to the communication channel to derive the original message. To deal with such passive eavesdroppers, the transmitter can adaptively select different transmission rates [13] or utilize artificial noise [3] to confuse the eavesdropper, thereby minimizing information disclosure. However, in these cases, the eavesdropper can still detect and receive the signals from the transmitter, and thus it can still decode the information if it has more powerful computational hardware, e.g., through cooperative processing with other eavesdroppers, or better antenna gains than the transmitter [16, 17]. The passive eavesdropper scenario also cannot address the problem in which the eavesdropper is able to manipulate or control the environment [16]. For example, by manipulating the environment, the adversary can bias the resulting bits in the key establishment process [18]. In applications requiring a high level of security protection, e.g., military and IoT healthcare applications, leaking even a small amount of data can compromise the whole system and/or harm the users. Therefore, in this work, to further prevent potential information leakage, we focus on a more challenging adversary model in which we need to control the power and transmission rates allocated to users so that the adversary is unable to detect transmissions on the channels. In this way, the possibility of leaking information can be minimized.

I-B Contributions and Organization

In this work, we aim to develop a novel framework that addresses the aforementioned problems. In particular, we consider a scenario in which an adversary, i.e., a warden, is present in the communication range of a Base Station (BS), i.e., the transmitter, and multiple mobile users, i.e., the receivers. The warden is assumed to be able to constantly observe the channel with a radiometer and interrupt the channel if it detects transmissions on the channel [15, 21]. Thus, it is challenging for the BS to jointly allocate power and control transmission rates for all the messages while hiding these messages from the warden. To minimize the probability of being detected by the warden, a possible policy for the BS is to decrease the power allocated to the messages so that the warden may confuse the transmitted signals with noise. However, an inappropriate implementation can result in zero data rates at the receivers [15]. To maximize the transmission rate of the system and at the same time guarantee a non-zero rate at each user, we formulate the problem of the BS as a max-min fairness problem. In particular, the BS aims to maximize the expected minimum rate (min-rate) of the system under the uncertainty of the environment. Furthermore, we consider a covert constraint that is derived from the theory of covert communications (i.e., low probability of detection communications) [20, 22, 23]. In this way, our proposed framework can help the BS secretly communicate with the legitimate mobile users with a small probability of being detected by the warden.

To find the optimal solution of the optimization formulation above, we develop a learning algorithm based on Proximal Policy Optimization (PPO) [28]. By leveraging recent advances in deep reinforcement learning techniques [25, 26, 27], our proposed algorithm can effectively find the optimal policy for the BS without requiring complete information about the channel in advance. Specifically, our proposed algorithm takes the feedback from users via the uplink as inputs of the deep neural networks and then outputs the corresponding joint power allocation and rate control policy for the BS. In terms of implementation and resource requirements, this procedure is similar to conventional optimization-based schemes. Our proposed algorithm differs in two ways. First, it is a model-free deep reinforcement learning algorithm which does not require a complete model of the environment, i.e., the channel state matrix or channel distribution, in advance. Second, the policy obtained by our proposed algorithm can be adaptively adjusted when the channel dynamics change over time, e.g., in time-varying channels. Thus, our proposed algorithm is expected to show more flexibility and robustness against the uncertainty and dynamics of the wireless environment. With the obtained optimal policy, we then show that our proposed method can also achieve covert communications between the BS and mobile users by dynamically adjusting the power and transmission rates allocated to the messages.

Here, we note that our previous work presented in [24] only focuses on the power allocation problem without considering the security impact on the RSMA system. Furthermore, in this current work, we further investigate the rate and security performance of RSMA in the finite blocklength (FBL) regime, where the achievable covert rate is limited and no longer follows the Shannon capacity (i.e., the infinite blocklength regime). Thus, our framework is applicable to a wide range of applications, including transmissions between IoT devices where the transmitted data are expected to be sporadic and the number of channel uses is limited. In short, our main contributions are as follows:

  • We develop a novel stochastic optimization framework to achieve max-min fairness for the considered RSMA network under the covert constraint. This framework enables the BS to make optimal decisions to maximize the expected min-rate under the covert requirement as well as the dynamics and uncertainty of the surrounding environment. To the best of our knowledge, this is the first work that considers covert communications for RSMA. Therefore, our proposed framework is a promising solution for applications requiring secure, reliable, and high-rate data transmission.

  • We propose a highly effective learning algorithm to make the best decisions for the BS. This learning algorithm enables the BS to quickly find the optimal policy through feedback from the users by leveraging the advantages of both deep learning and reinforcement learning techniques. Furthermore, our proposed learning algorithm can effectively handle the continuous action and state spaces of the BS by using the PPO technique.

  • We conduct intensive simulations to evaluate the efficiency of the proposed framework and reveal insightful information. Specifically, the simulation results show that a positive covert rate is achievable with RSMA in the finite blocklength regime, where the achievable covert rate is limited and no longer follows the Shannon capacity (i.e., the infinite blocklength regime). More interestingly, at high values of transmission power (i.e., 20 dB to 40 dB), our achievable covert rate can be increased while the achievable covert rate of a baseline multiple access scheme, i.e., SDMA, is saturated. Thus, beyond conventional wireless networks, our framework is applicable to IoT networks in which the data transmitted between devices are expected to be sporadic and carry a relatively small quantity of information.

The rest of our paper is organized as follows. Our system model is described in Section II. Then, we formulate the stochastic optimization problem for the covert-aided RSMA networks in Section III. We provide details of our proposed learning algorithm to maximize the covert rate of the system in Section IV. After that, our simulation results are discussed in Section V, and Section VI concludes the paper.

II System Model

Figure 1: Covert-aided RSMA system model.

We consider a system that consists of one $M$-antenna BS, a set $\mathcal{K}=\{1,2,\ldots,K\}$ of $K$ single-antenna legitimate users ($M\geq K$), and a warden, as illustrated in Fig. 1. The warden has the ability to interrupt the channel if it detects any transmission from the BS. The BS wants to transmit information to the users with a minimum probability of being detected by the warden [19]. The BS has a set of messages $\mathbf{W}=\{W_{1},\ldots,W_{k},\ldots,W_{K}\}$ to be transmitted to the users. The message intended for user $u_{k}$, denoted as $W_{k}$, is split into a common part $W_{k}^{c}$ and a private part $W_{k}^{p}$ ($\forall k\in\mathcal{K}$), with lengths $l_{k}^{c}$ and $l_{k}^{p}$, respectively. The common parts of all $K$ messages are combined into a single common message $W^{c}$. The common message $W^{c}$ and the $K$ private messages $W_{k}^{p}$ are independently encoded into streams $s_{c}$ (i.e., the common stream) and $s_{1},s_{2},\ldots,s_{K}$ (i.e., the private streams), respectively. The transmitted signal of the BS is thus defined as follows:

\mathbf{x}=\mathbf{p}_{c}s_{c}+\sum_{k=1}^{K}\mathbf{p}_{k}s_{k}, \quad (1)

where $\mathbf{p}_{c}$ and $\mathbf{p}_{k}$ are the beamforming vectors for the common stream $s_{c}$ and the private stream $s_{k}$, respectively. Let $\mathbf{h}_{k}\in\mathbb{C}^{M\times 1}$ denote the estimated channel between the BS and user $u_{k}$, and $\mathbf{h}_{w}\in\mathbb{C}^{M\times 1}$ denote the channel between the BS and the warden. The received signal at user $u_{k}$ is calculated as follows:

y_{k}=\mathbf{h}_{k}^{H}\mathbf{x}+n_{k}, \quad (2)

where $n_{k}\sim\mathcal{CN}(0,\sigma_{n,k}^{2})$ is the Additive White Gaussian Noise (AWGN) at the receiver. Note that the estimated channel at the BS is obtained from feedback of the users, which contains estimation errors, i.e., $\mathbf{h}_{k}=\mathbf{\hat{h}}_{k}+\mathbf{\tilde{h}}_{k}$, where $\mathbf{\hat{h}}_{k}$ is the actual channel state and $\mathbf{\tilde{h}}_{k}$ is the estimation error. The SINRs of the common and private messages, denoted as $\gamma_{k}^{c}$ and $\gamma_{k}^{p}$, respectively, can be calculated as follows:

\gamma_{k}^{c}(\mathbf{P})=\frac{|\mathbf{h}_{k}^{H}\mathbf{p}_{c}|^{2}}{\sum_{j=1}^{K}|\mathbf{h}_{k}^{H}\mathbf{p}_{j}|^{2}+1}, \qquad
\gamma_{k}^{p}(\mathbf{P})=\frac{|\mathbf{h}_{k}^{H}\mathbf{p}_{k}|^{2}}{\sum_{j\neq k}|\mathbf{h}_{k}^{H}\mathbf{p}_{j}|^{2}+1}, \quad (3)

where $\mathbf{P}=[\mathbf{p}_{c},\mathbf{p}_{1},\ldots,\mathbf{p}_{K}]$ is the beamformer of the BS. The transmission power of the BS is constrained by $\text{tr}(\mathbf{P}\mathbf{P}^{H})\leq P_{t}$.
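As a concrete illustration of (3), the following Python sketch computes the common and private SINRs for a given beamformer; the array shapes, the NumPy implementation, and the unit noise variance (matching the normalization in (3)) are assumptions made for this example only.

import numpy as np

def sinrs(H, P):
    """Common and private SINRs in (3).

    H: (K, M) complex array whose k-th row is h_k^H (user k's channel, conjugate-transposed).
    P: (M, K+1) complex array whose first column is p_c and whose column k is p_k.
    Noise power is normalized to 1, matching the denominators in (3).
    """
    K = H.shape[0]
    gamma_c = np.zeros(K)
    gamma_p = np.zeros(K)
    for k in range(K):
        powers = np.abs(H[k] @ P) ** 2        # |h_k^H p_c|^2, |h_k^H p_1|^2, ..., |h_k^H p_K|^2
        private = powers[1:]                  # received powers of the K private streams
        gamma_c[k] = powers[0] / (private.sum() + 1.0)
        gamma_p[k] = private[k] / (private.sum() - private[k] + 1.0)
    return gamma_c, gamma_p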

In the FBL regime, the achievable covert rate of the private message $W_{k}^{p}$ at user $u_{k}$ is calculated as follows [29, 30, 31]:

R_{k}^{p}(\mathbf{P},l_{k}^{p})\approx\log_{2}(1+\gamma_{k}^{p})-\sqrt{\frac{\gamma_{k}^{p}(\gamma_{k}^{p}+2)}{l_{k}^{p}(\gamma_{k}^{p}+1)^{2}}}\frac{Q^{-1}(\delta_{k})}{\ln 2}+\frac{\log_{2}l_{k}^{p}}{2l_{k}^{p}}, \quad (4)

where $l_{k}^{p}$ is the length of message $W_{k}^{p}$, $\delta_{k}$ is the decoding error probability at user $u_{k}$, and $Q^{-1}(\cdot)$ is the inverse of the Q-function $Q(x)=\frac{1}{\sqrt{2\pi}}\int_{x}^{\infty}\exp(-\frac{t^{2}}{2})dt$ [29]. To guarantee that the common message $W^{c}$ can be correctly decoded by all the users, the achievable covert rate of the common message is calculated by [29, 30]:

R_{k}^{c}(\mathbf{P},l_{k}^{c})\approx\min_{k\in\mathcal{K}}\Big(\log_{2}(1+\gamma_{k}^{c})-\sqrt{\frac{\gamma_{k}^{c}(\gamma_{k}^{c}+2)}{l_{k}^{c}(\gamma_{k}^{c}+1)^{2}}}\frac{Q^{-1}(\delta_{k})}{\ln 2}+\frac{\log_{2}l_{k}^{c}}{2l_{k}^{c}}\Big). \quad (5)

Since $R_{k}^{c}$ in (5) is defined through a minimum over all users, it is identical for every $k$ and represents the common rate $R_{c}$ shared among the users; $C_{k}$ denotes user $u_{k}$'s portion of $R_{c}$, with $\sum_{k=1}^{K}C_{k}\leq R_{c}$. The total achievable rate of user $u_{k}$ is then defined by [4]:

R_{k}^{tot}=C_{k}+R_{k}^{p}. \quad (6)

As a result, the covert sum-rate of the BS is defined as the sum of $R_{k}^{tot}$ over the $K$ users, i.e., $R_{s}=\sum_{k=1}^{K}R_{k}^{tot}$.
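To make the rate expressions concrete, the sketch below evaluates (4)-(6) numerically; it reuses the SINR helper above, and the use of SciPy's normal inverse survival function for $Q^{-1}(\cdot)$ is an implementation assumption.

import numpy as np
from scipy.stats import norm

def fbl_rate(gamma, l, delta):
    """Finite-blocklength rate approximation used in (4) and (5) for SINR gamma,
    blocklength l (channel uses), and decoding error probability delta."""
    dispersion = gamma * (gamma + 2.0) / (l * (gamma + 1.0) ** 2)
    q_inv = norm.isf(delta)                       # Q^{-1}(delta)
    return (np.log2(1.0 + gamma)
            - np.sqrt(dispersion) * q_inv / np.log(2.0)
            + np.log2(l) / (2.0 * l))

def user_rates(gamma_c, gamma_p, l_c, l_p, delta, C):
    """Common rate (5), per-user totals (6), and covert sum-rate R_s."""
    R_c = min(fbl_rate(g, l, d) for g, l, d in zip(gamma_c, l_c, delta))
    R_p = np.array([fbl_rate(g, l, d) for g, l, d in zip(gamma_p, l_p, delta)])
    R_tot = np.asarray(C) + R_p                   # assumes sum(C) <= R_c is enforced elsewhere
    return R_c, R_tot, float(R_tot.sum())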

With the presence of noise, the warden needs to make a binary decision, i.e., (i) the BS is transmitting or (ii) the BS is not transmitting, based on its observations [19]. For this, the warden distinguishes two hypotheses $\mathcal{H}_{0}$ and $\mathcal{H}_{1}$, where $\mathcal{H}_{0}$ denotes the null hypothesis, i.e., the BS is not transmitting, and $\mathcal{H}_{1}$ denotes the alternative hypothesis, i.e., the BS is transmitting. In particular, the two hypotheses are defined as follows:

\begin{cases}\mathcal{H}_{0}\ :&\mathbf{y}_{w}=\mathbf{z}_{w}\\ \mathcal{H}_{1}\ :&\mathbf{y}_{w}=\mathbf{h}_{w}^{H}\mathbf{x}+\mathbf{z}_{w},\end{cases} \quad (7)

where $\mathbf{y}_{w}$ and $\mathbf{z}_{w}$ are the received signal and the noise signal at the warden, respectively, and $\mathbf{x}$ is the transmitted signal from the BS. It is noted that the warden does not know the codebook of the transmitted signals, and the hypothesis test of the warden can be performed as follows. First, the warden collects a vector of independent readings $\mathbf{y}_{w}$ from its channel to the BS. Then the warden computes a test statistic on the collected vector. The goal of the warden is to minimize its detection error rate, which is given by:

\xi=P_{F}+P_{M}, \quad (8)

where $P_{F}=\text{Pr}(\mathcal{D}_{1}|\mathcal{H}_{0})$ is the false alarm probability and $P_{M}=\text{Pr}(\mathcal{D}_{0}|\mathcal{H}_{1})$ is the miss detection probability. $\mathcal{D}_{1}$ and $\mathcal{D}_{0}$ are the binary decisions of the warden that infer whether the BS is transmitting or not, respectively. We can derive the lower bound of $\xi$ as follows [30]:

\xi\geq 1-\sqrt{\frac{1}{2}\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})}, \quad (9)

where $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$ are the probability distributions of the warden's observations when the BS does not transmit (i.e., $\mathcal{H}_{0}$ is true) and when the BS transmits (i.e., $\mathcal{H}_{1}$ is true), respectively, and $\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})$ is the relative entropy between the two probability distributions $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$.

Proposition 1

The relative entropy (Kullback–Leibler divergence) between the two probability distributions $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$, denoted by $\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})$, can be calculated as follows:

\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})=\ln\left(\sqrt{g_{w}P_{t}+\sigma_{w}^{2}}\right)-\ln\left(\sigma_{w}\right)+\frac{\sigma_{w}^{2}}{2\left(g_{w}P_{t}+\sigma_{w}^{2}\right)}-\frac{1}{2}. \quad (10)

The proof of Proposition 1 can be found in Appendix A.

In covert communications, we normally have $\xi\geq 1-\epsilon$ as the covertness requirement, where $\epsilon$ is an arbitrarily small value [30]. Following (9), in this paper, we adopt $\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})\leq 2\epsilon^{2}$ as the covertness requirement.
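For intuition, the following sketch evaluates the closed form (10) and the covertness test $\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})\leq 2\epsilon^{2}$; the function names and the NumPy implementation are illustrative assumptions.

import numpy as np

def kl_h0_h1(g_w, P_t, sigma_w2):
    """Closed-form relative entropy D(P0||P1) in (10)."""
    b2 = g_w * P_t + sigma_w2                      # variance of the warden's observation under H1
    return 0.5 * np.log(b2 / sigma_w2) + sigma_w2 / (2.0 * b2) - 0.5

def satisfies_covertness(g_w, P_t, sigma_w2, eps):
    """Covertness requirement D(P0||P1) <= 2*eps^2 adopted above."""
    return kl_h0_h1(g_w, P_t, sigma_w2) <= 2.0 * eps ** 2

Because the divergence in (10) grows with $g_{w}P_{t}$, this constraint effectively caps the total radiated power that a feasible policy may use.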

In covert-aided systems, the achievable data rate is usually small or asymptotically approaches zero [19, 30]. To achieve the maximum rate of the system and a non-zero data rate for each user, we consider a max-min fairness problem in which the optimization problem is formulated as maximizing the expected minimum data rate (min-rate) among users [7]. Let $\mathbf{L}=[l_{1}^{c},l_{1}^{p},\ldots,l_{K}^{c},l_{K}^{p}]$ denote the message-splitting vector and $\mathbf{C}=[C_{1},C_{2},\ldots,C_{K}]$ denote the common rates allocated for the common message. The stochastic optimization problem of the BS is then formulated as follows:

\max_{\mathbf{P},\mathbf{L},\mathbf{C}}\quad\min_{k\in\mathcal{K}}\bar{R}_{k}^{tot} \quad (11a)
\text{s.t.}\quad\text{tr}(\mathbf{P}\mathbf{P}^{H})\leq P_{t}, \quad (11b)
\qquad\ l_{k}^{c}+l_{k}^{p}=L_{k},\ \forall k\in\mathcal{K}, \quad (11c)
\qquad\ \sum_{k\in\mathcal{K}}C_{k}\leq R_{c}, \quad (11d)
\qquad\ \mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})\leq 2\epsilon^{2}, \quad (11e)
\qquad\ R_{k}^{tot}\geq R_{k}^{0},\ \forall k\in\mathcal{K}, \quad (11f)

where $\bar{R}_{k}^{tot}=\mathbb{E}_{\mathbf{h}_{k}\in\mathbf{H}}\{R_{k}^{tot}\}$ is the average rate of user $u_{k}$, with $\mathbf{H}=\{\mathbf{h}_{k}|\mathbf{h}_{k}=\mathbf{\hat{h}}_{k}+\mathbf{\tilde{h}}_{k};\forall k\in\mathcal{K}\}$ the channel matrix of the system, $R_{k}^{0}$ is the minimum rate requirement (QoS) of user $u_{k}$, and $L_{k}$ is the length of the message $W_{k}$. The problem in (11) can be described as follows. (11b) and (11c) are the power constraint and the packet length constraint of the BS, respectively. (11d) is the common rate constraint. (11e) and (11f) are the covert constraint and the QoS constraint, respectively. Optimizing (11) is very challenging under the dynamics and uncertainty of the communication channel, i.e., the channel gain $\mathbf{h}_{k}$ between the BS and user $u_{k}$ changes over time, and the channel state is unknown to the BS. In this paper, we thus propose a deep reinforcement learning approach to obtain the optimal policy for the BS under the dynamics and uncertainty of the environment. It is noted that we only use the channel matrix $\mathbf{H}$ in the optimization problem above to illustrate the stochastic nature of the system. In the next section, we show that the optimization problem (11) can be transformed into maximizing an expected discounted reward in the deep reinforcement learning setting without requiring any information from the channel matrix $\mathbf{H}$. Details of the notations used in this paper are summarized in Table I.

TABLE I: Summary of notations.
Variable Definition
$K$ Number of users
$M$ Number of antennas at the BS
$W_{k}$ Message intended to be transmitted to user $u_{k}$
$W_{k}^{c},W_{k}^{p}$ Common and private parts split from $W_{k}$
$L_{k}$ Length of $W_{k}$ (bits)
$l_{k}^{c},l_{k}^{p}$ Lengths of $W_{k}^{c}$ and $W_{k}^{p}$
$P_{t}$ Transmission power of the BS
$\mathbf{P}$ Transmission beamformer of the BS
$\mathbf{L}$ Vector of messages' lengths at the BS
$\mathbf{C}$ Common rate vector allocated to users
$\gamma_{k}^{c}(\mathbf{P}),\gamma_{k}^{p}(\mathbf{P})$ SINRs of the common and private messages at user $u_{k}$
$\mathbf{h}_{k}$ Channel between the BS and user $u_{k}$
$\mathbf{h}_{w}$ Channel between the BS and the warden
$R_{k}^{c}(\mathbf{P},l_{k}^{c}),R_{k}^{p}(\mathbf{P},l_{k}^{p})$ Achievable covert rates of $W_{k}^{c}$ and $W_{k}^{p}$
$R_{k}^{tot}$ Achievable rate of user $u_{k}$
$R_{s}(\mathbf{P},L_{k})$ Achievable (covert) sum-rate
$\epsilon$ Covert requirement
$\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})$ Relative entropy between two probability distributions $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$
$R_{k}^{0}$ Minimum rate requirement of user $u_{k}$
$\mathcal{S},\mathcal{A}$ State space and action space of the BS
$s_{t},a_{t},r_{t}(s_{t},a_{t})$ State, action, and reward of the BS at time step $t$
$p_{t}$ Penalty of the BS for taking action $a_{t}$
$\Omega_{\theta},\theta$ Policy and policy parameter vector of the BS

III Problem Formulation

III-A Deep Reinforcement Learning

Before introducing our mathematical formulation, we first describe the fundamentals of DRL. In conventional reinforcement learning (RL) settings, an agent aims to learn an optimal policy through interacting with an environment in discrete decision time steps. At each time step $t$, the agent first observes its current state $s_{t}$ in the state space $\mathcal{S}$ of the system. Based on the observed state $s_{t}$ and the current policy $\Omega$, the agent takes an action $a_{t}$ in the action space $\mathcal{A}$. The policy $\Omega$ can be a mapping function from a state to an action (deterministic) or a probability distribution over actions (stochastic). After taking the action $a_{t}$, the agent transits to a new state $s_{t+1}$ and observes an immediate reward $r_{t}$. The goal of the agent is to find an optimal policy that can be obtained by maximizing a discounted cumulative reward.

In conventional RL settings, the agent usually deals with a policy search problem in which the convergence time of the RL algorithm depends on the sizes of the search spaces $\mathcal{S}$ and $\mathcal{A}$. In environments with a large discrete state-action space or a continuous state-action space, the optimal policy is either nearly impossible or very time-consuming to find. To address this problem, RL algorithms combined with deep neural networks, namely DRL, show significant performance improvements over conventional RL algorithms [33]. In DRL algorithms, the policy $\Omega$ is defined by a probability distribution over actions, i.e., $\Omega_{\theta}=\text{Pr}\{a_{t}|s_{t};\theta\}$, where $\theta$ is the parameter vector of the deep neural network. The policy $\Omega_{\theta}$ can be trained by action-value methods, e.g., DQN [33], or policy gradient methods, e.g., PPO [28]. Action-value methods and policy gradient methods have their own advantages and drawbacks, which we will discuss later in Section III-C. In the following, we formulate our considered problem in the DRL setting by defining the state space, action space, and immediate reward function, in which the BS is empowered by an intelligent DRL agent.

III-B DRL-based Optimization Framework

We introduce the proposed DRL-based optimization framework for the joint power allocation and transmission rate control problem of the BS as follows. The state space of the BS is defined by:

\mathcal{S}=\Big\{\{\mathbf{h}_{k},L_{k}\};1\leq k\leq K\Big\}, \quad (12)

where $\mathbf{h}_{k}$ is the channel state feedback of user $u_{k}$ to the BS and $L_{k}$ is the length of the message $W_{k}$ intended for user $u_{k}$. Note that the channel state feedback from the users contains estimation errors, i.e., $\mathbf{h}_{k}=\mathbf{\hat{h}}_{k}+\mathbf{\tilde{h}}_{k}$, where $\mathbf{\hat{h}}_{k}$ is the actual channel state and $\mathbf{\tilde{h}}_{k}$ is the estimation error. The channel of user $u_{k}$ is realized as

\hat{\mathbf{h}}_{k}=g_{k}\times[1,e^{j\phi_{k}},e^{j2\phi_{k}},e^{j3\phi_{k}}], \quad (13)

where $g_{k}\in\mathbb{R}$ and $\phi_{k}\in\mathbb{R}$ are control variables [4]. The channel estimation error follows a complex Gaussian distribution, i.e., $\tilde{\mathbf{h}}_{k}\sim\mathcal{CN}(0,\sigma_{k}^{2})$, where $\sigma_{k}^{2}$ is inversely proportional to the transmission power at the BS, i.e., $\sigma_{k}^{2}=g_{k}P_{t}^{-\alpha_{k}}$, with $\alpha_{k}$ the degree of freedom (DoF) variable [4]. Note that the channel $\mathbf{h}_{w}$ of the warden is unknown to the BS, and thus $\mathbf{h}_{w}$ is not included in the state space of the BS. We define the channel between the warden and the BS as follows:

\mathbf{h}_{w}=g_{w}\times[1,e^{j\phi_{w}},e^{j2\phi_{w}},e^{j3\phi_{w}}]. \quad (14)
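As an illustration of (13)-(14), a minimal NumPy sketch of how these channels could be realized is given below; the antenna-count argument (the vectors in (13)-(14) show the four-antenna case) and the circularly-symmetric construction of the estimation error are assumptions of this example.

import numpy as np

def user_channel(g_k, phi_k, P_t, alpha_k, M=4, rng=None):
    """Noisy channel estimate h_k = h_hat_k + h_tilde_k, with h_hat_k as in (13)
    and estimation error variance sigma_k^2 = g_k * P_t^(-alpha_k)."""
    rng = rng or np.random.default_rng()
    h_hat = g_k * np.exp(1j * phi_k * np.arange(M))           # actual channel, as in (13)
    sigma2 = g_k * P_t ** (-alpha_k)                           # estimation error variance
    h_tilde = np.sqrt(sigma2 / 2.0) * (rng.standard_normal(M)
                                       + 1j * rng.standard_normal(M))
    return h_hat + h_tilde

def warden_channel(g_w, phi_w, M=4):
    """Warden channel as in (14); it is not observable by the BS."""
    return g_w * np.exp(1j * phi_w * np.arange(M))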

At each time step $t$, the BS allocates the transmission power to the users, splits the messages into common and private messages, and controls the transmission rates of the messages. Thus, the action space of the BS is defined as follows:

\mathcal{A}=\{\mathbf{P},\mathbf{L},\mathbf{C}\}. \quad (15)

The reward function is designed to maximize the min-rate of the BS as in (11). To encourage the BS to optimize the min-rate while the covert and QoS constraints of users are taken into consideration, we penalize the BS for each violated constraint. For this, the immediate reward can be defined as follows:

r_{t}(s_{t},a_{t})=\begin{cases}\min_{k\in\mathcal{K}}R_{k}^{tot},&\text{if}\ p_{t}=0,\\ 0,&\text{if}\ p_{t}>0,\end{cases} \quad (16)

where $p_{t}$ is the penalty imposed on the BS when action $a_{t}$ does not satisfy the covert constraint and the QoS constraint in (11). The penalty $p_{t}$ is defined as follows:

p_{t}=\frac{\beta_{0}\mathbf{1}\left(\mathcal{D}\left(\mathbb{P}_{0}||\mathbb{P}_{1}\right)-2\epsilon^{2}\right)+\sum_{k=1}^{K}\beta_{k}\mathbf{1}\left(R_{k}^{0}-R_{k}^{tot}\right)}{\beta_{0}+\sum_{k=1}^{K}\beta_{k}} \quad (17)
=\underbrace{\frac{\beta_{0}}{\beta_{0}+\sum_{k=1}^{K}\beta_{k}}\mathbf{1}\big(\mathcal{D}(\mathbb{P}_{0}||\mathbb{P}_{1})-2\epsilon^{2}\big)}_{\text{Covert penalty}}+\underbrace{\frac{1}{\beta_{0}+\sum_{k=1}^{K}\beta_{k}}\sum_{k=1}^{K}\beta_{k}\mathbf{1}\big(R_{k}^{0}-R_{k}^{tot}\big)}_{\text{QoS penalty}},

where $\beta_{0}$ and $\beta_{k}$ are control variables, and $\mathbf{1}(a-b)$ is the indicator function with $\mathbf{1}(a-b)=1$ if $a-b>0$ and $\mathbf{1}(a-b)=0$ otherwise. The meaning of $p_{t}$ can be explained as follows. The penalty increases with each violated covert or QoS constraint, i.e., (11e) and (11f), weighted by the corresponding $\beta_{0}$ and $\beta_{k}$ ($k=1,2,\ldots,K$). The penalty is then normalized so that $p_{t}\in[0,1]$ (i.e., the first line of (17)). The penalty can be rewritten as the sum of two components, i.e., the covert penalty and the QoS penalty, as shown in the second line of (17). Note that the penalty of the BS can be calculated by using the feedback mechanism. Once the users receive the messages from the BS, they calculate the data rates of the messages and send the calculated data rates back to the BS along with their minimum rate requirements (QoS requirements) [34]. Based on this feedback from the users, the BS can calculate the corresponding QoS penalty. Similarly, the BS can be notified by the users if the channel is interrupted by the warden, and the covert penalty can be calculated accordingly. Our designed immediate reward encourages the BS to drive the penalty $p_{t}$ to 0. Thus, max-min fairness is guaranteed while the covert and QoS constraints are satisfied.
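The reward shaping in (16)-(17) can be written compactly as in the following sketch; the function names are illustrative and the inputs are assumed to be computed from the users' feedback as described above.

def penalty(D_kl, eps, R_tot, R_min, beta0, beta):
    """Normalized penalty in (17): one weighted indicator per violated
    covert/QoS constraint, scaled into [0, 1]."""
    covert = beta0 * float(D_kl - 2.0 * eps ** 2 > 0)
    qos = sum(b * float(r0 - r > 0) for b, r, r0 in zip(beta, R_tot, R_min))
    return (covert + qos) / (beta0 + sum(beta))

def reward(R_tot, p_t):
    """Immediate reward in (16): the min-rate if no constraint is violated, otherwise zero."""
    return min(R_tot) if p_t == 0 else 0.0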

III-C Optimization Formulation

We consider a stochastic policy $\Omega_{\theta}$ of the BS (i.e., $\Omega_{\theta}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$), defined as the probability that action $a_{t}$ is taken given the current state $s_{t}$, i.e., $\Omega_{\theta}=\text{Pr}\{a_{t}|s_{t};\theta\}$, where $\theta$ is the policy parameter vector of the deep neural network. Let $J(\Omega_{\theta})$ denote the expected discounted reward of the BS by following policy $\Omega_{\theta}$:

J(\Omega_{\theta})=\mathbb{E}_{a_{t}\sim\Omega,s_{t}\sim\mathcal{P}}\Big[\sum_{t=0}^{\infty}\tau^{t}r_{t}(s_{t},a_{t})\Big], \quad (18)

where $\tau$ is the discount factor and $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ is the state transition probability distribution which models the dynamics of the environment, i.e., the dynamics of the channel state information. Here, $\mathcal{P}$ is unknown to the BS. Our goal is to find the optimal policy $\Omega_{\theta}^{*}$ for the BS that maximizes $J(\Omega_{\theta})$, i.e.,

\max_{\Omega_{\theta}}\quad J(\Omega_{\theta}) \quad (19)
\text{s.t.}\quad a_{t}\sim\Omega_{\theta}(a_{t}|s_{t}),
\qquad\ s_{t+1}\sim\mathcal{P}(s_{t+1}|s_{t},a_{t}).

Maximizing $J(\Omega_{\theta})$ is very challenging as we consider continuous state and action spaces. It is noted that in (19), we do not require complete information of the channel, i.e., the channel matrix $\mathbf{H}$ in (11), as required in other works in the literature [7, 6, 9, 10]. Instead, the patterns of the channel can be learned through feedback from the users with deep neural networks. For this, we develop a learning algorithm based on a policy gradient method, namely Proximal Policy Optimization (PPO) [28], to approximate the optimal policy of the BS. PPO is a sample-efficient algorithm which can work with large continuous state and action spaces and can deal with the uncertainty of the channel state.

IV Proximal Policy Optimization Algorithm

IV-A PPO Algorithm

As we discussed in Section III-A, action-value methods and policy gradient methods have their own advantages and drawbacks. In action-value methods, each action of the agent can be characterized by a real value, e.g., a Q-value [33], and once the optimal policy is obtained, the optimal actions can be selected by taking the maximum action-value at each state. This family of algorithms is well studied for environments with discrete action spaces, e.g., a game that requires a player to turn left, turn right, or jump. Thus, action-value methods are suitable for discrete action spaces, and the optimal policy can be effectively estimated if the number of actions is relatively small. However, in many cases, the action of an agent cannot be categorized by discrete action-values, e.g., a task that requires controlling a robot arm with continuous force. For this, policy gradient methods can be applied by directly estimating the policy of the agent instead of using a greedy selection over action-values. The policy of the agent can be a distribution, e.g., Gaussian, over actions. Therefore, instead of finding the action-values of the agent, policy gradient algorithms aim to find the “shape” of the action distribution, i.e., the mean and variance of the distribution.

In our problem, all actions considered in (11) are continuous, and thus only policy gradient methods can be used. In the following, we describe an effective algorithm based on PPO [28] to maximize the min-rate of the BS. The operation of PPO in our proposed framework is illustrated in Fig. 2. In particular, the input of the policy update procedure is the joint state of the channel and the lengths of the packets to be sent at the BS. The output is the BS's policy, i.e., the action distributions. We use one Gaussian distribution to illustrate the output of the policy update in Fig. 2 for the sake of presentation simplicity. In our actual implementation, the action of the BS has multiple dimensions, and each dimension is represented by a Gaussian distribution which differs in its mean and variance values. The details of the PPO algorithm are as follows.

Figure 2: PPO policy update at the BS.

The PPO uses two deep neural networks as a policy parameter vector and a value function vector, denoted by $\theta$ and $\Theta$, respectively, to efficiently update the policy. The policy parameter vector $\theta$ can be updated by using a gradient ascent method as follows:

\theta_{t+1}=\theta_{t}+\alpha\hat{g}_{t}, \quad (20)

where $\alpha$ is the step size and $\hat{g}_{t}$ is a gradient estimator. The gradient estimator $\hat{g}_{t}$ can be calculated by differentiating a loss function as follows:

\hat{g}_{t}=\nabla_{\theta}L(\theta). \quad (21)

It can be observed from (20) and (21) that the choice of the loss function $L(\theta)$ has a significant impact on the policy update. $L(\theta)$ should have a small variance so that it does not cause bad gradient updates which result in significant decreases of $J(\Omega_{\theta})$. Since the continuous action space is sensitive to the policy update, a minor negative change in updating $\theta$ can lead to destructively large policy updates [28]. To overcome this problem, the PPO algorithm uses a loss function $L^{PPO}(\theta)$ to replace $L(\theta)$:

L^{PPO}(\theta)=\min\Big(\frac{\Omega_{\theta}}{\Omega_{\theta_{old}}}\hat{A}_{t},u(\varepsilon,\hat{A}_{t})\Big), \quad (22)

where $\hat{A}_{t}$ is the advantage function and $u(\varepsilon,\hat{A}_{t})$ is the clip function. $\hat{A}_{t}$ estimates whether the action taken is better than the policy's default behavior, and $u(\cdot)$ limits significant updates which may degrade $J(\Omega_{\theta})$. The advantage function at time step $t$ can be defined by:

\hat{A}_{t}(s_{t},a_{t};\theta)=Q_{t}(s_{t},a_{t};\theta)-V_{t}(s_{t};\Theta), \quad (23)

where $Q_{t}(s_{t},a_{t};\theta)=\mathbb{E}_{a_{t}\sim\Omega_{\theta},s_{t}\sim\mathcal{P}}\Big[\sum_{l=0}^{\infty}\tau^{l}r_{t}(s_{t+l},a_{t+l})\Big]$ is the action value function and $V_{t}(s_{t};\Theta)=\mathbb{E}_{s_{t}\sim\mathcal{P}}\Big[\sum_{l=0}^{\infty}\tau^{l}r_{t}(s_{t+l},a_{t+l})\Big]$ is the state value function. The clip function is defined as follows [28]:

u(\varepsilon,\hat{A}_{t})=\begin{cases}(1+\varepsilon)\hat{A}_{t},&\text{if}\ \hat{A}_{t}\geq 0,\\ (1-\varepsilon)\hat{A}_{t},&\text{if}\ \hat{A}_{t}<0.\end{cases} \quad (24)

The idea of PPO is to prevent the new policy from moving too far away from the old policy $\Omega_{\theta_{old}}$. The first term inside the $\min$ operator in (22), i.e., $\frac{\Omega_{\theta}}{\Omega_{\theta_{old}}}\hat{A}_{t}$, is the surrogate objective which takes into consideration the probability ratio between the new policy and the old policy, i.e., $\frac{\Omega_{\theta}}{\Omega_{\theta_{old}}}$. The second term, i.e., $u(\varepsilon,\hat{A}_{t})$, removes the incentive for moving this probability ratio outside of the interval $[1-\varepsilon,1+\varepsilon]$. The pseudo-code of the PPO algorithm is described in Algorithm 1.
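A minimal sketch of the clipped surrogate (22)-(24) is given below, written in the equivalent standard form $\min\big(r\hat{A},\,\text{clip}(r,1-\varepsilon,1+\varepsilon)\hat{A}\big)$; the default $\varepsilon=0.2$ is only a common choice from [28], not a value prescribed here.

import numpy as np

def ppo_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate objective L^PPO in (22).

    ratio: Omega_theta(a_t|s_t) / Omega_theta_old(a_t|s_t) for the sampled actions.
    advantage: advantage estimates A_hat_t from (23).
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)   # to be maximized with respect to theta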

The main steps of the PPO algorithm can be described as follows. First, a policy parameter vector $\theta_{0}$ and a value function parameter vector $\Theta_{0}$ are randomly initialized (i.e., lines 2 and 3 in Algorithm 1). Second, at each policy update episode, indexed by $k$, the BS collects a set of trajectories $\mathcal{B}_{k}$, i.e., a batch of state, action, and reward values, by running the current policy $\Omega_{\theta_{k}}$ over $T$ time steps (i.e., line 5). After that, the cumulative reward is calculated as in line 6. Next, the BS computes the advantage function as in (23) (i.e., line 7). With the obtained advantage function, the loss function can be calculated as in (22), and the policy parameter vector can be updated as in line 8 of Algorithm 1. Finally, the value function parameter vector can be updated as in line 9. The procedure is repeated until the cumulative reward converges to a saturating value.

1 Input:
2 Initialize policy parameter vector $\theta_{0}$,
3 Initialize value function parameter vector $\Theta_{0}$,
4 for $k=0,1,2,\ldots$ do
5       Collect a set of trajectories $\mathcal{B}_{k}=\left\{\upsilon_{t};\upsilon_{t}=(s_{t},a_{t},r_{t})\right\}$ by running policy $\Omega_{\theta_{k}}$ in the environment
6       Compute the cumulative reward $\hat{R}_{t}=\sum_{t=0}^{T}\tau^{t}r_{t}$
7       Compute the advantage function $\hat{A}_{t}$ as in (23)
8       Update the policy by maximizing the objective (22):
\theta_{k+1}=\operatorname*{\arg\!\max}_{\theta}\frac{1}{\left|\mathcal{B}_{k}\right|T}\sum_{\upsilon\in\mathcal{B}_{k}}\sum_{t=0}^{T}L^{PPO}(\theta),
9       Fit the value function by regression on the mean-squared error:
\Theta_{k+1}=\operatorname*{\arg\!\min}_{\Theta}\frac{1}{\left|\mathcal{B}_{k}\right|T}\sum_{\upsilon\in\mathcal{B}_{k}}\sum_{t=0}^{T}\left(V_{t}\left(s_{t};\Theta\right)-\hat{R}_{t}\right)^{2}
10 end for
Outputs: $\Omega_{\theta}^{*}=\text{Pr}(a_{t}|s_{t};\theta)$
Algorithm 1 Proximal Policy Optimization (PPO)
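To illustrate lines 6, 7, and 9 of Algorithm 1, the following sketch computes the discounted cumulative reward and a simple Monte-Carlo estimate of the advantage in (23), using the return from time $t$ as the sample of the action value; this estimator choice and the discount factor value are assumptions of the example, since more elaborate advantage estimators would also fit here.

import numpy as np

def returns_and_advantages(rewards, values, tau=0.99):
    """Discounted returns R_hat_t (line 6) and advantages A_hat_t = R_hat_t - V(s_t),
    a Monte-Carlo estimate of (23); `values` are the critic outputs V(s_t; Theta).
    The discount factor 0.99 is illustrative, not a value taken from the text."""
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):              # accumulate from the end of the trajectory
        running = rewards[t] + tau * running
        returns[t] = running
    advantages = returns - np.asarray(values)
    return returns, advantages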

IV-B Complexity Analysis

We further analyze the computational complexity of the PPO algorithm used in our considered system. Since PPO uses deep neural networks as function approximators, the complexity mostly depends on updating these networks. As the two deep neural networks in PPO share the same architecture, the complexity of updating them can be analyzed as follows. Each network consists of an input layer $X_{0}$, two fully-connected layers $X_{1}$ and $X_{2}$, and an output layer $X_{3}$. Let $|X_{i}|$ be the size of layer $X_{i}$, i.e., the number of neurons in layer $X_{i}$. The complexity of the two networks can be calculated as $2(|X_{0}||X_{1}|+|X_{1}||X_{2}|+|X_{2}||X_{3}|)$. At each episode update, a trajectory, i.e., a batch of state, action, and reward values, is sampled by running the current policy to calculate the advantage function and value function used to update the networks. Thus, the total complexity of the training process is $O\Big(2T|\mathcal{B}_{k}|(|X_{0}||X_{1}|+|X_{1}||X_{2}|+|X_{2}||X_{3}|)\Big)$, where $|\mathcal{B}_{k}|$ is the size of the trajectory sampled from the environment. There are two main reasons why PPO is a sample-efficient algorithm. First, the size of a trajectory $\mathcal{B}_{k}$ is relatively small, i.e., from hundreds to thousands of samples [28], compared with the size of a replay memory in conventional action-value methods, e.g., from 50,000 to 1,000,000 in DQN [33]. Here, in our simulations, $|\mathcal{B}_{k}|$ is set to 200. Second, the size of the output layer of PPO is equal to the number of action dimensions. As a result, the size of the output layer can be significantly smaller than those of action-value methods, which discretize the continuous action space into different chunks to compute the action values, e.g., Q-values. Clearly, the architecture of the deep neural networks is simple enough to be implemented at base stations, which are usually equipped with sufficient computing resources.
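As a quick worked example of the complexity expression above, using the layer sizes reported later in Section V-A ($|X_{0}|=2K$, $|X_{1}|=|X_{2}|=64$, $|X_{3}|=3K+1$ with $K=3$), each network contains $6\cdot 64+64\cdot 64+64\cdot 10=5120$ weights, so the two networks together contain $10240$. The short script below only reproduces this count and is purely illustrative.

K = 3                                        # number of users (Section V-A)
sizes = [2 * K, 64, 64, 3 * K + 1]           # |X0|, |X1|, |X2|, |X3|
per_network = sum(a * b for a, b in zip(sizes, sizes[1:]))
print(per_network, 2 * per_network)          # 5120 weights per network, 10240 in total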

V Performance Evaluation

V-A Parameter Setting

Our simulation parameters are set as follows. We use the same parameters for RSMA and covert communications as those in [4, 30]. The total transmission power of the BS is set to $P_{t}=20$ dB. The channel control variables in (13) are set to $(g_{1},g_{2},g_{3})=(1.0,0.8,0.2)$ and $(\phi_{1},\phi_{2},\phi_{3})=(0,\frac{\pi}{9},\frac{2\pi}{9})$. Furthermore, the channel estimation error variances are set to $\sigma_{1}^{2}=P_{t}^{-0.6}$, $\sigma_{2}^{2}=0.8P_{t}^{-0.6}$, and $\sigma_{3}^{2}=0.2P_{t}^{-0.6}$ [4]. The equivalent values for the warden in (14) are set to $g_{w}=0.4$ and $\phi_{w}=\frac{\pi}{6}$. The covert requirement parameter is $\epsilon=0.1$ [30]. Unlike conventional transmission schemes, covert communications require a relatively low data rate to hide information from the warden/adversary. Therefore, we set the QoS requirements of the users to $R_{1}^{0}=R_{2}^{0}=R_{3}^{0}=10^{-4}$ (bps/Hz). We assume that the length of the message $W_{k}$ to be sent at the BS follows a uniform distribution with minimum and maximum values of 0 and 1 kilobits, respectively, i.e., $L_{k}\sim\mathcal{U}(0,1.0)$ (kilobits). The number of antennas at the BS and the number of users are set to $M=K=3$.

For the deep neural networks, the parameters are set as follows. The two deep neural networks representing the policy parameter vector and the value function vector, i.e., $\theta$ and $\Theta$, respectively, share the same architecture. Each deep neural network has two fully connected layers, and each layer contains 64 neurons. The number of neurons in the output layer is equal to the number of dimensions of the action, i.e., $3K+1$. The number of neurons in the input layer is equal to the number of dimensions of the joint state at the BS (as shown in Fig. 2), i.e., $2K$. The learning rate and clip values of PPO are adopted from [28].
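For reference, one possible PyTorch realization of this architecture is sketched below; the tanh activations and the scalar critic output are assumptions (the text above only specifies the layer widths and the actor's input and output sizes), and the Gaussian policy head with a learnable log-standard-deviation is a common implementation choice rather than a detail specified in the paper.

import torch
import torch.nn as nn

K = 3
state_dim, action_dim = 2 * K, 3 * K + 1      # joint state size and action size

def mlp(out_dim):
    # Two fully connected hidden layers with 64 neurons each, as described above.
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, out_dim),
    )

actor = mlp(action_dim)                          # outputs the mean of each action dimension
log_std = nn.Parameter(torch.zeros(action_dim))  # learnable std of the Gaussian policy (assumption)
critic = mlp(1)                                  # state value V(s_t; Theta)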

In the following, we investigate the performance of our proposed PPO algorithm on RSMA and SDMA systems, denoted as P-RSMA and P-SDMA, respectively. To further understand the impacts of covert communications on both RSMA and SDMA systems, we run various simulations for scenarios in the FBL regime and in the infinite blocklength (IBL) regime. It is noted that in the IBL regime, the BS can achieve the full data rate given by the Shannon capacity, and the covert constraint, i.e., constraint (11e), is temporarily removed. In the case of covert communications under consideration with FBL, the optimization problem is fully considered as in (11). Furthermore, to evaluate the efficiency of the proposed learning algorithm, we introduce other baselines in which a Greedy algorithm is applied. These baselines, denoted as G-RSMA and G-SDMA, aim to obtain the maximum immediate reward at each time step, compared with all the historical reward values stored in a buffer, without considering the long-term cumulative reward. In the following, we investigate the performance of all the aforementioned schemes. The considered metrics are (i) the average (covert) min-rate (or average reward) and (ii) the average (covert) sum-rate.

V-B Simulation Results

V-B1 Convergence property

Figure 3: (a) Average min-rate and (b) average sum-rate with $P_{t}=30$ dB allocated for 3 users.

We first investigate the convergence performance of the proposed P-RSMA and P-SDMA in the IBL regime (lines (3), (4), (7), and (8) in Fig. 3). It can be observed from Fig. 3 that the proposed P-RSMA achieves the highest min-rate and sum-rate values, followed by P-SDMA (lines (3) and (4)). The reason is that in the IBL regime, the BS can obtain the data rate with the full Shannon capacity. The results also show that RSMA outperforms SDMA in both min-rate and sum-rate, which is consistent with the findings in [4]. In the same IBL regime, the min-rate and sum-rate obtained by G-RSMA and G-SDMA are much lower than those of P-RSMA and P-SDMA (lines (7) and (8)). The reason is that the Greedy algorithm only considers historical rewards and immediate rewards without aiming to maximize the long-term reward. Unlike the Greedy scheme, PPO with deep neural networks can iteratively update the policy toward the maximum cumulative reward. Furthermore, bad updates negating the reward values are eliminated by the clip function (24). Thus, the learning curves obtained by P-RSMA and P-SDMA are much more stable.

In the FBL regime, it can be observed that the min-rate and sum-rate obtained by all the schemes are much lower than those in the IBL regime (lines (1), (2), (5), and (6)). The reasons are that (i) the data rate no longer follows the Shannon capacity and (ii) the data rate is reduced close to 0 to achieve covertness [19]. With G-RSMA and G-SDMA (lines (5) and (6)), the min-rate and sum-rate are 0. In other words, the Greedy algorithm can only achieve covert communications by reducing the data rate to 0, and no information can be exchanged between the BS and the users. With P-RSMA and P-SDMA (lines (1) and (2)), both the min-rate and sum-rate values are relatively small but remain positive when the algorithm converges. These results confirm that with the proposed PPO algorithm, the BS and users can exchange covert information without being detected by the warden.

V-B2 Impacts of transmission power

Figure 4: (a) Average min-rate and (b) average sum-rate vs. transmission power $P_{t}$ (dB).

Next, in Fig. 4 we vary the transmission power $P_{t}$ at the BS and evaluate the performance of the proposed schemes. Similarly, we first discuss the performance of all the schemes in the IBL regime. It can be observed that the min-rate and sum-rate obtained by P-RSMA and P-SDMA increase with the transmission power at the BS (lines (3) and (4)). Unlike RSMA, SDMA's data rate saturates at high transmission power [7]. In the low transmission power region (e.g., 0 dB to 20 dB), the difference between RSMA and SDMA is insignificant [2]. Similar to the results in Fig. 3, the min-rate and sum-rate of G-RSMA and G-SDMA are much lower than those of P-RSMA and P-SDMA.

In the FBL regime, it can be observed that the min-rate and sum-rate obtained by P-RSMA and P-SDMA are much lower than those in the IBL regime (lines (1) and (2)). When $P_{t}$ increases, the data rates of P-RSMA and P-SDMA remain unchanged at 0.007 bps/Hz for the average min-rate and 0.05 bps/Hz for the average sum-rate. These results imply that with the proposed PPO algorithm, covert communications between the BS and users can be maintained at a positive rate. In other words, covertness can always be achieved regardless of the transmission power at the BS. Unlike PPO, the Greedy algorithm can only hide information from the warden by reducing the data rate to 0, i.e., no information can be exchanged (lines (5) and (6)).

V-B3 Impacts of covert constraint

Figure 5: (a) Average min-rate and (b) average sum-rate vs. covert requirement ($\epsilon$).

In Fig. 5, we evaluate the impacts of the covert constraint on the system performance by varying $\epsilon$ in (11e). Similarly, we first discuss the results in the IBL regime. Since the covert constraint is not considered in this regime, it is clearly observed that the average min-rate and sum-rate values of P-RSMA and P-SDMA (lines (3) and (4)) remain stable and significantly higher than those of the baselines G-RSMA and G-SDMA (lines (7) and (8)) as the covert constraint parameter $\epsilon$ increases.

In the FBL regime, the average min-rate and sum-rate values obtained by P-RSMA and P-SDMA remain stable as $\epsilon$ increases (lines (1) and (2)). In particular, these saturated values are 0.007 bps/Hz for the min-rate and 0.05 bps/Hz for the sum-rate. For the baselines G-RSMA and G-SDMA (lines (5) and (6)), the obtained min-rate and sum-rate values are equal to 0, which illustrates that these baselines cannot achieve covert transmissions in the considered setting.

V-B4 Impacts of blocklength

Figure 6: (a) Average min-rate and (b) average sum-rate vs. maximum blocklength ($L_{k}$).

Finally, in Fig. 6, we investigate the impacts of the blocklength, i.e., the length $L_{k}$ of the message $W_{k}$ being sent at the BS, on the system performance. We vary the distribution of the packet length $L_{k}$ at the BS over different intervals. In particular, we consider nine distribution intervals, namely $\mathcal{U}[0,0.1],\mathcal{U}[0.1,0.2],\mathcal{U}[0.2,0.3],\ldots,\mathcal{U}[0.8,0.9]$ (kilobits). Note that in Fig. 6, we denote these distributions by their maximum values for the sake of simplicity. It can be observed that, in the IBL regime where $L_{k}\rightarrow\infty$, the min-rate and sum-rate values of all the schemes are independent of the blocklength (lines (3), (4), (7), and (8)). However, in the FBL regime, the min-rate and sum-rate values obtained by P-RSMA and P-SDMA decrease as the blocklength increases (lines (1) and (2)). In other words, the longer the message sent from the BS, the lower the data rate that can be achieved. This finding is in line with the mathematical analysis derived in [19]. According to [19], the number of bits that can be covertly transmitted over $n$ channel uses follows a square root law, i.e., it scales as $\mathcal{O}(\sqrt{n})$, so the covert rate $\mathcal{O}(\sqrt{n})/n$ asymptotically approaches zero as $n\rightarrow\infty$. Unlike the positive data rate values achieved by the proposed schemes, the min-rate and sum-rate values of the baselines G-RSMA and G-SDMA remain at 0 (lines (5) and (6)).

VI Conclusion

In this paper, we have developed a novel dynamic framework to jointly optimize power allocation and rate control for RSMA networks under the uncertainty of the surrounding environment and with covert communication requirements. In particular, our proposed stochastic optimization framework allows the BS to adjust its transmission power together with its message splitting, based on observations of the surrounding environment, so as to maximize the rate performance of the whole system. Furthermore, we have developed a learning algorithm that can not only help the BS to deal with continuous action and state spaces effectively, but also quickly find the optimal policy for the BS without requiring complete information about the surrounding environment in advance. Extensive simulations have demonstrated that, with the obtained policy, the BS can dynamically adjust the power and transmission rates to the users so that the achievable covert rate is maximized. At the same time, the BS can minimize the probability of being detected by the warden.

-A Relative entropy $\mathcal{D}(\mathbb{P}_{0}|\mathbb{P}_{1})$ between two distributions $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$

As defined in the hypothesis test of the warden in (7), when the BS is not transmitting, the warden observes i.i.d. Gaussian noise with variance $\sigma_{w}^{2}$, i.e., $\mathbb{P}_{0}=\mathcal{N}(0,\sigma_{w}^{2})$. Note that the warden does not know the codebook. Therefore, from the warden's perspective, the transmitted symbols are zero-mean i.i.d. Gaussian random variables with variance $P_{f}$. Since the signal $\mathbf{x}$ is transmitted with power $P_{t}$ at the transmitter and the channel between the warden and the BS is defined in (14), we have $P_{f}=g_{w}P_{t}$. Therefore, the distribution $\mathbb{P}_{1}$ is given by:

\begin{split}
\mathbb{P}_{1}&=\mathcal{N}(0,P_{f}+\sigma_{w}^{2})\\
&=\mathcal{N}(0,g_{w}P_{t}+\sigma_{w}^{2}).
\end{split} \quad (25)

Let $a=\sigma_{w}$ and $b=\sqrt{g_{w}P_{t}+\sigma_{w}^{2}}$. The probability density functions of $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$ are then, respectively:

p_{0}(x)=\frac{1}{\sqrt{2\pi}a}\mathrm{e}^{-\frac{1}{2}\left(\frac{x}{a}\right)^{2}}, \quad (26)
p_{1}(x)=\frac{1}{\sqrt{2\pi}b}\mathrm{e}^{-\frac{1}{2}\left(\frac{x}{b}\right)^{2}}. \quad (27)

The relative entropy between $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$ is then calculated by:

\begin{split}
\mathcal{D}(\mathbb{P}_{0}|\mathbb{P}_{1})&=\int_{-\infty}^{+\infty}p_{0}(x)\ln\frac{p_{0}(x)}{p_{1}(x)}\,\mathrm{d}x\\
&=\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}a}\mathrm{e}^{-\frac{1}{2}\left(\frac{x}{a}\right)^{2}}\ln\Big(\frac{b}{a}\mathrm{e}^{-\frac{1}{2}\left[\left(\frac{x}{a}\right)^{2}-\left(\frac{x}{b}\right)^{2}\right]}\Big)\,\mathrm{d}x\\
&=\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi}a}\mathrm{e}^{-\frac{1}{2}\left(\frac{x}{a}\right)^{2}}\Big(\ln\big(\frac{b}{a}\big)-\frac{1}{2}\big[\big(\frac{x}{a}\big)^{2}-\big(\frac{x}{b}\big)^{2}\big]\Big)\,\mathrm{d}x\\
&=-\frac{1}{2^{\frac{3}{2}}\sqrt{\pi}a^{3}b^{2}}\underbrace{\int_{-\infty}^{+\infty}\left(\left(b^{2}-a^{2}\right)x^{2}-2a^{2}b^{2}\ln\left(\frac{b}{a}\right)\right)\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x}_{\mathcal{D}_{1}}\text{ (apply linearity)}.
\end{split} \quad (28)
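The last equality in (28) follows from rewriting the bracketed term over the common denominator $2a^{2}b^{2}$ (a small intermediate step, spelled out here for completeness):

\ln\Big(\frac{b}{a}\Big)-\frac{1}{2}\Big[\Big(\frac{x}{a}\Big)^{2}-\Big(\frac{x}{b}\Big)^{2}\Big]
=\frac{2a^{2}b^{2}\ln\left(\frac{b}{a}\right)-\left(b^{2}-a^{2}\right)x^{2}}{2a^{2}b^{2}}
=-\frac{\left(b^{2}-a^{2}\right)x^{2}-2a^{2}b^{2}\ln\left(\frac{b}{a}\right)}{2a^{2}b^{2}},

and from noting that $\frac{1}{\sqrt{2\pi}a}\cdot\frac{1}{2a^{2}b^{2}}=\frac{1}{2^{\frac{3}{2}}\sqrt{\pi}a^{3}b^{2}}$.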

Now we need to solve $\mathcal{D}_{1}$. We expand $\mathcal{D}_{1}$ and apply linearity:

\begin{split}
\mathcal{D}_{1}&=\int_{-\infty}^{+\infty}\left(\left(b^{2}-a^{2}\right)x^{2}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}-2a^{2}b^{2}\ln\left(\frac{b}{a}\right)\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\right)\mathrm{d}x\\
&=\left(b^{2}-a^{2}\right)\underbrace{\int_{-\infty}^{+\infty}x^{2}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x}_{\mathcal{D}_{2}}-2a^{2}b^{2}\ln\left(\frac{b}{a}\right)\underbrace{\int_{-\infty}^{+\infty}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x}_{\mathcal{D}_{3}}.
\end{split} \quad (29)

We first solve $\mathcal{D}_{2}$ by integration by parts, i.e., $\int fg^{\prime}=fg-\int f^{\prime}g$ (we work with the antiderivatives and apply the integration limits at the final step in (37)). Let $f=x$ and $g^{\prime}=x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}$, so that $f^{\prime}=1$ and $g=-a^{2}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}$. We now have:

\mathcal{D}_{2}=-a^{2}x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}-\underbrace{\int -a^{2}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x}_{\mathcal{D}_{4}}. \quad (30)

$\mathcal{D}_{4}$ can be solved as follows. We substitute $u=\frac{x}{\sqrt{2}a}\rightarrow\mathrm{d}x=\sqrt{2}a\,\mathrm{d}u$. $\mathcal{D}_{4}$ becomes:

\mathcal{D}_{4}=-\frac{\sqrt{\pi}a^{3}}{\sqrt{2}}\int\frac{2\mathrm{e}^{-u^{2}}}{\sqrt{\pi}}\,\mathrm{d}u. \quad (31)

Note that $\mathcal{D}_{4}$ contains a special integral: $\int\frac{2\mathrm{e}^{-u^{2}}}{\sqrt{\pi}}\,\mathrm{d}u=\operatorname{erf}(u)$, the Gauss error function. Plugging this into $\mathcal{D}_{4}$:

\begin{split}
\mathcal{D}_{4}&=-\frac{\sqrt{\pi}a^{3}\operatorname{erf}(u)}{\sqrt{2}}\\
&=-\frac{\sqrt{\pi}a^{3}\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{\sqrt{2}}\text{ (undo substitution $u=\frac{x}{\sqrt{2}a}$)}.
\end{split} \quad (32)

Plugging $\mathcal{D}_{4}$ into $\mathcal{D}_{2}$:

\mathcal{D}_{2}=\frac{\sqrt{\pi}a^{3}\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{\sqrt{2}}-a^{2}x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}. \quad (33)

Once $\mathcal{D}_{2}$ is solved, $\mathcal{D}_{3}=\int\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x$ can be calculated in the same way. We substitute $u=\frac{x}{\sqrt{2}a}\rightarrow\mathrm{d}x=\sqrt{2}a\,\mathrm{d}u$. $\mathcal{D}_{3}$ becomes:

\mathcal{D}_{3}=\frac{\sqrt{\pi}a}{\sqrt{2}}\int\frac{2\mathrm{e}^{-u^{2}}}{\sqrt{\pi}}\,\mathrm{d}u. \quad (34)

Using the Gauss error function result again, we have:

\begin{split}
\mathcal{D}_{3}&=\frac{\sqrt{\pi}a\operatorname{erf}(u)}{\sqrt{2}}\\
&=\frac{\sqrt{\pi}a\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{\sqrt{2}}\text{ (undo substitution $u=\frac{x}{\sqrt{2}a}$)}.
\end{split} \quad (35)
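As a quick sanity check (not part of the original derivation), evaluating (35) over the whole real line, with $\operatorname{erf}(\pm\infty)=\pm 1$, recovers the familiar Gaussian normalization constant:

\mathcal{D}_{3}\Big\rvert_{-\infty}^{+\infty}=\frac{\sqrt{\pi}a}{\sqrt{2}}\left(\operatorname{erf}(+\infty)-\operatorname{erf}(-\infty)\right)=\frac{\sqrt{\pi}a}{\sqrt{2}}\cdot 2=\sqrt{2\pi}\,a,

which is exactly $\int_{-\infty}^{+\infty}\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\,\mathrm{d}x$ for a Gaussian with standard deviation $a$.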

Once $\mathcal{D}_{2}$ and $\mathcal{D}_{3}$ are solved, let's plug (33) and (35) into (29):

\begin{split}
\mathcal{D}_{1}&=(b^{2}-a^{2})\mathcal{D}_{2}-2a^{2}b^{2}\ln\left(\frac{b}{a}\right)\mathcal{D}_{3}\\
&=-\sqrt{2}\sqrt{\pi}a^{3}b^{2}\ln\left(\frac{b}{a}\right)\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)+\frac{\sqrt{\pi}a^{3}\left(b^{2}-a^{2}\right)\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{\sqrt{2}}-a^{2}\left(b^{2}-a^{2}\right)x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}.
\end{split} \quad (36)

Finally, applying the limits $x\rightarrow\pm\infty$, where $\operatorname{erf}(\pm\infty)=\pm 1$ and $x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}\rightarrow 0$, we have:

\begin{split}
\mathcal{D}(\mathbb{P}_{0}|\mathbb{P}_{1})&=-\frac{1}{2^{\frac{3}{2}}\sqrt{\pi}a^{3}b^{2}}\mathcal{D}_{1}\\
&=\left.\left(\frac{\ln\left(\frac{b}{a}\right)\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{2}-\frac{\left(b^{2}-a^{2}\right)\operatorname{erf}\left(\frac{x}{\sqrt{2}a}\right)}{4b^{2}}+\frac{\left(b^{2}-a^{2}\right)x\mathrm{e}^{-\frac{x^{2}}{2a^{2}}}}{2^{\frac{3}{2}}\sqrt{\pi}ab^{2}}\right)\right\rvert_{-\infty}^{+\infty}\\
&=\ln(b)-\ln(a)+\frac{a^{2}}{2b^{2}}-\frac{1}{2}.
\end{split} \quad (37)

Substituting back $a=\sigma_{w}$ and $b=\sqrt{g_{w}P_{t}+\sigma_{w}^{2}}$, we obtain:

\mathcal{D}(\mathbb{P}_{0}|\mathbb{P}_{1})=\ln\left(\sqrt{g_{w}P_{t}+\sigma_{w}^{2}}\right)-\ln\left(\sigma_{w}\right)+\frac{\sigma_{w}^{2}}{2\left(g_{w}P_{t}+\sigma_{w}^{2}\right)}-\frac{1}{2}. \quad (38)

The proof of Proposition 1 is now completed.
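The closed form (38) can also be checked numerically. The short Python sketch below (an independent verification aid, not part of the proof; the values of $g_w$, $P_t$, and $\sigma_w^2$ are arbitrary assumptions) compares (38) against the relative entropy obtained by numerical integration of (28):

```python
import numpy as np
from scipy.integrate import quad

# Arbitrary illustrative parameters (assumptions, not values from the paper).
g_w, P_t, sigma_w2 = 0.3, 2.0, 1.0

a2 = sigma_w2                  # variance under H0 (BS silent)
b2 = g_w * P_t + sigma_w2      # variance under H1 (BS transmitting)

# Closed-form relative entropy from (38): ln(b) - ln(a) + a^2/(2 b^2) - 1/2.
d_closed = 0.5 * np.log(b2) - 0.5 * np.log(a2) + a2 / (2.0 * b2) - 0.5

# Numerical integration of D(P0|P1) = \int p0(x) ln(p0(x)/p1(x)) dx, as in (28).
def p(x, var):
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

integrand = lambda x: p(x, a2) * np.log(p(x, a2) / p(x, b2))
d_numeric, _ = quad(integrand, -np.inf, np.inf)

print(f"closed form (38): {d_closed:.6f}")
print(f"numerical (28):   {d_numeric:.6f}")  # the two values should agree
```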

References

  • [1] O. Dizdar, Y. Mao, W. Han, and B. Clerckx, “Rate-splitting multiple access: A new frontier for the PHY layer of 6G,” in 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall), Jul. 2020, pp. 1-7.
  • [2] O. Dizdar, Y. Mao, W. Han, and B. Clerckx, “Rate-splitting multiple access for downlink multi-antenna communications: Physical layer design and link-level simulations,” in 2020 IEEE 31st Annual International Symposium on Personal, Indoor and Mobile Radio Communications, Aug. 2020, pp. 1-6.
  • [3] P. Li, M. Chen, Y. Mao, Z. Yang, B. Clerckx, and M. Shikh-Bahaei, “Cooperative rate-splitting for secrecy sum-rate enhancement in multiantenna broadcast channels,” in 2020 IEEE 31st Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Oct. 2020, pp. 1-6.
  • [4] Y. Mao, B. Clerckx, and V. O. Li, “Rate-splitting multiple access for downlink communication systems: bridging, generalizing, and outperforming SDMA and NOMA,” EURASIP Journal on Wireless Communications and Networking, no. 1, pp. 1-54, Dec. 2018.
  • [5] Y. Mao, B. Clerckx, and V. O. K. Li, “Energy efficiency of rate-splitting multiple access, and performance benefits over SDMA and NOMA,” in 15th International Symposium on Wireless Communication Systems (ISWCS), Aug. 2018, pp. 1–5.
  • [6] H. Joudeh and B. Clerckx, “Sum-rate maximization for linearly precoded downlink multiuser MISO systems with partial CSIT: A rate-splitting approach,” IEEE Transactions on Communications, vol. 64, no. 11 , pp. 4847–4861, Nov. 2016.
  • [7] H. Joudeh, and B. Clerckx, “A rate-splitting strategy for max-min fair multigroup multicasting,” in 2016 IEEE 17th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Jul. 2016, pp. 1-5.
  • [8] G. Zhou, Y. Mao, and B. Clerckx, “Rate-splitting multiple access for multi-antenna downlink communication systems: Spectral and energy efficiency tradeoff,” IEEE Transactions on Wireless Communications, Dec. 2021.
  • [9] E. Piovano, H. Joudeh, and B. Clerckx, “Overloaded multiuser MISO transmission with imperfect CSIT,” in 50th Asilomar Conference on Signals, Systems, and Computers, Nov. 2016.
  • [10] H. Joudeh and B. Clerckx, “Robust transmission in downlink multiuser MISO systems: A rate-splitting approach,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6227–6242, Dec. 2016.
  • [11] Z. Yang, M. Chen, W. Saad, and M. Shikh-Bahaei, “Optimization of rate allocation and power control for rate splitting multiple access (RSMA),” IEEE Transactions on Communications, vol. 69, no. 9, pp. 5988-6002, Jun. 2021.
  • [12] S. Guo, and X. Zhou, “Robust power allocation for NOMA in heterogeneous vehicular communications with imperfect channel estimation,” in 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications, Oct. 2017, pp. 1-5.
  • [13] H. Xia, Y. Mao, B. Clerckx, X. Zhou, S. Han, and C. Li, “Weighted sum-rate maximization for rate-splitting multiple access based secure communication,” arXiv preprint, arXiv:2201.08472, Jan. 2022.
  • [14] H. Fu, S. Feng, W. Tang, and D. W. K. Ng, “Robust secure beamforming design for two-user downlink MISO rate-splitting systems,” IEEE Transactions on Wireless Communications, vol. 19, no. 12, pp. 8351-8365, Sep. 2020.
  • [15] S. Yan, X. Zhou, J. Hu, and S. V. Hanly, “Low probability of detection communication: Opportunities and challenges,” IEEE Wireless Communications, vol. 26, no. 5, pp. 19-25, Oct. 2019.
  • [16] W. Trappe, “The challenges facing physical layer security,” IEEE Communications Magazine, vol. 53, no. 6, pp. 16-20, Jun. 2015.
  • [17] X. He and A. Yener, “Providing secrecy when the eavesdropper channel is arbitrarily varying: A case for multiple antennas,” in 48th Annual Allerton Conference on Communication, Control and Computing, Sep. 2010, pp. 1228–35.
  • [18] S. Jana, S. N. Premnath, M. Clark, S. K. Kasera, N. Patwari, and S. V. Krishnamurthy, “On the effectiveness of secret key extraction from wireless signal strength in real environments,” in 15th International Conference on Mobile computing and Networking, Sep. 2009, pp. 321-332.
  • [19] B. A. Bash, D. Goeckel, and D. Towsley, “Limits of reliable communication with low probability of detection on AWGN channels,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 9, pp. 1921-1930, Aug. 2013.
  • [20] L. Tao, W. Yang, S. Yan, D. Wu, X. Guan, and D. Chen, “Covert communication in downlink NOMA systems with random transmit power,” IEEE Wireless Communications Letters, vol. 9, no. 11, pp. 2000-2004, Jul. 2020.
  • [21] S. Yan, B. He, X. Zhou, Y. Cong, and A. L. Swindlehurst, “Delay-intolerant covert communications with either fixed or random transmit power,” IEEE Transactions on Information Forensics Security, vol. 14, no. 1, pp. 129–140, Jan. 2018.
  • [22] Y. E. Jiang, L. Wang, H. Zhao, and H. H. Chen, “Covert communications in D2D underlaying cellular networks with power domain NOMA,” IEEE Systems Journal, vol. 14, no. 3, pp. 3717-3728, Feb. 2020.
  • [23] M. Forouzesh, P. Azmi, N. Mokari, and D. Goeckel, “Robust power allocation in covert communication: Imperfect CDI,” IEEE Transactions on Vehicular Technology, vol. 70, no. 6, pp. 5789-5802, Apr. 2021.
  • [24] N. Q. Hieu, D. T. Hoang, D. Niyato, and D. I. Kim, “Optimal power allocation for rate splitting communications with deep reinforcement learning,” IEEE Wireless Communications Letters, vol. 10, no. 12, pp. 2820-2823, Oct. 2021.
  • [25] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133-3174, May 2019.
  • [26] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” to appear in IEEE Transactions on Intelligent Transportation Systems, Feb. 2021.
  • [27] R. S. Sutton, and A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [28] J. Schulman, et al., “Proximal policy optimization algorithms,” arXiv preprint, arXiv:1707.06347, 2017.
  • [29] X. Sun, S. Yan, N. Yang, Z. Ding, C. Shen, and Z. Zhong, “Short-packet downlink transmission with non-orthogonal multiple access,” IEEE Transactions on Wireless Communications, vol. 17, no. 7, pp. 4550-4564, Apr. 2018.
  • [30] S. Yan, B. He, Y. Cong, and X. Zhou, “Covert communication with finite blocklength in AWGN channels,” in 2017 IEEE International Conference on Communications, May 2017, pp. 1-6.
  • [31] F. Shu, T. Xu, J. Hu, and S. Yan, “Delay-constrained covert communications with a full-duplex receiver,” IEEE Wireless Communications Letters, vol. 8, no. 3, pp. 813-816, Jan. 2019.
  • [32] M. Dai and B. Clerckx, “Multiuser millimeter wave beamforming strategies with quantized and statistical CSIT,” IEEE Transactions on Wireless Communications, vol. 16, no. 11, pp. 7025–7038, Nov. 2017.
  • [33] V. Mnih, et al, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
  • [34] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, and Q. Wu, “Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 375-388, Sep. 2020.