
Graph-Embedded Multi-Agent Learning for Smart Reconfigurable THz MIMO-NOMA Networks

Xiaoxia Xu, Qimei Chen, Xidong Mu, Yuanwei Liu, and Hao Jiang. X. Xu, Q. Chen, and H. Jiang are with the School of Electronic Information, Wuhan University, Wuhan 430072, China (e-mail: {xiaoxiaxu, chenqimei, jh}@whu.edu.cn). X. Mu is with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China (e-mail: muxidong@bupt.edu.cn). Y. Liu is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK (e-mail: yuanwei.liu@qmul.ac.uk).
Abstract

With the accelerated development of immersive applications and the explosive growth of Internet-of-Things (IoT) terminals, 6G is expected to introduce terahertz (THz) massive multiple-input multiple-output non-orthogonal multiple access (MIMO-NOMA) technologies to meet its ultra-high-speed data rate and massive connectivity requirements. Nevertheless, the unreliability of THz transmissions and the extreme heterogeneity of device requirements pose critical challenges for practical applications. To address these challenges, we propose a novel smart reconfigurable THz MIMO-NOMA framework, which can realize customizable and intelligent communications by flexibly and coordinately reconfiguring hybrid beams through the cooperation between access points (APs) and reconfigurable intelligent surfaces (RISs). The optimization problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP) to maximize the network energy efficiency while guaranteeing the diversified performance requirements of users, via a joint RIS element selection, coordinated discrete phase-shift control, and power allocation strategy. To solve this non-convex, strongly coupled, and highly complex mixed integer nonlinear programming (MINLP) problem, we propose a novel multi-agent deep reinforcement learning (MADRL) algorithm, namely graph-embedded value-decomposition actor-critic (GE-VDAC), which embeds the interaction information of agents and learns a locally optimal solution through a distributed policy. Numerical results demonstrate that the proposed algorithm achieves highly customized communications and outperforms traditional MADRL algorithms.

Index Terms:
Reconfigurable intelligent surface, THz, MIMO-NOMA, MADRL, distributed optimization.

I Introduction

The upcoming 6G era confronts a variety of emerging applications, involving immersive applications such as ultra-high definition (UHD) video and virtual reality/augmented reality (VR/AR), as well as Internet of Things (IoT) applications like wearable devices and smart homes [1]. To meet the unprecedented challenges raised by ultra-wide-band communications and massive IoT connectivity, terahertz (THz) massive multiple-input multiple-output non-orthogonal multiple access (MIMO-NOMA) has become an essential technology for 6G, which can provide $10$ Gbps-order ultra-fast transmission speeds and support millions of connections. Generally, THz MIMO-NOMA systems utilize a large-scale antenna array with a hybrid beamforming structure [2], which can compensate for the severe fading over high-frequency THz bands and reduce the hardware complexity and power consumption. In addition, assisted by the MIMO-NOMA technology [3, 4], highly spatially-correlated users can be grouped into one cluster and supported by a single radio frequency (RF) chain, which can significantly improve spectral efficiency and connection density [5]. To employ the massive MIMO-NOMA technology for THz communications, the authors in [6] proposed an energy-efficient user clustering, hybrid precoding, and power optimization scheme, where the blockage probability and unreliability of line-of-sight (LoS) links were ignored. Nevertheless, due to the high obscuration susceptibility, THz MIMO-NOMA networks may suffer from serious transmission unreliability and intermittency resulting from wall or human-body blockage effects [7], which may significantly degrade the experience of unreliability-sensitive 6G immersive applications.

Fortunately, the newly-emerged reconfigurable intelligent surface (RIS) technology is regarded as a promising way to overcome the shortcomings of THz MIMO-NOMA communications [8, 9, 10]. Specifically, RISs can dynamically transform and reshape spatial beams, and thus construct virtual LoS links between transmitters and receivers to avoid blockage [11]. Meanwhile, a smart radio environment can be created to achieve significant spectrum/energy efficiency improvements and flexible scheduling [12, 13]. Given the aforementioned benefits, increasing research efforts have been devoted to RIS-aided MIMO-NOMA networks operating at low frequencies. In [14], the authors proposed a joint passive and active beamforming method for the RIS-aided multiple-input single-output (MISO)-NOMA network, which obtained a locally optimal solution based on a second-order cone programming (SOCP)-alternating direction method of multipliers (ADMM) algorithm. The authors in [15] further studied joint passive and active beamforming in RIS-aided MISO-NOMA networks under both ideal and non-ideal RIS assumptions. Furthermore, the authors in [16] proposed joint deployment and beamforming frameworks for RIS-aided MIMO-NOMA networks based on deep reinforcement learning. However, the existing methods are inapplicable to 6G THz MIMO-NOMA communications: 1) The existing MIMO-NOMA mechanisms are incapable of dealing with the extremely heterogeneous quality-of-service (QoS) requirements of 6G users. 2) Compared with low-frequency MIMO communications, THz MIMO communications face a more prominent transmission unreliability problem. 3) Since THz MIMO-NOMA networks usually involve high-dimensional spatial channel information, existing centralized and iterative optimization algorithms incur unacceptably high complexity and information exchange overhead when scheduling complicated THz MIMO-NOMA scenarios.

Therefore, in this work we aim to propose a novel smart reconfigurable THz MIMO-NOMA framework that can realize customizable and intelligent indoor communications with high energy efficiency and low complexity based on a machine learning mechanism. Here, we consider two types of heterogeneous users, namely IoT users and super-fast-experience (SE) users. Specifically, SE users, like VR/AR and UHD video users, require ultra-high-speed and reliable immersive communication experiences, while the densely connected IoT users, like smart cleaners and smart watches, generate sporadic traffic and tolerate transmission unreliability. Different from conventional systems, we introduce a low-complexity and decentralized learning-based framework that jointly designs the user clustering, NOMA decoding, and hybrid beam reconfiguration schemes in a cooperative setting with multiple APs and RISs: 1) By adaptively aligning users' equivalent spatial channels as well as customizing NOMA decoding orders based on QoS requirements, the intra-cluster interference suffered by SE users can be completely eliminated. 2) We adjust the highly-directional hybrid beams through the cooperation among APs and RISs, which can ensure tailored spatial data channels and mitigate both inter-cluster and inter-AP interference via active hybrid beamforming and coordinated passive beamforming. 3) To overcome the non-ideal discrete phase-shift control, we exploit a dynamic RIS element selection structure for the hybrid beam reconfiguration, which can flexibly suppress unfavorable reflections via element-wise ON/OFF control to enhance energy efficiency. Overall, the proposed framework can realize customizable hybrid spatial- and power-domain multiplexing, as well as improve multi-domain resource utilization.

Based on the above framework, we propose a long-term joint RIS element selection, coordinated discrete phase-shift control, and power allocation learning strategy. The objective is to maximize the system energy efficiency while satisfying the users' data rate and reliability requirements, which yields an NP-hard mixed-integer nonlinear programming (MINLP) problem. To efficiently solve this non-convex, strongly coupled, and highly complex MINLP problem online, we transform it into a decentralized partially observable Markov decision process (Dec-POMDP). Thereafter, we introduce a novel cooperative multi-agent deep reinforcement learning (MADRL) method, namely graph-embedded value-decomposition actor-critic (GE-VDAC), to efficiently coordinate multiple APs and RISs in a distributed manner. The proposed GE-VDAC algorithm can not only improve the generalization ability of multi-agent learning, but also reduce the information exchange overhead.

The main contributions of this work can be summarized as follows.

  • We propose a novel smart reconfigurable THz MIMO-NOMA framework, which can realize highly customizable and intelligent communications to support ultra-wide bands and ultra-dense connections. The hybrid spatial beams are smartly and cooperatively reconfigured through multi-AP and multi-RIS coordination, enabling dynamic RIS element selection, on-demand data enhancement, flexible interference suppression, and efficient hybrid spatial- and power-domain multiplexing.

  • The long-term joint RIS element selection, coordinated discrete phase-shift control, and power allocation optimization problem is formulated as a Dec-POMDP model. Under customized user clustering, NOMA decoding, and sub-connected hybrid beamforming schemes, the Dec-POMDP model maximizes the expected system energy efficiency while satisfying the extremely heterogeneous data rate and reliability requirements of different users.

  • To efficiently solve the non-convex, strongly coupled, and highly complex MINLP problem online, we propose a novel distributed MADRL algorithm, namely GE-VDAC, which learns the decomposed local policies by embedding the agents' interaction information into dimension-reduced and permutation-invariant features. We show that the proposed GE-VDAC can not only converge to a locally optimal solution, but also achieve better coordination and generalization with low information exchange overhead.

  • We present numerical results to verify the effectiveness of the proposed strategy. The proposed GE-VDAC achieves higher system energy efficiency and faster learning speed than traditional MADRL algorithms. Moreover, SE users can maintain both reliable and ultra-high-speed communications despite the increasing number of connected users.

The rest of this paper is organized as follows. Section II describes the smart reconfigurable THz MIMO-NOMA network. Section III presents the formulated Dec-POMDP problem, and Section IV proposes the GE-VDAC based distributed MADRL algorithm. Numerical results are presented in Section V before the conclusion in Section VI.

Notation: We denote the imaginary unit by $j$, and represent vectors and matrices by lower- and upper-case boldface symbols, respectively. $\frac{\partial F}{\partial x}$ denotes the first partial derivative of function $F$ with respect to $x$. $\mathbb{E}\left[\cdot\right]$ represents the statistical expectation. $|\cdot|$ and $\|\cdot\|$ denote the absolute value and the Euclidean norm, respectively. Moreover, $\star$ denotes the permutation operation. The main notations used throughout this paper are summarized in Table I.

TABLE I: Main Notations
Notations | Definitions
$N_{\mathrm{A}}$, $N_{\mathrm{R}}$ | number of antennas and RF chains
$N_{\mathrm{sub}}$ | number of antennas in each subarray
$\mathbf{V}^{m}$, $\mathbf{W}^{m}$ | analog and digital beamformers at AP $m$
$K_{\mathrm{S}}$, $K_{\mathrm{U}}$ | number of SE and IoT users associated with each AP
$K_{n}^{m}(t)$ ($\mathcal{K}_{n}^{m}(t)$) | number (set) of users served by RF chain $n$ of AP $m$
$\bm{\Theta}^{j}(t)$ | phase-shift matrix of RIS $j$
$\omega_{l}^{j}(t)$, $\theta_{l}^{j}(t)$ | ON/OFF state and phase shift of element $l$ at RIS $j$
$p_{nk}^{m}(t)$, $P_{n}^{m}(t)$ | power allocated to user $U_{nk}^{m}$ and to cluster $n$
$\mathbf{h}_{nk}^{im}(t)$ | equivalent channel from AP $i$ to user $U_{nk}^{m}$
$\widetilde{\mathbf{h}}_{nk}^{im}(t)$ | direct channel from AP $i$ to user $U_{nk}^{m}$
$\mathbf{f}_{nk}^{jm}(t)$ | channel from RIS $j$ to user $U_{nk}^{m}$
$\mathbf{G}^{ij}(t)$ | channel from AP $i$ to RIS $j$
$q_{nk}^{m}(t)$, $Y_{nk}^{m}(t)$ | traffic queue and virtual queue of user $U_{nk}^{m}$
$\bm{\Omega}(t)$, $\mathbf{s}(t)$, $\mathbf{a}(t)$ | joint observation, state, and action vectors
$\mathbf{o}_{i}^{u}(t)$ | vectorized node feature of the $u$-th type agent $i\in\mathcal{I}^{u}$
$\mathbf{d}_{ii^{\prime}}(t)$ | vectorized feature of edge $(i,i^{\prime})$
$\mathbf{e}_{ii^{\prime}}(t)$ | embedded feature of edge $(i,i^{\prime})$
$\mathbf{z}_{i}^{u}(t)$, $\widetilde{\mathbf{z}}_{i}^{u}(t)$ | hidden and local embedded states of agent $i\in\mathcal{I}^{u}$

II Smart Reconfigurable THz MIMO-NOMA Framework

Figure 1: The proposed smart reconfigurable THz massive MIMO-NOMA network.

As shown in Fig. 1, we consider an indoor downlink massive MIMO-NOMA THz network that serves densely distributed SE users and IoT users under multiple pre-installed THz APs and RISs. Denote the sets of $M$ APs and $J$ RISs as $\mathcal{M}=\{1,2,\ldots,M\}$ and $\mathcal{J}=\{1,2,\ldots,J\}$, respectively. By coordinating and cooperating with neighboring APs and RISs, each AP $m$ serves a set $\mathcal{K}_{\mathrm{S}}^{m}$ of $K_{\mathrm{S}}^{m}$ SE users and a set $\mathcal{K}_{\mathrm{U}}^{m}$ of $K_{\mathrm{U}}^{m}$ IoT users, which have utterly diversified QoS requirements in terms of transmission data rate and reliability. For simplicity, we assume $K_{\mathrm{S}}^{1}=\cdots=K_{\mathrm{S}}^{M}=K_{\mathrm{S}}$ and $K_{\mathrm{U}}^{1}=\cdots=K_{\mathrm{U}}^{M}=K_{\mathrm{U}}$. Here, we denote $\mathcal{K}^{m}=\mathcal{K}_{\mathrm{U}}^{m}\cup\mathcal{K}_{\mathrm{S}}^{m}$, $K^{m}=K_{\mathrm{U}}+K_{\mathrm{S}}$, and $\mathcal{K}=\mathcal{K}^{1}\cup\mathcal{K}^{2}\cup\cdots\cup\mathcal{K}^{M}$. To reduce hardware complexity and energy dissipation, we assume each user is equipped with a single antenna and the APs apply the sub-connected hybrid beamforming structure [17]. Moreover, each AP is equipped with $N_{\mathrm{A}}$ antennas and $N_{\mathrm{R}}$ RF chains, where each RF chain connects to a subarray of $N_{\mathrm{sub}}$ antennas via $N_{\mathrm{sub}}$ phase shifters, with $N_{\mathrm{sub}}=N_{\mathrm{A}}/N_{\mathrm{R}}$. Define the analog and digital beamforming matrices at each AP $m$ as $\mathbf{V}^{m}\in\mathbb{C}^{N_{\mathrm{A}}\times N_{\mathrm{R}}}$ and $\mathbf{W}^{m}\in\mathbb{C}^{N_{\mathrm{R}}\times N_{\mathrm{R}}}$, and denote the phase-shift matrix of each RIS $j$ by $\bm{\Theta}^{j}$. Based on the MIMO-NOMA technology, highly spatially-correlated SE and IoT users can be grouped into one cluster, which is served by a reconfigured beam transmitted from an AP antenna subarray connected to an RF chain. Define $U_{nk}^{m}$ as user $k$ in cluster $n$ under AP $m$, and let $\mathcal{K}_{n}^{m}$ be the set of $K_{n}^{m}$ users in cluster $n$ under AP $m$. Furthermore, the whole transmission period is divided into $T$ time slots, indexed by $\mathcal{T}=\{1,2,\ldots,T\}$.

In general, we aim to propose a learning-based mechanism for the smart reconfigurable THz MIMO-NOMA network that jointly coordinates multiple APs and RISs, whose flow chart is shown in Fig. 2. Decentralized scheduling begins after channel estimation, and consists of learning-based decentralized policies followed by low-complexity schemes. Through the learning-based decentralized policies, the dynamic RIS element selection, discrete phase-shift control, and power allocation are determined with incomplete system information. Based on the smartly reconfigured channels, each AP further obtains the QoS-based user clustering, customized NOMA decoding order, and sub-connected hybrid beamformer using low-complexity schemes. Thereafter, data transmissions can be performed.

Figure 2: Process of the proposed learning-based mechanism.

Without loss of generality, we assume each RIS is equipped with a low-cost RF chain that can estimate the CSI based on semi-passive RIS channel estimation methods [18]. Specifically, at the beginning of each time slot, the APs send pilot signals to neighboring users to estimate the direct channels. Meanwhile, the RISs turn off their reflecting elements to sense the channels, and the indirect channels can be estimated by processing the received signals. Both direct and indirect channels can be estimated using compressed sensing or deep learning methods, which are out of the scope of this work. For notational simplicity, we omit the time slot index $t$ from Sections II-A to II-G.

II-A RIS Element Selection and Discrete Phase-Shift Control

We suppose that each RIS $j$ consists of $L$ reflecting elements, which are controlled by a software-defined RIS controller. Since continuous phase control is hard to realize in practice, a finite-resolution passive phase shifter is utilized for each reflecting element. To save energy consumption as well as to overcome the non-ideal reflecting effect caused by the discrete phase-shift control, we propose to leverage a dynamic RIS element selection structure during transmissions. (Note that dynamic RIS element selection requires a complex design for the RIS array, which might be realized in the near future. In this work, we assume the reflecting elements of the RISs can be dynamically turned ON/OFF by controlling the PIN diodes.) By flexibly controlling the ON/OFF states and the phase shifts of the passive phase shifters, the dynamic selection structure can prevent unfavorable reflections and achieve higher energy efficiency.

Based on the dynamic selection structure, the phase-shift matrix $\bm{\Theta}^{j}$ of RIS $j$ can be given by

$\bm{\Theta}^{j}=\mathrm{diag}\left(\omega_{1}^{j}e^{j\theta_{1}^{j}},\ldots,\omega_{L}^{j}e^{j\theta_{L}^{j}}\right), \qquad (1)$

where $\omega_{l}^{j}$ denotes the ON/OFF state of the $l$-th RIS element, given by

$\omega_{l}^{j}=\begin{cases}1,&\text{if reflecting element }l\text{ at RIS }j\text{ is turned ON},\\ 0,&\text{otherwise}.\end{cases} \qquad (2)$

The discrete phase shift of a selected element $l$ at RIS $j$ is determined by an integer $\beta_{l}^{j}\in\{0,1,\ldots,2^{B^{\mathrm{R}}}-1\}$ and the RIS resolution bit $B^{\mathrm{R}}$, which can be written as

$\theta_{l}^{j}\in\mathcal{F}=\left\{2^{1-B^{\mathrm{R}}}\pi\beta_{l}^{j}\,\big{|}\,\beta_{l}^{j}\in\{0,1,\ldots,2^{B^{\mathrm{R}}}-1\}\right\}. \qquad (3)$
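To make the element selection and quantization rules concrete, the following Python sketch (with a hypothetical helper name; only the quantization rule of (3) and the structure of (1)-(2) are taken from the paper) constructs $\bm{\Theta}^{j}$ from the binary selection vector and the phase indices:

```python
import numpy as np

def ris_phase_shift_matrix(omega, beta, bits):
    """Build the diagonal RIS phase-shift matrix of (1).

    omega : (L,) binary ON/OFF states of the reflecting elements, eq. (2).
    beta  : (L,) integers in {0, ..., 2**bits - 1} indexing the
            quantized phases of eq. (3).
    bits  : RIS phase resolution B^R.
    """
    theta = 2.0 ** (1 - bits) * np.pi * beta      # theta_l^j of eq. (3)
    return np.diag(omega * np.exp(1j * theta))    # Theta^j of eq. (1)

# Example: an 8-element RIS with 2-bit phase shifters; elements 3 and 6 OFF.
omega = np.array([1, 1, 0, 1, 1, 0, 1, 1])
beta = np.array([0, 1, 2, 3, 0, 1, 2, 3])
Theta = ris_phase_shift_matrix(omega, beta, bits=2)
```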

II-B Channel Model

The equivalent channel vector from AP $i\in\mathcal{M}$ to user $U_{nk}^{m}$ via multiple RISs is defined as $\mathbf{h}_{nk}^{im}=[\mathbf{h}_{nk}^{im1},\ldots,\mathbf{h}_{nk}^{imN_{\mathrm{R}}}]\in\mathbb{C}^{1\times N_{\mathrm{A}}}$, where $\mathbf{h}_{nk}^{ims}\in\mathbb{C}^{1\times N_{\mathrm{sub}}}$ is the equivalent channel from subarray $s$ at AP $i$ to user $U_{nk}^{m}$. Here, $\mathbf{h}_{nk}^{im}$ is given by

$\mathbf{h}_{nk}^{im}=\underbrace{\widetilde{\mathbf{h}}_{nk}^{im}}_{\text{AP-user}}+\sum_{j=1}^{J}\underbrace{\mathbf{f}_{nk}^{jm}\bm{\Theta}^{j}\mathbf{G}^{ij}}_{\text{AP-RIS-user}}, \qquad (4)$

where $\widetilde{\mathbf{h}}_{nk}^{im}\in\mathbb{C}^{1\times N_{\mathrm{A}}}$, $\mathbf{f}_{nk}^{jm}\in\mathbb{C}^{1\times L}$, and $\mathbf{G}^{ij}\in\mathbb{C}^{L\times N_{\mathrm{A}}}$ denote the direct spatial channels of the AP $i$-user $U_{nk}^{m}$, RIS $j$-user $U_{nk}^{m}$, and AP $i$-RIS $j$ links, respectively.

We assume LoS paths always exist between the APs and their neighboring RISs. However, the LoS paths of the AP-user and RIS-user links may be blocked. Therefore, the channel $\widetilde{\mathbf{h}}_{nk}^{im}$ can be modeled by

$\widetilde{\mathbf{h}}_{nk}^{im}=\mathbbm{1}_{nk}^{im,\mathrm{LoS}}\sqrt{N_{\mathrm{A}}}\sqrt{L_{\mathrm{path}}(f,d_{nk}^{im,\mathrm{AU}})}\,\varUpsilon\bm{\alpha}^{H}\left(\varphi_{nk}^{im0}\right)+\sum_{l=1}^{n_{\mathrm{NL}}}\sqrt{N_{\mathrm{A}}}\sqrt{L_{\mathrm{path}}(f,d_{nk}^{im,\mathrm{AU}})}\,\lambda_{nk}^{iml}(f)\varUpsilon\bm{\alpha}^{H}\left(\varphi_{nk}^{iml}\right), \qquad (5)$

where $\varUpsilon$ is the antenna gain, and $L_{\mathrm{path}}(f,d_{nk}^{im,\mathrm{AU}})$ is the path loss determined by the frequency $f$ and the distance $d_{nk}^{im,\mathrm{AU}}$ between AP $i$ and user $U_{nk}^{m}$. $\mathbbm{1}_{nk}^{im,\mathrm{LoS}}\in\{0,1\}$ indicates the existence of the LoS path, which is determined by the LoS probability of the indoor THz blockage model [7]. Moreover, $n_{\mathrm{NL}}$ signifies the number of NLoS paths, $\lambda_{nk}^{iml}$ is the reflection coefficient of NLoS path $l$ [19], and $\varphi_{nk}^{iml}$ denotes the angle of departure (AoD) of the downlink channel between AP $i$ and user $U_{nk}^{m}$. Given AoD $\varphi$, the array response vector can be denoted by $\bm{\alpha}(\varphi)=\frac{1}{\sqrt{N_{\mathrm{A}}}}\left[1,\ldots,e^{j\pi[m\sin\varphi]},\ldots,e^{j\pi[(N_{\mathrm{A}}-1)\sin\varphi]}\right]^{T}$. According to [20], the path loss $L_{\mathrm{path}}(f,d)$ includes both spreading loss and absorption loss, i.e.,

$L_{\mathrm{path}}(f,d)\,[\mathrm{dB}]=L_{\mathrm{spread}}(f,d)\,[\mathrm{dB}]+L_{\mathrm{absorption}}(f,d)\,[\mathrm{dB}]=20\log_{10}\left(\frac{c}{4\pi fd}\right)-10k(f)d\log_{10}e, \qquad (6)$

where $k(f)$ is the frequency-dependent medium absorption coefficient, $d$ is the distance, and $c$ is the speed of light.
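As a quick numerical illustration of (6), the sketch below evaluates the THz path gain in dB; the absorption coefficient value is illustrative only, since in practice $k(f)$ is taken from a molecular absorption database:

```python
import numpy as np

def thz_path_loss_db(f, d, k_f):
    """Path loss of (6): spreading loss plus molecular absorption loss.

    f   : carrier frequency (Hz).
    d   : distance (m).
    k_f : medium absorption coefficient k(f) (1/m), assumed known.
    """
    c = 3e8  # speed of light (m/s)
    spreading = 20.0 * np.log10(c / (4.0 * np.pi * f * d))
    absorption = -10.0 * k_f * d * np.log10(np.e)
    return spreading + absorption

# Example: 0.3 THz, 10 m link, assumed k(f) = 0.01 m^-1.
print(thz_path_loss_db(f=0.3e12, d=10.0, k_f=0.01))  # about -102.4 dB
```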

II-C User Clustering

In traditional CSI-based massive MIMO-NOMA networks, user clustering usually contains a cluster-head selection (CHS) procedure that chooses cluster heads based on channel conditions, after which each remaining user is grouped with the cluster head exhibiting the highest channel correlation [21]. However, CSI-based user clustering is inflexible for guaranteeing the ultra-high data rate and mitigating interference for SE users. To achieve customizable and intelligent communication experiences, we extend the traditional CSI-based user clustering to a QoS-based user clustering. Relying on the smartly reconfigured beams, users' spatial channels can be flexibly aligned to achieve adaptive user clustering. Specifically, we group multiple IoT users with a single SE user. In this way, we can ensure the multiplexing gain of SE users while enhancing IoT user connections. In light of this, we assume the number of RF chains activated at each AP equals the number of its served SE users, i.e., $N_{\mathrm{R}}=K_{\mathrm{S}}$. Let $U_{n1}^{m}$ represent the SE user, and $U_{nk}^{m}$, $\forall k>1$, be the IoT users. The low-complexity user clustering scheme based on the reconfigured channels can be stated as follows.

The SE users are firstly selected as the cluster heads of different clusters. Then, we define the channel spatial correlation $\mathcal{C}\left(\mathbf{h}_{k}^{mm},\mathbf{h}_{n1}^{mm}\right)$ between an ungrouped IoT user $k$ and the cluster head $U_{n1}^{m}$ as

$\mathcal{C}\left(\mathbf{h}_{k}^{mm},\mathbf{h}_{n1}^{mm}\right)=\frac{\left|\mathbf{h}_{n1}^{mm}\left(\mathbf{h}_{k}^{mm}\right)^{H}\right|}{\|\mathbf{h}_{k}^{mm}\|\,\|\mathbf{h}_{n1}^{mm}\|}. \qquad (7)$

Thereafter, multiple IoT users can be non-orthogonally grouped into the same cluster, where each IoT user is assigned to the cluster head with which it has the strongest reconfigured channel correlation. The computational complexity of the QoS-based clustering scheme is $\mathcal{O}\left(K_{\mathrm{U}}N_{\mathrm{R}}\right)$. In comparison, the conventional CSI-based clustering scheme has a computational complexity of $\mathcal{O}\left(N_{\mathrm{R}}\left(K_{\mathrm{U}}+K_{\mathrm{S}}\right)^{2}\right)$ [21].
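A minimal sketch of this QoS-based clustering rule (hypothetical function name; the correlation metric is (7)) is given below; each of the $K_{\mathrm{U}}$ IoT users is scored against the $N_{\mathrm{R}}$ SE cluster heads exactly once, which yields the stated $\mathcal{O}(K_{\mathrm{U}}N_{\mathrm{R}})$ complexity:

```python
import numpy as np

def qos_based_clustering(H_se, H_iot):
    """Group each IoT user with the most channel-correlated SE cluster head.

    H_se  : (N_R, N_A) reconfigured channels of the SE users (cluster heads).
    H_iot : (K_U, N_A) reconfigured channels of the ungrouped IoT users.
    Returns clusters[n] = list of IoT user indices grouped with head n.
    """
    clusters = [[] for _ in range(H_se.shape[0])]
    head_norms = np.linalg.norm(H_se, axis=1)
    for k, h_k in enumerate(H_iot):
        # Channel spatial correlation of (7) with every cluster head.
        corr = np.abs(H_se @ h_k.conj()) / (head_norms * np.linalg.norm(h_k))
        clusters[int(np.argmax(corr))].append(k)
    return clusters
```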

II-D Customized NOMA Decoding

Relying on the adaptively aligned channels through RISs, the NOMA user decoding order in each cluster can be flexibly customized based on QoS requirements without degrading the system performance. Exploiting NOMA successive interference cancellation (SIC), SE users can subtract the signals of the other users within the same cluster and completely eliminate intra-cluster interference based on power-domain multiplexing, so as to realize ultra-high-speed and reliability-guaranteed communications. To realize this goal, the SE user, which has the highest QoS requirements in its cluster, is decoded last, while the data signals of the $K_{n}^{m}-1$ IoT users in each cluster $n$ are decoded first.

To ensure that SIC at the IoT users in the same cluster can be successfully carried out, the NOMA decoding order of the IoT users is rearranged according to their equivalent channel gains as $|\mathbf{h}_{n2}^{m}\mathbf{V}^{m}\mathbf{w}_{n}^{m}|^{2}\geq|\mathbf{h}_{n3}^{m}\mathbf{V}^{m}\mathbf{w}_{n}^{m}|^{2}\geq\cdots\geq|\mathbf{h}_{nK_{n}^{m}}^{m}\mathbf{V}^{m}\mathbf{w}_{n}^{m}|^{2}$. Thereafter, the SE user sequentially decodes the signals of the former $K_{n}^{m}-1$ IoT users to completely cancel the intra-cluster interference. To successfully cancel the interference that IoT user $U_{nk}^{m}$, $\forall k>1$, imposes on SE user $U_{n1}^{m}$, we have the following SIC constraint

$\gamma_{n1\to nk}^{m}\geq\gamma_{nk\to nk}^{m},\ \forall k>1, \qquad (8)$

where $\gamma_{nk\to nk}^{m}$ is the signal-to-interference-plus-noise ratio (SINR) of IoT user $U_{nk}^{m}$ when decoding its own signal, and $\gamma_{n1\to nk}^{m}$ is the SINR when decoding the signal of IoT user $U_{nk}^{m}$ at SE user $U_{n1}^{m}$. We denote $\mathbbm{1}_{nk}^{m,\mathrm{fail}}=0$ if the SIC constraint (8) is satisfied, i.e., the SE user $U_{n1}^{m}$ can cancel the intra-cluster interference from IoT user $U_{nk}^{m}$. Otherwise, $\mathbbm{1}_{nk}^{m,\mathrm{fail}}=1$, and the intra-cluster interference from IoT user $U_{nk}^{m}$ is treated as noise at the SE user. In this way, $\gamma_{n1\to nk}^{m}$ can be expressed as

$\gamma_{n1\to nk}^{m}=\frac{\left|\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}p_{nk}^{m}}{\left(\sum_{k^{\prime}=1}^{k-1}p_{nk^{\prime}}^{m}+\sum_{k^{\prime}>k}\mathbbm{1}_{nk^{\prime}}^{m,\mathrm{fail}}p_{nk^{\prime}}^{m}\right)\left|\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}+\sum_{(m^{\prime},n^{\prime})\neq(m,n)}\left|\mathbf{h}_{n1}^{m^{\prime}m}\mathbf{V}^{m^{\prime}}\mathbf{w}_{n^{\prime}}^{m^{\prime}}\right|^{2}P_{n^{\prime}}^{m^{\prime}}+\sigma^{2}}, \qquad (9)$

where $\sigma^{2}$ is the power spectral density of the additive white Gaussian noise (AWGN), $p_{nk}^{m}$ is the power allocated to user $U_{nk}^{m}$, and $P_{n}^{m}=\sum_{k=1}^{K_{n}^{m}}p_{nk}^{m}$ is the power consumption of RF chain $n$ at AP $m$.

II-E Sub-Connected Hybrid Beamforming

We design the analog beamforming at each AP to increase the effective data gain of each cluster, and design the zero-forcing (ZF) based digital beamforming to suppress the residual inter-cluster interference that is not mitigated by the RISs.

The analog beamforming matrix with the sub-connected structure can be formulated as

$\mathbf{V}^{m}=\begin{bmatrix}\mathbf{v}_{m11}^{\mathrm{sub}}&\cdots&\mathbf{v}_{m1N}^{\mathrm{sub}}\\ \vdots&\ddots&\vdots\\ \mathbf{v}_{mN1}^{\mathrm{sub}}&\cdots&\mathbf{v}_{mNN}^{\mathrm{sub}}\end{bmatrix}=\begin{bmatrix}\mathbf{v}_{m11}^{\mathrm{sub}}&\cdots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\cdots&\mathbf{v}_{mNN}^{\mathrm{sub}}\end{bmatrix}, \qquad (10)$

where $\mathbf{v}_{mni}^{\mathrm{sub}}=\bm{0}$ for $n\neq i$, and each entry of $\mathbf{v}_{mnn}^{\mathrm{sub}}\in\mathbb{C}^{N_{\mathrm{sub}}\times 1}$ has amplitude $\frac{1}{\sqrt{N_{\mathrm{sub}}}}$. The sub-connected analog beamforming is designed to improve the effective gains toward the cluster heads, which are highly spatially-correlated with their cluster members. Considering $B^{\mathrm{A}}$-bit quantized phase shifters in each antenna subarray, the $i$-th element of the active analog beamforming vector $\mathbf{v}_{mnn}^{\mathrm{sub}}$ is given by [6, 17]

$\mathbf{v}_{mnn}^{\mathrm{sub}}(i)=\frac{1}{\sqrt{N_{\mathrm{sub}}}}e^{-j\frac{2\pi\varpi_{mn}^{*}}{2^{B^{\mathrm{A}}}}},\ i\in\{1,2,\ldots,N_{\mathrm{sub}}\}, \qquad (11)$

where $\varpi_{mn}^{*}$ is the quantized phase index calculated by

$\varpi_{mn}^{*}=\arg\min_{\varpi\in\{0,1,\ldots,2^{B^{\mathrm{A}}}-1\}}\left|e^{j\frac{2\pi\varpi}{2^{B^{\mathrm{A}}}}}-\frac{\mathbf{h}_{n1}^{mm}(i)}{\left|\mathbf{h}_{n1}^{mm}(i)\right|}\right|. \qquad (12)$

To further suppress the effective inter-cluster interference that has not been canceled by the RISs, we design the ZF digital beamforming as follows. Let $\mathbf{w}_{n}^{m}\in\mathbb{C}^{N_{\mathrm{R}}\times 1}$ denote the digital beamforming vector for cluster $n$ at AP $m$. The cluster center is given by $\widehat{\mathbf{h}}_{n}^{mm}=\frac{1}{K_{n}^{m}}\sum_{k\in\mathcal{K}_{n}^{m}}\left(\mathbf{h}_{nk}^{mm}\mathbf{V}^{m}\right)^{T}$, and we denote $\widehat{\mathbf{H}}^{mm}=\left[\widehat{\mathbf{h}}_{1}^{mm},\widehat{\mathbf{h}}_{2}^{mm},\ldots,\widehat{\mathbf{h}}_{N_{\mathrm{R}}}^{mm}\right]$. Therefore, the ZF digital beamforming is calculated as

$\widehat{\mathbf{W}}^{m}=\left[\widehat{\mathbf{w}}_{1}^{m},\ldots,\widehat{\mathbf{w}}_{N_{\mathrm{R}}}^{m}\right]=\left(\widehat{\mathbf{H}}^{mm}\right)^{H}\left[\widehat{\mathbf{H}}^{mm}\left(\widehat{\mathbf{H}}^{mm}\right)^{H}\right]^{-1}. \qquad (13)$

By normalizing the column power, the baseband precoding vector can be expressed as

$\mathbf{w}_{n}^{m}=\frac{\widehat{\mathbf{w}}_{n}^{m}}{\left\|\mathbf{V}^{m}\widehat{\mathbf{w}}_{n}^{m}\right\|_{2}}. \qquad (14)$
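Under the above definitions, a compact NumPy sketch of the ZF design in (13)-(14) is as follows (hypothetical helper name; a sketch, not the paper's implementation):

```python
import numpy as np

def zf_digital_beamformer(H_hat, V):
    """ZF digital beamforming of (13) with the column normalization of (14).

    H_hat : (N_R, N_R) cluster-center matrix H_hat^{mm} of (13).
    V     : (N_A, N_R) sub-connected analog beamformer of (10).
    Returns W = [w_1, ..., w_{N_R}] with unit-power hybrid columns.
    """
    # W_hat = H^H (H H^H)^{-1}, i.e., the right pseudo-inverse in (13).
    W_hat = H_hat.conj().T @ np.linalg.inv(H_hat @ H_hat.conj().T)
    # Normalize each column n by || V w_hat_n ||_2 as in (14).
    return W_hat / np.linalg.norm(V @ W_hat, axis=0, keepdims=True)
```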

II-F Transmission Rate

Let $s_{nk}^{m}$ denote the transmitted symbol with normalized power, i.e., $\mathbb{E}\{s_{nk}^{m}(s_{nk}^{m})^{H}\}=1$, and let $x_{n}^{m}=\sum_{k=1}^{K_{n}^{m}}\sqrt{p_{nk}^{m}}s_{nk}^{m}$ be the superimposed signal transmitted to cluster $n$. Define $z_{nk}^{m}$ as the AWGN. The baseband signal received by each SE user $U_{n1}^{m}$ after SIC can be formulated as

$y_{n1}^{m}=\underbrace{\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\sqrt{p_{n1}^{m}}s_{n1}^{m}}_{\text{desired signal}}+\underbrace{\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\sum_{k>1}\mathbbm{1}_{nk}^{m,\mathrm{fail}}\sqrt{p_{nk}^{m}}s_{nk}^{m}}_{\text{residual intra-cluster interference}}+\underbrace{\sum_{(m^{\prime},n^{\prime})\neq(m,n)}\mathbf{h}_{n1}^{m^{\prime}m}\mathbf{V}^{m^{\prime}}\mathbf{w}_{n^{\prime}}^{m^{\prime}}x_{n^{\prime}}^{m^{\prime}}}_{\text{inter-cluster interference}}+\underbrace{z_{n1}^{m}}_{\text{noise}}. \qquad (15)$

Therefore, the SINR of each SE user $U_{n1}^{m}$ can be written as

$\gamma_{n1}^{m}=\frac{\left|\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}p_{n1}^{m}}{\sum_{k>1}\mathbbm{1}_{nk}^{m,\mathrm{fail}}\left|\mathbf{h}_{n1}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}p_{nk}^{m}+\sum_{(m^{\prime},n^{\prime})\neq(m,n)}\left|\mathbf{h}_{n1}^{m^{\prime}m}\mathbf{V}^{m^{\prime}}\mathbf{w}_{n^{\prime}}^{m^{\prime}}\right|^{2}P_{n^{\prime}}^{m^{\prime}}+\sigma^{2}}. \qquad (16)$

Moreover, the SINR of each IoT user $U_{nk}^{m}$, $\forall k>1$, can be given by

$\gamma_{nk}^{m}=\frac{\left|\mathbf{h}_{nk}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}p_{nk}^{m}}{\sum_{k^{\prime}=1}^{k-1}\left|\mathbf{h}_{nk}^{mm}\mathbf{V}^{m}\mathbf{w}_{n}^{m}\right|^{2}p_{nk^{\prime}}^{m}+\sum_{(m^{\prime},n^{\prime})\neq(m,n)}\left|\mathbf{h}_{nk}^{m^{\prime}m}\mathbf{V}^{m^{\prime}}\mathbf{w}_{n^{\prime}}^{m^{\prime}}\right|^{2}P_{n^{\prime}}^{m^{\prime}}+\sigma^{2}}. \qquad (17)$

Hence, the data rate of each user $U_{nk}^{m}$ at each time slot can be expressed as

$R_{nk}^{m}=\log_{2}\left(1+\gamma_{nk}^{m}\right),\ \forall k\in\mathcal{K}. \qquad (18)$

II-G Power Consumption

The total power consumption of the proposed network at each time slot includes both the transmit and circuit power, and can be formulated as

$P=\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\xi P_{n}^{m}+KP_{\mathrm{D}}+MP_{\mathrm{AP}}+\sum_{j=1}^{J}\sum_{l=1}^{L}\omega_{l}^{j}P_{\mathrm{RIS}}\left(B^{\mathrm{R}}\right), \qquad (19)$

where $\xi$ denotes the inefficiency of the phase shifter in the THz network, $P_{\mathrm{D}}$ is the circuit power consumption at each user, and $P_{\mathrm{RIS}}\left(B^{\mathrm{R}}\right)$ denotes the hardware power dissipated at each RIS element, which depends on the RIS resolution bit $B^{\mathrm{R}}$ [23]. Moreover, $P_{\mathrm{AP}}$ is the circuit power consumption at each AP, which is given by

$P_{\mathrm{AP}}=P_{\mathrm{BB}}+N_{\mathrm{R}}P_{\mathrm{RF}}+N_{\mathrm{A}}\left(P_{\mathrm{PS}}+P_{\mathrm{A}}\right), \qquad (20)$

where $P_{\mathrm{BB}}$ is the baseband power consumption, and $P_{\mathrm{RF}}$, $P_{\mathrm{PS}}$, and $P_{\mathrm{A}}$ denote the power consumption per RF chain, per phase shifter, and per power amplifier, respectively.

Therefore, the network energy efficiency $\eta$ at each time slot can be formulated as

$\eta=\frac{\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}}\log_{2}\left(1+\gamma_{nk}^{m}\right)}{\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\xi P_{n}^{m}+KP_{\mathrm{D}}+MP_{\mathrm{AP}}+\sum_{j=1}^{J}\sum_{l=1}^{L}\omega_{l}^{j}P_{\mathrm{RIS}}\left(B^{\mathrm{R}}\right)}. \qquad (21)$
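For clarity, the sketch below assembles (19)-(21) into a single energy efficiency evaluation (hypothetical helper name and argument layout; rates in bit/s/Hz, powers in watts):

```python
import numpy as np

def network_energy_efficiency(rates, P_rf_chains, omega, xi,
                              P_D, P_AP, P_RIS_elem, K, M):
    """Network energy efficiency eta of (21).

    rates       : per-user rates log2(1 + SINR), eq. (18), over all users.
    P_rf_chains : per-RF-chain transmit powers P_n^m over all APs.
    omega       : (J, L) binary ON/OFF states of all RIS elements.
    xi          : inefficiency factor scaling the transmit power, eq. (19).
    P_RIS_elem  : per-element RIS power P_RIS(B^R).
    """
    total_power = (xi * np.sum(P_rf_chains) + K * P_D + M * P_AP
                   + np.sum(omega) * P_RIS_elem)   # denominator of (21)
    return np.sum(rates) / total_power             # sum rate over total power
```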

II-H Reliability Model

Suppose the THz MIMO-NOMA network maintains a traffic buffer queue $q_{nk}^{m}(t)$ for each user, $\forall t\in\mathcal{T}$, which aims to deliver a specific data volume within a predefined time. At the beginning of each time slot $t$, the queue length $q_{nk}^{m}(t)$ of user $U_{nk}^{m}$ is updated by

$q_{nk}^{m}(t+1)=A_{nk}^{m}(t)+\left[q_{nk}^{m}(t)-R_{nk}^{m}(t)\right]^{+}, \qquad (22)$

where $R_{nk}^{m}(t)$ is the transmission data rate in (18), $A_{nk}^{m}(t)$ denotes the traffic arrivals at time slot $t$, which follow a Poisson distribution, and $\left[x\right]^{+}=\max\{x,0\}$.
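The slot-by-slot queue dynamics of (22) can be simulated in a few lines (a sketch; the Poisson arrival rate is an assumed parameter):

```python
import numpy as np

def queue_update(q, R, arrival_rate, rng):
    """One-step traffic queue evolution of (22).

    q : (K,) current queue lengths, R : (K,) data served in the slot,
    arrival_rate : mean of the Poisson arrivals A_nk^m(t).
    """
    A = rng.poisson(arrival_rate, size=q.shape)  # random traffic arrivals
    return A + np.maximum(q - R, 0.0)            # A + [q - R]^+

rng = np.random.default_rng(0)
q = queue_update(np.array([5.0, 2.0]), np.array([3.0, 4.0]), 2.0, rng)
```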

To ensure reliable transmissions, we should guarantee the queue stability and keep outages below a predefined threshold [22]. The queue stability is ensured by

$\mathbb{E}\left[q_{nk}^{m}(t)\right]=\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}q_{nk}^{m}(t)<\infty,\ \forall k\in\mathcal{K}. \qquad (23)$

Moreover, the outage probability, i.e., the probability that the traffic queue length $q_{nk}^{m}(t)$ of user $U_{nk}^{m}$ exceeds the threshold $q_{nk}^{m,\max}$, is limited by a certain threshold $\epsilon_{nk}^{m}$. Therefore, the reliability conditions can be formulated as

$\mathrm{Pr}\left(q_{nk}^{m}(t)\geq q_{nk}^{m,\max}\right)\leq\epsilon_{nk}^{m},\ \forall k\in\mathcal{K}. \qquad (24)$

III Lyapunov Optimization Based Dec-POMDP Problem

In this section, we first formulate the constrained Dec-POMDP problem, and then transform it into a standard Dec-POMDP problem based on the Lyapunov optimization theory.

III-A Constrained Dec-POMDP Model

We aim to dynamically find a stationary online policy $\pi$ that jointly optimizes the dynamic RIS element selection, coordinated discrete phase-shift control, and transmit power allocation by observing the environment state at each time slot $t$. Under the given user clustering, NOMA decoding, and hybrid beamforming schemes, the joint policy dynamically maximizes the expected system energy efficiency of the smart reconfigurable THz MIMO-NOMA network, while satisfying the users' various QoS requirements in terms of data rate and transmission reliability. Therefore, the online objective function can be formulated as follows

$\mathcal{P}_{0}:\ \max_{\mathbf{a}(t)}\ \lim_{T\to\infty}\sum_{t=0}^{T}\mathbb{E}\left[\eta(t)\right], \qquad (25a)$

$\mathrm{s.t.}\ R_{nk}^{m}(t)\geq R_{nk}^{m,\min},\ \forall k\in\mathcal{K},\ t\in\mathcal{T}, \qquad (25b)$

$\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}p_{nk}^{m}(t)\leq P_{\mathrm{AP}}^{\max},\ \forall m\in\mathcal{M},\ t\in\mathcal{T}, \qquad (25c)$

$\omega_{l}^{j}(t)\in\{0,1\},\ \forall j\in\mathcal{J},\ 1\leq l\leq L,\ t\in\mathcal{T}, \qquad (25d)$

$\beta_{l}^{j}(t)\in\{0,1,\ldots,2^{B^{\mathrm{R}}}-1\},\ \forall j\in\mathcal{J},\ 1\leq l\leq L,\ t\in\mathcal{T}, \qquad (25e)$

$(23),\ (24), \qquad (25f)$

where $\mathbf{a}(t)=\left[\bm{\omega}(t),\bm{\beta}(t),\mathbf{p}(t)\right]$ collects the RIS element selection, discrete phase-shift, and power allocation variables. Constraint (25b) ensures the minimum data rate of each user, (25c) restricts the total transmit power to the maximum power $P_{\mathrm{AP}}^{\max}$, (25d) signifies the binary ON/OFF states of the RIS elements, and constraint (25e) represents the discrete phase-shift control. Note that the SIC constraint (8) is omitted since it is already implied in the SINR expression (16) via the indicator $\mathbbm{1}_{nk}^{m,\mathrm{fail}}$.

This online policy can be modeled as a constrained Markov decision process (MDP). However, solving $\mathcal{P}_{0}$ in a centralized way is computationally inefficient due to the large-scale joint state-action space and the heavy overhead of exchanging high-dimensional information from multiple APs and RISs to a centralized controller. To tackle $\mathcal{P}_{0}$ in an efficient and low-complexity manner, we model the long-term energy efficiency optimization problem as a constrained Dec-POMDP. In detail, the POMDP provides a generalized framework to describe an MDP with incomplete information, while the Dec-POMDP extends the POMDP to decentralized settings as $\mathcal{P}=\left(\mathcal{I},\mathcal{S},\Omega,\mathcal{A},P_{\mathbf{s}\to\mathbf{s}^{\prime}}^{\mathbf{a}},\pi,r,\Gamma\right)$. Here, $\mathcal{I}$ denotes the set of agents, and $\mathcal{S}$, $\Omega$, and $\mathcal{A}$ denote the state, observation, and action spaces. At each slot $t$, given the true state $\mathbf{s}(t)\in\mathcal{S}$, each agent $i$ partially observes a local observation, based on which it determines a local action using the decentralized policy $\pi$. The joint observation and action vectors of all agents are denoted by $\mathbf{\Omega}(t)\in\Omega$ and $\mathbf{a}(t)\in\mathcal{A}$, and the decentralized policy $\pi(\mathbf{a}(t)|\mathbf{s}(t))$ gives the probability of taking joint action $\mathbf{a}(t)$ in joint state $\mathbf{s}(t)$. Furthermore, $P_{\mathbf{s}\to\mathbf{s}^{\prime}}^{\mathbf{a}}=\mathbb{P}(\mathbf{s}^{\prime}|\mathbf{s},\mathbf{a})$ is the probability of transferring to a new state $\mathbf{s}^{\prime}$ by taking action $\mathbf{a}$ in state $\mathbf{s}$. Given the global reward function $r\left(\mathbf{s}(t),\mathbf{a}(t)\right)$ and the discount factor $\Gamma$, the agents cooperate and coordinate to maximize the discounted return. In this work, we model the APs and RISs as two types of agents, whose set is denoted by $\mathcal{I}=\mathcal{I}^{0}\cup\mathcal{I}^{1}$. Let $\mathcal{I}^{u}$ denote the set of the $u$-th type agents, where $u\in\{0,1\}$ represents APs and RISs, respectively. We specify the formulated Dec-POMDP as follows.

III-A1 Observation

The joint observation can be expressed as $\mathbf{\Omega}(t)=\left[\mathbf{\Omega}_{1}^{0}(t),\ldots,\mathbf{\Omega}_{M}^{0}(t),\mathbf{\Omega}_{1}^{1}(t),\ldots,\mathbf{\Omega}_{J}^{1}(t)\right]$, where $\mathbf{\Omega}_{i}^{u}(t)$ is the local observation of agent $i\in\mathcal{I}^{u}$ that involves part of the environment information, given by

$\mathbf{\Omega}_{i}^{0}(t)=\left[\left\{\widetilde{\mathbf{H}}^{ii^{\prime}}(t)\right\}_{i^{\prime}\in\mathcal{N}_{i}^{0}\cup\{i\}},\mathbf{q}^{i}(t),\bm{\tau}_{i}^{0}(t)\right],\ \forall i\in\mathcal{I}^{0}, \qquad (26)$

$\mathbf{\Omega}_{i}^{1}(t)=\left[\left\{\mathbf{F}^{im}(t)\right\}_{m\in\mathcal{N}_{i}^{1}},\left\{\mathbf{G}^{mi}(t)\right\}_{m\in\mathcal{N}_{i}^{1}},\bm{\tau}_{i}^{1}(t)\right],\ \forall i\in\mathcal{I}^{1}, \qquad (27)$

where $\bm{\tau}_{i}^{u}(t)$ denotes the local action-observation history of agent $i$, and $\mathcal{N}_{i}^{0}$ and $\mathcal{N}_{i}^{1}$ denote the neighboring agent sets of AP $i\in\mathcal{I}^{0}$ and RIS $i\in\mathcal{I}^{1}$, respectively. Furthermore, we define $\widetilde{\mathbf{H}}^{ii^{\prime}}(t)=\left[\left\{\widetilde{\mathbf{h}}_{nk}^{ii^{\prime}}(t)\right\}_{k\in\mathcal{K}^{i^{\prime}}}\right]$ and $\mathbf{F}^{im}(t)=\left[\left\{\mathbf{f}_{nk}^{im}(t)\right\}_{k\in\mathcal{K}^{m}}\right]$ as the vectorized channels. Note that in our setting, the joint observation also represents the true environment state, i.e., $\mathbf{\Omega}(t)=\mathbf{s}(t)$.

III-A2 Action

In this work, we consider a long-term joint RIS element selection, coordinated discrete phase-shift control, and power allocation optimization problem. Therefore, we define the joint action vector as $\mathbf{a}(t)=[\mathbf{a}_{1}^{0}(t),\ldots,\mathbf{a}_{M}^{0}(t),\mathbf{a}_{1}^{1}(t),\ldots,\mathbf{a}_{J}^{1}(t)]$, where $\mathbf{a}_{i}^{u}(t)$ is the local action of agent $i\in\mathcal{I}^{u}$, which can be expressed as

$\mathbf{a}_{i}^{u}(t)=\begin{cases}\mathbf{p}^{i}(t),&\forall i\in\mathcal{I}^{0},\\ \left[\bm{\omega}^{i}(t),\bm{\beta}^{i}(t)\right],&\forall i\in\mathcal{I}^{1}.\end{cases} \qquad (28)$

Here, $\mathbf{p}^{i}(t)=\left[p_{11}^{i}(t),p_{12}^{i}(t),\ldots,p_{N_{\mathrm{R}}K_{N_{\mathrm{R}}}^{i}(t)}^{i}(t)\right]$ is the downlink power allocation vector of AP agent $i\in\mathcal{I}^{0}$, and $\bm{\omega}^{i}(t)=\left[\omega_{1}^{i}(t),\omega_{2}^{i}(t),\ldots,\omega_{L}^{i}(t)\right]$ and $\bm{\beta}^{i}(t)=\left[\beta_{1}^{i}(t),\beta_{2}^{i}(t),\ldots,\beta_{L}^{i}(t)\right]$ denote the dynamic reflecting element selection and the discrete phase-shift control of RIS agent $i\in\mathcal{I}^{1}$, respectively.

III-A3 Decentralized Policy

The joint policy is decomposed into multiple decentralized and parallel local policies $\pi(\mathbf{a}_{i}^{u}(t)|\mathbf{\Omega}_{i}^{u}(t))$ for each agent $i\in\mathcal{I}^{u}$. At each time slot $t$, all agents take local actions $\mathbf{a}_{i}^{u}(t)$ based on their local observations $\mathbf{\Omega}_{i}^{u}(t)$ to cooperatively maximize the global reward.

III-A4 Global Reward

Let $\delta^{m}(t)=\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}\left[R_{nk}^{m,\min}(t)-R_{nk}^{m}(t)\right]^{+}$ denote the data rate constraint violation. Then, the global reward can be defined as $r\left(\mathbf{s}(t),\mathbf{a}(t)\right)=\eta(t)-\xi\delta(t)$, which maximizes the network energy efficiency while encouraging the satisfaction of the data rate requirements. Here, $\delta(t)=\sum_{m=1}^{M}\delta^{m}(t)$, and $\xi$ is a non-negative parameter that imposes a penalty on data rate violations. However, due to the intractable long-term average queue stability and reliability constraints, it is infeasible to directly use general MADRL methods to solve the constrained Dec-POMDP problem [24, 25]. Hence, we recast the constrained Dec-POMDP into a standard Dec-POMDP based on the Lyapunov optimization theory [26].

III-B Equivalently Transferred Dec-POMDP

Based on Markov's inequality [27], we can obtain the upper bound of the outage probability as $\mathrm{Pr}\left(q_{nk}^{m}(t)\geq q_{nk}^{m,\max}\right)\leq\mathbb{E}\left[q_{nk}^{m}(t)\right]/q_{nk}^{m,\max}$. Therefore, the reliability constraint (24) can be strictly guaranteed by ensuring

$\mathbb{E}\left[q_{nk}^{m}(t)\right]\leq q_{nk}^{m,\max}\epsilon_{nk}^{m},\ \forall k\in\mathcal{K}. \qquad (29)$

To tackle the long-term average constraints (23) and (29), we apply the virtual queue technique and the Lyapunov optimization theory [26] to recast the intractable constrained Dec-POMDP as a standard Dec-POMDP, which can be solved by general MADRL algorithms. To meet the reliability constraint (29), we construct a virtual queue $Y_{nk}^{m}(t)$ to model the instantaneous deviation from the queue length constraint of each user $U_{nk}^{m}$. Here, $Y_{nk}^{m}(t)$ evolves as follows

$Y_{nk}^{m}(t+1)=\left[Y_{nk}^{m}(t)+\left(q_{nk}^{m}(t+1)-q_{nk}^{m,\max}\epsilon_{nk}^{m}\right)\right]^{+}, \qquad (30)$

where we initialize $Y_{nk}^{m}(1)=0$, $\forall k\in\mathcal{K}$.
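A short sketch of the virtual queue recursion (30), together with the per-user Lyapunov weight $\Lambda_{nk}^{m}(t)=Y_{nk}^{m}(t)+2q_{nk}^{m}(t)$ that appears later in (31)-(35) (hypothetical helper names):

```python
import numpy as np

def virtual_queue_update(Y, q_next, q_max, eps):
    """Virtual queue evolution of (30), tracking reliability deviations."""
    return np.maximum(Y + (q_next - q_max * eps), 0.0)

def lyapunov_weight(Y, q):
    """Per-user weight Lambda_nk^m(t) = Y + 2q used in (31)-(35)."""
    return Y + 2.0 * q
```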

Let $\bm{\Xi}_{nk}^{m}(t)=\left[q_{nk}^{m}(t),Y_{nk}^{m}(t)\right]$, $\forall k\in\mathcal{K}$, and define $\bm{\Xi}(t)=\left[\bm{\Xi}_{11}^{1}(t),\bm{\Xi}_{12}^{1}(t),\ldots,\bm{\Xi}_{N_{\mathrm{R}}K^{M}}^{M}(t)\right]$ as the combined queue vector. Then, we introduce the Lyapunov function $\mathcal{L}(\bm{\Xi}(t))=\frac{1}{2}\bm{\Xi}(t)\bm{\Xi}^{T}(t)$ to measure the satisfaction status of the reliability constraint, which should be maintained at a low value. Moreover, the one-step Lyapunov drift is defined as $\Delta\mathcal{L}(t)=\mathcal{L}(\bm{\Xi}(t+1))-\mathcal{L}(\bm{\Xi}(t))$, which can be upper bounded by the following lemma.

Lemma 1.

Define $\Delta\mathcal{L}_{nk}^{m}(t)=\mathcal{L}(\bm{\Xi}_{nk}^{m}(t+1))-\mathcal{L}(\bm{\Xi}_{nk}^{m}(t))$. Then $\Delta\mathcal{L}_{nk}^{m}(t)$ can be upper bounded by

$\Delta\mathcal{L}_{nk}^{m}(t)\leq C_{nk}^{m}+B_{nk}^{m}(t)+\Lambda_{nk}^{m}(t)\left(A_{nk}^{m}(t)-R_{nk}^{m}(t)\right), \qquad (31)$

where $C_{nk}^{m}=2C_{nk}^{m,\mathrm{q}}+C_{nk}^{m,\mathrm{Y}}$ is a constant, with $C_{nk}^{m,\mathrm{q}}=\frac{1}{2}\left(A_{nk}^{m,\max}\right)^{2}+\frac{1}{2}\left(R_{nk}^{m,\max}\right)^{2}$ and $C_{nk}^{m,\mathrm{Y}}=\frac{1}{2}\left(q_{nk}^{m,\max}\epsilon_{nk}^{m}\right)^{2}$. Moreover, $B_{nk}^{m}(t)=\frac{1}{2}\left(q_{nk}^{m}(t)\right)^{2}+Y_{nk}^{m}(t)\left[A_{nk}^{m}(t)+q_{nk}^{m}(t)\right]$ and $\Lambda_{nk}^{m}(t)=Y_{nk}^{m}(t)+2q_{nk}^{m}(t)$ are fixed values at each time slot.

Proof.

See Appendix A. ∎

Based on the Lyapunov optimization, the original constrained Dec-POMDP of maximizing the system energy efficiency while ensuring the long-term reliability constraints can be transformed into minimizing the following drift-minus-bonus function

$\Delta\mathcal{L}(t)-\zeta\eta(t)+\xi\delta(t)\leq\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}\left[C_{nk}^{m}+B_{nk}^{m}(t)+\Lambda_{nk}^{m}(t)\left(A_{nk}^{m}(t)-R_{nk}^{m}(t)\right)\right]-\zeta\eta(t)+\xi\delta(t), \qquad (32)$

where $\zeta$ denotes the positive coefficient that controls the tradeoff between energy efficiency and transmission reliability, and the inequality follows from Lemma 1 and $\Delta\mathcal{L}(t)=\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}\Delta\mathcal{L}_{nk}^{m}(t)$. By ignoring the terms independent of the control variables, the transformed problem $\mathcal{P}_{1}$ can be rearranged as

$\mathcal{P}_{1}:\ \max_{\pi}\ \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi}\left[\zeta\eta(t)-\xi\delta(t)+\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}\Lambda_{nk}^{m}(t)R_{nk}^{m}(t)\right], \qquad (33a)$

$\mathrm{s.t.}\ (25c)-(25e). \qquad (33b)$

Based on (33), the local observation of AP $i\in\mathcal{I}^{0}$ in (26) can be transformed into

$\mathbf{\Omega}_{i}^{0}(t)=\left[\left\{\widetilde{\mathbf{H}}^{ii^{\prime}}(t)\right\}_{i^{\prime}\in\mathcal{N}_{i}^{0}\cup\{i\}},\bm{\Lambda}^{i}(t),\bm{\tau}_{i}^{0}(t)\right],\ \forall i\in\mathcal{I}^{0}, \qquad (34)$

where $\bm{\Lambda}^{i}(t)=\left[\Lambda_{11}^{i}(t),\Lambda_{12}^{i}(t),\ldots,\Lambda_{N_{\mathrm{R}}K_{N_{\mathrm{R}}}^{i}(t)}^{i}(t)\right]$. From (33), the global reward function $r\left(\mathbf{s}(t),\mathbf{a}(t)\right)$ of the equivalent Dec-POMDP is modified into

$r\left(\mathbf{s}(t),\mathbf{a}(t)\right)=\zeta\eta(t)-\xi\delta(t)+\sum_{m=1}^{M}\sum_{n=1}^{N_{\mathrm{R}}}\sum_{k=1}^{K_{n}^{m}(t)}\Lambda_{nk}^{m}(t)R_{nk}^{m}(t). \qquad (35)$
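Putting the pieces together, the transformed reward of (35) can be computed per slot as in the sketch below (hypothetical helper; `Lam` and `rates` are flattened over all users):

```python
import numpy as np

def global_reward(eta, delta, Lam, rates, zeta, xi):
    """Transformed global reward of (35).

    eta   : network energy efficiency of (21).
    delta : total data rate violation delta(t) of Section III-A4.
    Lam   : per-user Lyapunov weights Lambda_nk^m(t) from (31).
    rates : per-user achieved rates R_nk^m(t) of (18).
    zeta, xi : reliability/efficiency and rate-penalty coefficients.
    """
    return zeta * eta - xi * delta + float(np.dot(Lam, rates))
```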

IV GE-VDAC Algorithm for Distributed RIS Configuration and Power Allocation

IV-A Review of MADRL Methods

The standard Dec-POMDP problem can be solved by general cooperative MADRL algorithms [28],[29]. By exploiting the remarkable representation ability of deep neural networks (DNNs) for approximating arbitrary non-convex functions, MADRL algorithms can learn the mapping from agents' observations to actions through exploitation and exploration. However, MADRL algorithms commonly suffer from partial observability, environment instability, and credit assignment problems [28]. To cope with environment instability, the multi-actor-critic framework based on the centralized-training and distributed-execution (CTDE) design was proposed [29], which learns distributed policies with a centralized critic. However, since it utilizes a unified critic value to compute the local policy gradients, agents cannot fairly learn their contributions to the global optimization, leading to the so-called multi-agent credit assignment problem [28], [30], also referred to as the indolent agent problem. Therefore, the idea of difference rewards has been introduced, which computes an individual reward for each agent to evaluate its contribution and motivate coordination. QMIX [31] is a widely utilized mechanism that decomposes the joint state-action value into monotonic individual state-action functions for different agents. To overcome the monotonicity limitation of QMIX, QTRAN, with a higher generalization of value factorization, can be leveraged [32]. Nevertheless, both QMIX and QTRAN, as extensions of deep Q-learning, are mainly used to deal with problems with discrete action spaces. In [33], the value-decomposition actor-critic (VDAC) was introduced to incorporate value decomposition into the advantage actor-critic (A2C) framework, which supports continuous action spaces and achieves higher sample efficiency.

In our proposed framework, it is critical to facilitate coordination among multiple APs and RISs to mitigate interference, improve the system energy efficiency, and guarantee the diversified QoS requirements. Under the realistic partially observable environment, although the joint design of RIS configuration and power allocation has been decomposed into distributed local policies, information still needs to be exchanged among neighboring agents to achieve full coordination. However, considering the high-dimensional CSI resulting from the massive MIMO structure, directly exchanging information among interacting neighboring agents during each execution causes high communication overhead and latency. Therefore, the existing MADRL algorithms remain inefficient for the highly coupled Dec-POMDP problem. In this section, we propose GE-VDAC, a novel MADRL algorithm that can tackle the multi-agent credit assignment problem as well as the above issues.

IV-B The Proposed GE-VDAC Framework

GE-VDAC extends the commonly-used CTDE design of existing MADRL algorithms, and realizes more efficient cooperative learning by integrating two techniques, i.e., graph embedding and difference rewards. The interplay among interacting agents is modeled as a directed communication graph. Instead of directly exchanging high-dimensional information, neighboring agents exchange low-dimensional embedded features learned by graph embedding, thus requiring less information exchange to achieve efficient coordination in the partially observable environment. Moreover, the graph-embedded features are permutation-invariant. By learning distributed DRL policies over the graph-embedded observation space, which has reduced dimension and enjoys permutation invariance, both the learning speed and the generalization ability can be improved.

Figure 3: Procedure of the proposed GE-VDAC algorithm.

As shown in Fig. 3, GE-VDAC jointly trains three neural networks, i.e., the distributed graph-embedded actors, the distributed critics, and the centralized mixing network. For simplicity, we assume the agents belonging to the same type $u$ share the neural network parameters of both the distributed actor $\pi^{u}$ and the critic $V^{u}$. The distributed graph-embedded actor contains a semi-distributed graph embedding module that learns graph-embedded features for efficient information exchange, in conjunction with a fully-distributed action generation module that predicts the local action based on the local embedded state attained by the graph embedding module. The graph embedding module is implemented as a message passing graph neural network (MPGNN) [34],[35]. Moreover, the action generation module mainly comprises a gated recurrent unit (GRU) [36], a simplified variant of long short-term memory (LSTM) that achieves comparable performance [37], with one fully-connected (FC) layer before and one after the GRU. On the other hand, the distributed critic evaluates an individual value of the local embedded state, which yields a difference reward judging each agent's contribution to the global optimization. Meanwhile, a global mixing network is trained to combine the distributed critics, which guarantees that an equivalent policy improvement can be achieved based on the value decomposition.
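A minimal PyTorch sketch of the fully-distributed action generation module is shown below (an FC-GRU-FC stack; all layer sizes are assumptions, and the policy-head details of the actual GE-VDAC actor are abstracted into a single output layer):

```python
import torch
import torch.nn as nn

class ActionGenerationModule(nn.Module):
    """FC-GRU-FC action head of the graph-embedded actor (a sketch).

    Maps the local embedded state z~_i^u(t) produced by the graph
    embedding module to local policy parameters, while the GRU hidden
    state carries the action-observation history tau_i^u(t).
    """
    def __init__(self, embed_dim, hidden_dim, action_dim):
        super().__init__()
        self.fc_in = nn.Linear(embed_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, action_dim)

    def forward(self, z_local, h_prev):
        # z_local: (batch, embed_dim), h_prev: (batch, hidden_dim).
        x = torch.relu(self.fc_in(z_local))
        h = self.gru(x, h_prev)          # recurrent action-observation history
        return self.fc_out(h), h         # local policy parameters + new hidden
```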

IV-C Graph-Embedded Actor

IV-C1 Graph Modeling

To characterize the interplay among neighboring agents, we model the THz reconfigurable massive MIMO-NOMA network as a directed communication graph $\mathcal{G}=\left[\mathcal{I},\mathcal{E},\mathcal{T}_{\mathrm{N}},\mathcal{T}_{\mathrm{E}}\right]$, where the agents are modeled as two types of heterogeneous nodes $\mathcal{I}$, and the interplay among agents is modeled as edges $\mathcal{E}$. $\mathcal{T}_{\mathrm{N}}:\mathcal{I}\rightarrow\mathbb{C}^{d_{\mathrm{N}}^{u}}$ maps a node $i$ to its $d_{\mathrm{N}}^{u}$-dimensional node feature $\mathbf{o}_{i}^{u}(t)$. Denote the tuple $(i,i^{\prime})$ as a directed edge from source node $i$ to destination node $i^{\prime}$. $\mathcal{T}_{\mathrm{E}}:\mathcal{E}\rightarrow\mathbb{C}^{d_{\mathrm{E}}^{ii^{\prime}}}$ maps an edge $(i,i^{\prime})$ to its $d_{\mathrm{E}}^{ii^{\prime}}$-dimensional edge feature $\mathbf{d}_{ii^{\prime}}(t)$. Both $\mathbf{o}_{i}^{u}(t)$ and $\mathbf{d}_{ii^{\prime}}(t)$ are extracted from the local observation of node $i\in\mathcal{I}^{u}$.

The node feature of an AP node $i\in\mathcal{M}$ contains the spatial channel information from AP $i$ to its associated users, the queue information of its associated users, and the local action-observation history of AP $i$, defined as

$\mathbf{o}_{i}^{0}(t)=\left[\mathrm{vec}\left(\widetilde{\mathbf{H}}^{ii}(t)\right),\bm{\Lambda}^{i}(t),\bm{\tau}_{i}^{0}(t)\right],\ \forall i\in\mathcal{I}^{0}.$ (36)

Moreover, the node feature of a RIS node $i\in\mathcal{I}^{1}$ consists of the local action-observation history, i.e., $\mathbf{o}_{i}^{1}(t)=\bm{\tau}_{i}^{1}(t)$.

The edge feature $\mathbf{d}_{ii^{\prime}}(t)$ depicts the interplay effect of agent $i$ on agent $i^{\prime}$, which can be mathematically denoted as

$\mathbf{d}_{ii^{\prime}}(t)=\begin{cases}\left[\left\{\mathrm{vec}\left(\widetilde{\mathbf{H}}^{im}(t)\right)\right\}_{m\in\mathcal{N}_{1}^{i^{\prime}}}\right],&i\in\mathcal{I}^{0},\ i^{\prime}\in\mathcal{I}^{1},\text{ and }i^{\prime}\in\mathcal{N}_{i}^{0},\\ \left[\mathrm{vec}\left(\mathbf{G}^{ii^{\prime}}(t)\right),\mathrm{vec}\left(\mathbf{F}^{ii^{\prime}}(t)\right)\right],&i\in\mathcal{I}^{1},\ i^{\prime}\in\mathcal{I}^{0},\text{ and }i^{\prime}\in\mathcal{N}_{i}^{1},\\ \mathrm{vec}\left(\widetilde{\mathbf{H}}^{ii^{\prime}}(t)\right),&i,i^{\prime}\in\mathcal{I}^{0},\text{ and }i^{\prime}\in\mathcal{N}_{i}^{0},\\ \mathbf{0},&\text{otherwise}.\end{cases}$ (37)

Specifically, for an AP-RIS link, the edge feature comprises the spatial channels from AP $i$ to the users served by the neighboring APs of RIS $i^{\prime}$. For an AP-AP link, the edge feature denotes the channel from AP $i$ to the users associated with AP $i^{\prime}$. For a RIS-AP link, the edge feature includes the channels between RIS $i$ and both AP $i^{\prime}$ and its associated users.

To learn a structured representation of the THz reconfigurable massive MIMO-NOMA network, we integrate the MPGNN [35] into the distributed actor/critic networks to embed the graph. Hence, the node/edge features of the directed graph over the high-dimensional joint state space can be embedded into permutation-invariant features over a low-dimensional space. By exchanging the learned low-dimensional and permutation-invariant embedded features among neighboring agents, coordination among APs and RISs can be enhanced and generalization improved, while requiring only a low information exchange overhead.
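As a toy illustration of this graph construction, the Python snippet below enumerates the typed directed edges corresponding to the three non-trivial cases of (37) for a small deployment; the node indices and counts are assumptions made only for the sketch.

```python
ap_nodes = [0, 1, 2]        # AP agents (type u = 0), M = 3
ris_nodes = [3, 4]          # RIS agents (type u = 1), J = 2
edges = []                  # directed edges (source, destination, type)
for i in ap_nodes:
    for j in ris_nodes:
        edges.append((i, j, "AP->RIS"))     # channels to users of the RIS's neighbor APs
        edges.append((j, i, "RIS->AP"))     # RIS-AP and RIS-user channels
    for k in ap_nodes:
        if k != i:
            edges.append((i, k, "AP->AP"))  # cross-AP user channels
print(f"{len(edges)} directed edges")       # 2*M*J + M*(M-1) = 18
```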

IV-C2 Graph-Embedded Actor

The graph-embedded actor (policy) comprises a semi-distributed graph embedding module that learns the embedding of the graph, followed by a distributed action generation module that outputs an action by taking the local embedded state as input. Here, we define $\pi_{\theta_{\mathrm{G}}}$ as the graph embedding sub-policy and $\pi_{\theta_{\mathrm{A}}}^{u}$, $\forall u$, as the action generation sub-policies, which are parameterized as MPGNNs and GRUs, respectively. Let $\bm{\theta}_{\mathrm{G}}^{u}$ and $\bm{\theta}_{\mathrm{A}}^{u}$ denote the MPGNN and GRU weight parameters shared among agents $i\in\mathcal{I}^{u}$, respectively.

We maintain an MPGNN at each distributed node $i\in\mathcal{I}^{u}$. Similar to the multi-layer perceptron (MLP), the MPGNN exploits a layer-wise structure. In each MPGNN layer, each agent first transmits embedded information to its neighbor agents, and then aggregates the embedded information received from its neighbors and updates its local hidden state.

Define the edges $(i,i^{\prime})$ and $(i^{\prime},i)$ as the outbound and inbound edges of node $i$, respectively. Let $\mathcal{N}_{i+}^{u}$ and $\mathcal{N}_{i-}^{u}$ denote the sets of neighbor agents linked with agent $i$ through its outbound and inbound edges, respectively. In layer $n$, each agent $i\in\mathcal{I}^{u}$ locally embeds its node feature and the outbound edge feature $\mathbf{d}_{ii^{\prime}}(t)$ as

$\mathbf{e}_{ii^{\prime}}^{(n)}(t)=\psi_{u}^{(n)}\left(\mathbf{z}_{ui}^{(n-1)}(t),\mathbf{d}_{ii^{\prime}}(t)\right),\ \forall i^{\prime}\in\mathcal{N}_{i+}^{u},i\in\mathcal{I}^{u},$ (38)

where $\mathbf{z}_{ui}^{(n-1)}(t)$ denotes the hidden state of MPGNN layer $n-1$ at agent $i\in\mathcal{I}^{u}$, and $\psi_{u}(\cdot)$ is the distributed embedding function. Thereafter, each agent $i$ transmits the outbound embedded feature $\mathbf{e}_{ii^{\prime}}^{(n)}(t)$ to its neighbor agents $i^{\prime}\in\mathcal{N}_{i+}^{u}$, and receives the inbound embedded feature $\mathbf{e}_{i^{\prime}i}^{(n)}(t)$ from $i^{\prime}\in\mathcal{N}_{i-}^{u}$. The received embedded features are then aggregated with an aggregation function $\phi_{u}(\cdot)$, which is a permutation-invariant function such as $\mathrm{sum}(\cdot)$, $\mathrm{mean}(\cdot)$, or $\max(\cdot)$. By combining the local hidden state and the aggregated features using the combining function $\Psi_{u}(\cdot)$, agent $i$ obtains the hidden state $\mathbf{z}_{ui}^{(n)}(t)$ of layer $n$ as

$\mathbf{z}_{ui}^{(n)}(t)=\Psi_{u}^{(n)}\left(\mathbf{z}_{ui}^{(n-1)}(t),\phi_{u}^{(n)}\left(\left\{\mathbf{e}_{i^{\prime}i}^{(n)}(t)\right\}_{i^{\prime}\in\mathcal{N}_{i-}^{u}}\right)\right),\ \forall i\in\mathcal{I}^{u}.$ (39)

Combining (38) and (39), the update rule of the $n$-th MPGNN layer at node $i\in\mathcal{I}^{u}$ can be rewritten as

$\mathbf{z}_{ui}^{(n)}(t)=\Psi_{u}^{(n)}\left(\mathbf{z}_{ui}^{(n-1)}(t),\phi_{u}^{(n)}\left(\left\{\psi_{u}^{(n)}\left(\mathbf{z}_{ui^{\prime}}^{(n-1)}(t),\mathbf{d}_{i^{\prime}i}(t)\right)\right\}_{i^{\prime}\in\mathcal{N}_{i-}^{u}}\right)\right),$ (40)

where $\mathbf{z}_{ui}^{(n)}(t)$ is the hidden/output state at the $n$-th layer for agent $i\in\mathcal{I}^{u}$, and $\mathbf{z}_{ui}^{(0)}(t)=\mathbf{o}_{i}^{u}(t)$.
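The update rule (38)-(40) can be sketched in PyTorch as a single MPGNN layer with a mean aggregator over directed edges; the two-layer MLP parameterizations of $\psi_u$ and $\Psi_u$ match the complexity analysis of Section IV-E3, while all dimensions and identifiers are our assumptions.

```python
import torch
import torch.nn as nn

class MPGNNLayer(nn.Module):
    def __init__(self, z_dim, edge_dim, msg_dim, out_dim):
        super().__init__()
        # psi: two-layer MLP embedding (hidden state, edge feature), cf. (38).
        self.psi = nn.Sequential(
            nn.Linear(z_dim + edge_dim, msg_dim), nn.ReLU(),
            nn.Linear(msg_dim, msg_dim))
        # Psi: two-layer MLP combining the local state and the aggregate, cf. (39).
        self.Psi = nn.Sequential(
            nn.Linear(z_dim + msg_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim))

    def forward(self, z, edge_feat, src, dst):
        # z: [n_agents, z_dim]; edge_feat: [n_edges, edge_dim];
        # src/dst: [n_edges] endpoint indices of each directed edge (i, i').
        msgs = self.psi(torch.cat([z[src], edge_feat], dim=-1))   # (38)
        agg = torch.zeros(z.size(0), msgs.size(-1))
        agg.index_add_(0, dst, msgs)                              # sum inbound messages
        deg = torch.bincount(dst, minlength=z.size(0)).clamp(min=1)
        agg = agg / deg.unsqueeze(-1)                             # mean(): permutation invariant
        return self.Psi(torch.cat([z, agg], dim=-1))              # (39)
```

Stacking $N$ such layers and initializing with $\mathbf{z}_{ui}^{(0)}(t)=\mathbf{o}_{i}^{u}(t)$ reproduces the recursion in (40).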

After the graph embedding module, each agent $i\in\mathcal{I}^{u}$ predicts its local action with the GRU based on the output local embedded state $\widetilde{\mathbf{z}}_{i}^{u}(t)$, which is given by

$\widetilde{\mathbf{z}}_{i}^{u}(t)=\left[\mathbf{o}_{i}^{u}(t),\mathbf{z}_{ui}^{(N)}(t)\right].$ (41)

The local action $\mathbf{a}_{i}^{u}(t)$ of each agent $i\in\mathcal{I}^{u}$ is sampled from the action generation sub-policy $\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}(t)|\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$.

Algorithm 1 Coordinated Multi-AP and Multi-RIS Scheduling Based on Distributed Graph-Embedded Actors
Input: Distributed actors $\pi_{\theta}^{u},\forall u\in\{0,1\}$.
1: Initialize $t=1$, $q_{nk}^{m}(t)=0$, $Y_{nk}^{m}(t)=0$, $\forall k\in\mathcal{K}$;
2: for $t=1,2,\ldots,T$ do
3:  Each agent obtains its local observation as in (34) and (27);
4:  for each MPGNN layer $n\in\{1,2,\ldots,N\}$ do
5:   Each agent $i\in\mathcal{I}^{u}$ calculates the local feature embeddings $\mathbf{e}_{ii^{\prime}}^{(n)}(t),\forall i^{\prime}\in\mathcal{N}_{i+}^{u}$, using (38);
6:   Each agent $i\in\mathcal{I}^{u}$ aggregates the feature embeddings $\mathbf{e}_{i^{\prime}i}^{(n)}(t)$ from its neighbor agents $i^{\prime}\in\mathcal{N}_{i-}^{u}$, and updates the hidden state $\mathbf{z}_{ui}^{(n)}(t)$ based on (39);
7:  end for
8:  Each agent $i\in\mathcal{I}^{u}$ samples its local action $\mathbf{a}_{i}^{u}(t)$ according to $\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}(t)|\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$;
9:  Each AP $m$ calculates the QoS-based user clustering, obtains the hybrid beamformers $\mathbf{V}^{m}(t)$ and $\mathbf{W}^{m}(t)$, and decides the decoding order of the IoT users in each user cluster;
10:  Observe the global reward $r\left(\mathbf{s}(t),\mathbf{a}(t)\right)$, and calculate $q_{nk}^{m}(t+1)$ and $Y_{nk}^{m}(t+1)$ according to (23) and (30);
11: end for
Output: $\left\{\mathbf{s}(t),\widetilde{\mathbf{z}}(t),\mathbf{a}(t),r\left(\mathbf{a}(t),\mathbf{s}(t)\right),\mathbf{s}(t+1)\right\}_{t\in\{1,2,\ldots,T\}}$

The coordinated scheduling of APs and RISs based on the distributed graph-embedded actors is summarized in Algorithm 1. At each time slot $t$, the APs and RISs exchange embedded features and obtain their local embedded states $\widetilde{\mathbf{z}}_{i}^{u}$ through the MPGNN. Thereafter, each agent $i\in\mathcal{I}^{u}$ samples its local action from the action generation sub-policy $\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}(t)|\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$ of the local actor. By observing the reconfigured channels, each AP further decides the user clustering, hybrid beamforming, and IoT user decoding order. In this way, the customized user clustering, NOMA decoding, and hybrid beamforming schemes are treated as part of the environment feedback, which is implicitly learned by the distributed actors through exploration and exploitation. The experiences are then added to the replay memory buffer $\mathcal{B}$ for centralized learning.

IV-C3 Policy Improvement

Denote the combined parameters of the graph embedding module and the action generation module in the distributed actor (policy) as $\bm{\theta}_{\pi}^{u}=\left[\bm{\theta}_{\mathrm{G}}^{u},\bm{\theta}_{\mathrm{A}}^{u}\right]$. The distributed actors $\bm{\theta}_{\pi}=[\{\bm{\theta}_{\pi}^{u}\}_{u\in\{0,1\}}]$ are trained to maximize the following performance function:

$L_{\pi}(\bm{\theta}_{\pi}^{u})=\mathbb{E}_{\mathbf{s}\sim p^{\pi},\mathbf{a}\sim\pi}\left[r(\mathbf{s}(t),\mathbf{a}(t))\right],$ (42)

where $p^{\pi}$ is the state distribution induced by following the joint policy $\pi$. We calculate the policy gradient based on the advantage function, which is given by

$\Delta\bm{\theta}_{\pi}^{u}(t)=\sum_{i}\nabla_{\bm{\theta}_{\pi}^{u}}\log\pi^{u}\left(\mathbf{a}_{i}^{u}(t)|\hat{\mathbf{s}}_{i}^{u}(t)\right)A\left(\mathbf{s}(t),\mathbf{a}(t)\right),$ (43)

where $\hat{\mathbf{s}}_{i}^{u}(t)$ is the actual input of the graph-embedded actor. Moreover, $A\left(\mathbf{s}(t),\mathbf{a}(t)\right)$ denotes the temporal difference (TD) advantage, which is given by

$A\left(\mathbf{s}(t),\mathbf{a}(t)\right)=Q(\mathbf{s}(t),\mathbf{a}(t))-V_{\mathrm{tot}}(\mathbf{s}_{t})=r(\mathbf{s}(t),\mathbf{a}(t))+\Gamma V_{\mathrm{tot}}(\mathbf{s}_{t+1})-V_{\mathrm{tot}}(\mathbf{s}_{t}).$ (44)

Here, $V_{\mathrm{tot}}(\mathbf{s}_{t})$ and $Q(\mathbf{s}(t),\mathbf{a}(t))$ denote the global state value and the global state-action value, respectively.
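A minimal sketch of this policy-gradient step, assuming scalar global rewards over a trajectory of $T$ transitions, is given below; detaching the TD advantage (44) ensures that only the actors receive this gradient. All tensor shapes and names are our assumptions.

```python
import torch

def actor_loss(log_probs, rewards, v_tot, v_tot_next, gamma):
    """log_probs: [T, n_agents] log pi^u(a_i | .); rewards, v_tot, v_tot_next: [T]."""
    advantage = rewards + gamma * v_tot_next - v_tot          # TD advantage (44)
    # Summing the log-probabilities over agents mirrors the per-agent sum
    # in (43); the advantage is detached so the critic is not updated here.
    return -(log_probs.sum(dim=-1) * advantage.detach()).mean()
```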

IV-D Value Decomposition

To address the credit assignment problem during training, we rely on value-decomposition critics [33] to train the distributed actors, decomposing the global state value $V_{\mathrm{tot}}(\mathbf{s}(t))$ into local critics that are combined by a mixing function $f^{\mathrm{mix}}$ as

$V_{\mathrm{tot}}(\mathbf{s}(t))=f^{\mathrm{mix}}\Big(\mathbf{s}(t),V^{0}\big(\widetilde{\mathbf{z}}_{1}^{0}(t)\big),V^{0}\big(\widetilde{\mathbf{z}}_{2}^{0}(t)\big),\ldots,V^{0}\big(\widetilde{\mathbf{z}}_{M}^{0}(t)\big),V^{1}\big(\widetilde{\mathbf{z}}_{1}^{1}(t)\big),V^{1}\big(\widetilde{\mathbf{z}}_{2}^{1}(t)\big),\ldots,V^{1}\big(\widetilde{\mathbf{z}}_{J}^{1}(t)\big)\Big),$ (45)

where $V^{u}\left(\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$ is the local state value of agent $i\in\mathcal{I}^{u}$.

Now, we are ready to introduce the following lemma.

Lemma 2.

Any action $\mathbf{a}_{i}^{u}$ that improves agent $i$'s local critic value at the local embedded state $\widetilde{\mathbf{z}}_{i}^{u}$ also improves the global state value $V_{\mathrm{tot}}$, provided that the other agents stay at the same local embedded states taking actions $\mathbf{a}_{-i}^{u}$, and the following relationship holds:

$\frac{\partial V_{\mathrm{tot}}\left(\mathbf{s}(t)\right)}{\partial V^{u}\left(\widetilde{\mathbf{z}}_{i}^{u}(t)\right)}\geq 0,\ \forall i\in\mathcal{I}^{u},u\in\{0,1\}.$ (46)
Proof.

Based on (46), the global state value increases monotonically with $V^{u}\left(\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$ when the other agents stay at the same local embedded states taking actions $\mathbf{a}_{-i}^{u}(t)$. Therefore, if a local action $\mathbf{a}_{i}^{u}(t)$ improves $V^{u}\left(\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$, it also improves $V_{\mathrm{tot}}\left(\mathbf{s}(t)\right)$. ∎

Remark 1.

Based on Lemma 2, it is rational to approximately decompose the high-complexity global critic into distributed critics combined by a mixing network with non-negative weight parameters, thus reducing complexity and achieving difference rewards.

During centralized training, each agent attains a difference reward $V^{u}\left(\widetilde{\mathbf{z}}_{i}^{u}(t)\right)$ based on its local graph-embedded features to evaluate its contribution to the global reward improvement, which further facilitates the agents' coordination. The weights of the mixing network are generated by separate hypernetworks [38]. Each hypernetwork takes the joint state and state value as input and computes the weights of one layer of the mixing network [31]. To guarantee that the output weights are non-negative, the absolute activation function is employed in the hypernetworks.
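The following PyTorch-style sketch shows one way to realize such a hypernetwork-generated mixing network in the spirit of [31], [38]; the hidden width and the ELU nonlinearity are our assumptions, while the absolute activation enforces the non-negative weights required by (46).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Combines the local critics V^u(z~_i^u) into V_tot(s), cf. (45)."""

    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.n, self.h = n_agents, hidden
        # Hypernetworks: the joint state generates the mixing weights.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * hidden)
        self.hyper_b1 = nn.Linear(state_dim, hidden)
        self.hyper_w2 = nn.Linear(state_dim, hidden)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, local_values, state):
        # local_values: [B, n_agents]; state: [B, state_dim].
        # abs() keeps the mixing weights non-negative, so dV_tot/dV^u >= 0.
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n, self.h)
        b1 = self.hyper_b1(state).view(-1, 1, self.h)
        hidden = F.elu(local_values.unsqueeze(1) @ w1 + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.h, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (hidden @ w2 + b2).view(-1)    # V_tot(s), shape [B]
```

By construction, the partial derivative $\partial V_{\mathrm{tot}}/\partial V^{u}$ is non-negative, which is exactly the monotonicity condition (46) of Lemma 2.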

Define $\bm{\theta}_{\mathrm{V}}^{u}$ as the weight parameters of the distributed critic $V_{\theta}^{u}$ shared among agents $i\in\mathcal{I}^{u}$, and denote $\bm{\mu}$ as the weights of the mixing network $f_{\mu}^{\mathrm{mix}}$. The distributed critics and the mixing network are optimized by mini-batch gradient descent to minimize the following critic loss:

$L_{V}\left(\bm{\theta}_{\mathrm{V}}^{u}(t),\bm{\mu}(t)\right)=\left[\hat{r}_{t}\left(\mathbf{s}(t),\mathbf{a}(t)\right)-V_{\mathrm{tot}}\left(\mathbf{s}(t)\right)\right]^{2}=\left[\hat{r}_{t}\left(\mathbf{s}(t),\mathbf{a}(t)\right)-f_{\mu}^{\mathrm{mix}}\left(\mathbf{s}(t),V_{\theta}^{0}\left(\widetilde{\mathbf{z}}_{1}^{0}(t)\right),\ldots,V_{\theta}^{1}\left(\widetilde{\mathbf{z}}_{J}^{1}(t)\right)\right)\right]^{2},$ (47)

where $\hat{r}_{t}\left(\mathbf{s}(t),\mathbf{a}(t)\right)=\sum_{i=1}^{n}\Gamma^{i-1}r_{t+i}+\Gamma^{n}V_{\mathrm{tot}}\left(\mathbf{s}_{t+n}\right)$ is the $n$-step return bootstrapped from the last state, with $n$ upper-bounded by $T$. Therefore, the mixing network can be updated by

$\bm{\mu}\leftarrow\bm{\mu}-\kappa_{\mu}\sum_{t}\nabla_{\bm{\mu}}L_{V}\left(\bm{\theta}_{\mathrm{V}}^{u}(t),\bm{\mu}(t)\right),$ (48)

where $\kappa_{\mu}$ is the learning rate for the mixing network update.
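A sketch of the critic target computation, assuming scalar rewards and ignoring mini-batching for clarity, is given below; the $n$-step return $\hat{r}_{t}$ follows the definition above and is detached so that gradients flow only through $V_{\mathrm{tot}}(\mathbf{s}(t))$.

```python
import torch

def critic_loss(rewards, v_tot, gamma, n):
    """rewards, v_tot: [T]; n-step targets per (47); assumes T > n."""
    T = rewards.shape[0]
    losses = []
    for t in range(T - n):
        # r_hat_t = sum_{i=1}^{n} Gamma^{i-1} r_{t+i} + Gamma^n V_tot(s_{t+n})
        r_hat = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, n + 1))
        r_hat = r_hat + gamma ** n * v_tot[t + n]
        losses.append((r_hat.detach() - v_tot[t]) ** 2)
    return torch.stack(losses).mean()
```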

To reduce complexity, we further share the weight parameters of the non-output layers between the distributed critics and actors of the same agent type, similar to [33]. Denote the combined weight parameters of the distributed actor/critic networks as $\bm{\theta}^{u}=\left[\bm{\theta}_{\mathrm{G}}^{u},\bm{\theta}_{\mathrm{A}}^{u},\bm{\theta}_{\mathrm{V}}^{u}\right]$. Therefore, the critic gradient with respect to $\bm{\theta}^{u}$ can be given by

$\Delta\bm{\theta}_{\mathrm{V}}^{u}(t)=\nabla_{\bm{\theta}^{u}}L_{V}\left(\bm{\theta}_{\mathrm{V}}^{u}(t),\bm{\mu}(t)\right).$ (49)

Hence, the update rule of the distributed actor/critic networks can be derived as

$\bm{\theta}^{u}\leftarrow\bm{\theta}^{u}+\sum_{t}\left(\kappa_{\pi}\Delta\bm{\theta}_{\pi}^{u}(t)-\kappa_{\mathrm{V}}\Delta\bm{\theta}_{\mathrm{V}}^{u}(t)\right),$ (50)

where $\kappa_{\pi}$ and $\kappa_{\mathrm{V}}$ are the learning rates for policy improvement and critic learning, respectively.
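For illustration, the joint update (50) can be written as a manual parameter step; treating the two scaled gradients separately is one way to realize the distinct learning rates and is our assumption rather than the authors' exact procedure.

```python
import torch

def joint_update(params, policy_grads, critic_grads, kappa_pi, kappa_v):
    """params and the two gradient lists are matching lists of tensors."""
    with torch.no_grad():
        for p, g_pi, g_v in zip(params, policy_grads, critic_grads):
            p += kappa_pi * g_pi - kappa_v * g_v  # ascend (43), descend (49), cf. (50)
```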

The whole GE-VDAC based MADRL algorithm is summarized in Algorithm 2.

Algorithm 2 GE-VDAC Based MADRL Algorithm
1: Initialize the parameters of the distributed actors/critics $\bm{\theta}^{u},\forall u$, and of the mixing network $\bm{\mu}$;
2: Initialize the learning rates $\kappa_{\mu},\kappa_{\pi},\kappa_{\mathrm{V}}$;
3: for each training episode do
4:  Initialize the replay buffer $\mathcal{B}=\emptyset$;
5:  for each semi-distributed execution stage do
6:   Execute the semi-distributed multi-AP multi-RIS scheduling for $T$ steps based on Algorithm 1;
7:   Add the experiences $\left\{\mathbf{s}(t),\widetilde{\mathbf{z}}(t),\mathbf{a}(t),r\left(\mathbf{a}(t),\mathbf{s}(t)\right),\mathbf{s}(t+1)\right\}_{t\in\{1,2,\ldots,T\}}$ to the replay buffer $\mathcal{B}$;
8:  end for
9:  Draw a batch of experiences from the buffer $\mathcal{B}$;
10:  Calculate the policy gradients $\Delta\bm{\theta}_{\pi}^{u},\forall u$, based on (43);
11:  Calculate the critic loss based on (47);
12:  Update the mixing network weights $\bm{\mu}$ based on (48);
13:  Calculate the critic gradients $\Delta\bm{\theta}_{\mathrm{V}}^{u},\forall u$, based on (49);
14:  Update the distributed graph-embedded actors and critics based on (50).
15: end for

IV-E Theoretical Analysis of GE-VDAC

IV-E1 Permutation Invariance

First, we show that GE-VDAC possesses the permutation-invariance property, which leads to better generalization. Consider a graph $\mathcal{G}$ whose node and edge features are denoted by $\mathbf{O}$ and $\mathbf{D}$. Let $\nu$ denote the permutation operator, and let $\nu\star\mathcal{I}$ and $\nu\star\left(\mathbf{O},\mathbf{D}\right)$ denote the permutation of the nodes (agents) and of the graph features, respectively. Given two graphs $\mathcal{G}$ and $\mathcal{G}^{\prime}$, if there exists a permutation $\nu$ satisfying $\left(\mathbf{O},\mathbf{D}\right)=\nu\star\left(\mathbf{O}^{\prime},\mathbf{D}^{\prime}\right)$, then $\mathcal{G}$ and $\mathcal{G}^{\prime}$ are isomorphic, denoted as $\mathcal{G}\cong\mathcal{G}^{\prime}$.

Definition 1.

Given two isomorphic communication graphs $\mathcal{G}\cong\mathcal{G}^{\prime}$ with permutation $\left(\mathbf{O},\mathbf{D}\right)=\nu\star\left(\mathbf{O}^{\prime},\mathbf{D}^{\prime}\right)$, the distributed actors/critics are permutation invariant if the output joint action vectors $\mathbf{a},\mathbf{a}^{\prime}$ and joint critic vectors $\mathbf{v},\mathbf{v}^{\prime}$ satisfy

$\mathbf{a}=\nu\star\mathbf{a}^{\prime},\quad\mathbf{v}=\nu\star\mathbf{v}^{\prime}.$ (51)

Traditional centralized/multi-agent actor-critic algorithms usually exchange information directly and employ neural networks such as MLPs, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) to learn the actor and critic, which cannot achieve the permutation-invariance property. Although a sample and all permutations of its agents actually represent the same environment, they are viewed as utterly different samples. Since traditional algorithms cannot adapt to agents' permutations intelligently, all permutations of a sample must be fed in to train the actors and critics, leading to poor generalization and slow learning. In contrast, by extending traditional DRL algorithms, the graph-embedding-based distributed actors and critics in GE-VDAC enjoy the permutation-invariance property, as shown below.

Proposition 1.

If the aggregation functions $\phi_{u}(\cdot),\forall u$, utilized in (39) and (40) are permutation invariant, then the distributed actors/critics based on GE-VDAC satisfy permutation invariance.

Proof.

Considering the update rule of the hidden states in (40), if the aggregation functions $\phi_{u}(\cdot),\forall u$, are permutation invariant, the graph embedding features obtained by the MPGNN $\pi_{\theta_{\mathrm{G}}}^{u},\forall u$, enjoy the permutation-invariance property [35], i.e., $\pi_{\theta_{\mathrm{G}}}^{u}\left(\nu\star\mathbf{O},\nu\star\mathbf{D}\right)=\nu\star\left(\pi_{\theta_{\mathrm{G}}}^{u}\left(\mathbf{O},\mathbf{D}\right)\right)$. Given the agents' permutation $\left(\mathbf{O}^{\prime},\mathbf{D}^{\prime}\right)=\left(\nu\star\mathbf{O},\nu\star\mathbf{D}\right)$, we have

$\left(\mathbf{z}^{u}\right)^{\prime}=\pi_{\theta_{\mathrm{G}}}^{u}\left(\nu\star\mathbf{O},\nu\star\mathbf{D}\right)=\nu\star\left(\pi_{\theta_{\mathrm{G}}}^{u}\left(\mathbf{O},\mathbf{D}\right)\right)=\nu\star\mathbf{z}^{u},\ \forall u.$ (52)

Therefore, we obtain $\left(\widetilde{\mathbf{z}}_{i}^{u}\right)^{\prime}=\widetilde{\mathbf{z}}_{\nu(i)}^{u},\forall i,\forall u$, which implies the permutation equivariance of the local embedded states. Considering the definition of the local critics, we achieve

$\mathbf{v}^{\prime}=\left\{V_{\theta}^{u}\left(\left(\widetilde{\mathbf{z}}_{i}^{u}\right)^{\prime}\right)\right\}_{i\in\left(\nu\star\mathcal{I}\right)}=\left\{V_{\theta}^{u}\left(\widetilde{\mathbf{z}}_{\nu(i)}^{u}\right)\right\}_{i\in\left(\nu\star\mathcal{I}\right)}=\nu\star\mathbf{v}.$ (53)

Moreover, since the local actions are sampled from $\mathbf{a}_{\nu(i)}^{u}\sim\pi_{\theta_{\mathrm{A}}}^{u}\left(\cdot|\widetilde{\mathbf{z}}_{\nu(i)}^{u}\right)$ and $\left(\mathbf{a}_{i}^{u}\right)^{\prime}\sim\pi_{\theta_{\mathrm{A}}}^{u}\left(\cdot|\left(\widetilde{\mathbf{z}}_{i}^{u}\right)^{\prime}\right)$, we have

$\pi_{\theta_{\mathrm{A}}}^{u}\left(\left(\mathbf{a}_{i}^{u}\right)^{\prime}|\left(\widetilde{\mathbf{z}}_{i}^{u}\right)^{\prime}\right)=\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{\nu(i)}^{u}|\widetilde{\mathbf{z}}_{\nu(i)}^{u}\right),\ \forall i\in\mathcal{I}^{u},u\in\{0,1\},$ (54)

which signifies $\mathbf{a}^{\prime}=\nu\star\mathbf{a}$. This ends the proof. ∎

Based on Proposition 1, the learned distributed actors/critics in GE-VDAC are insensitive to the agents' permutations. Therefore, GE-VDAC improves both the generalization and the learning speed in multi-agent cooperative learning.
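The invariance underlying Proposition 1 can be checked numerically: under a mean aggregator, permuting an agent's inbound messages leaves its aggregated feature, and hence its local value and action distribution, unchanged.

```python
import torch

torch.manual_seed(0)
msgs = torch.randn(5, 8)          # inbound embedded features e_{i'i} of one agent
perm = torch.randperm(5)          # a permutation nu of the neighbors
assert torch.allclose(msgs.mean(dim=0), msgs[perm].mean(dim=0), atol=1e-6)
print("mean() aggregation is invariant to neighbor permutations")
```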

IV-E2 Convergence Guarantee

We also demonstrate that the proposed GE-VDAC algorithm achieves a locally optimal policy. The convergence of the GE-VDAC algorithm is theoretically guaranteed by the following theorem.

Theorem 1.

The proposed algorithm can converge to a locally optimal policy for the Dec-POMDP, i.e.,

$\liminf_{t}\left\|\nabla_{\bm{\theta}_{\pi}}L_{\pi}(\bm{\theta}_{\pi})\right\|=0,\ \mathrm{w.p.}\,1.$ (55)
Proof.

See Appendix B. ∎

IV-E3 Complexity Analysis

The computational complexity of the proposed GE-VDAC algorithm mainly stems from the MPGNNs and GRUs. For each MPGNN, the local embedding function $\psi(\cdot)$ and the combining function $\Psi(\cdot)$ are parameterized as two-layer MLPs with $S_{\psi}$ and $S_{\Psi}$ neurons per layer, respectively. Thus, the computational complexities of local embedding, aggregation, and combination at agent $i\in\mathcal{I}^{u}$ in MPGNN layer $n$ can be written as $\mathcal{O}\left(\mathrm{deg}_{i+}^{u}\left(S_{\mathrm{Z}}^{(n-1)}S_{\psi}+S_{\mathrm{E}}S_{\psi}\right)\right)$, $\mathcal{O}\left(\mathrm{deg}_{i-}^{u}S_{\mathrm{E}}\right)$, and $\mathcal{O}\left(S_{\Psi}\left(S_{\mathrm{Z}}^{(n-1)}+S_{\mathrm{E}}\right)+S_{\Psi}S_{\mathrm{Z}}^{(n)}\right)$, respectively, where the outdegree/indegree $\mathrm{deg}_{i\pm}^{u}=\left|\mathcal{N}_{i\pm}^{u}\right|$ indicates the number of outbound/inbound edges of agent $i\in\mathcal{I}^{u}$, $S_{\mathrm{E}}$ is the size of the local embedding $\mathbf{e}_{ii^{\prime}}^{(n)}$, and $S_{\mathrm{Z}}^{(n)}$ represents the size of the output vector of layer $n$, initialized as $S_{\mathrm{Z}}^{(0)}=\left(d_{\mathrm{N}}^{u}+d_{\mathrm{E}}^{ii^{\prime}}\right)$. Since all agents compute in parallel, the computational complexity of the whole $N$ MPGNN layers is $\mathcal{O}\left(\sum_{n=1}^{N}C_{\max}^{(n)}\right)$, where $C_{\max}^{(n)}$ denotes the maximal per-agent complexity in MPGNN layer $n$, i.e.,

$C_{\max}^{(n)}=\max_{i\in\mathcal{I}^{u},u\in\{0,1\}}\left[\mathrm{deg}_{i+}^{u}S_{\psi}\left(S_{\mathrm{Z}}^{(n-1)}+S_{\mathrm{E}}\right)+\mathrm{deg}_{i-}^{u}S_{\mathrm{E}}+S_{\Psi}\left(S_{\mathrm{Z}}^{(n-1)}+S_{\mathrm{E}}+S_{\mathrm{Z}}^{(n)}\right)\right].$ (56)

On the other hand, since the complexity of a GRU per weight and time step is $\mathcal{O}(1)$, its computational complexity depends on the number of GRU weight parameters [39]. For agent $i\in\mathcal{I}^{u}$, the total number of parameters of one GRU cell is $3\left(S_{\mathrm{in}}^{u}+S_{\mathrm{GRU}}\right)S_{\mathrm{GRU}}$, where $S_{\mathrm{in}}^{u}=S_{\mathrm{Z}}^{(N)}+d_{\mathrm{N}}^{u}$ denotes the size of the GRU input defined in (41), and $S_{\mathrm{GRU}}$ is the number of neurons of each gate in the GRU cell. Based on the above analyses, the computational complexity of the proposed GE-VDAC algorithm is

$\mathcal{O}\left(\sum_{n=1}^{N}C_{\max}^{(n)}+3\left(S_{\mathrm{Z}}^{(N)}+\max_{u\in\{0,1\}}d_{\mathrm{N}}^{u}+S_{\mathrm{GRU}}\right)S_{\mathrm{GRU}}\right).$ (57)

In comparison, the computational complexity of the traditional VDAC algorithm [33] (without information interaction), obtained by modifying the GRU input size and removing the MPGNN part from GE-VDAC, is $\mathcal{O}\left(3\left[S_{\mathrm{in}}^{\mathrm{VDAC}}+S_{\mathrm{GRU}}\right]S_{\mathrm{GRU}}\right)$, where $S_{\mathrm{in}}^{\mathrm{VDAC}}=\max_{i\in\mathcal{I}^{u},u\in\{0,1\}}\left(d_{\mathrm{N}}^{u}+\sum_{i^{\prime}\in\mathcal{N}_{i+}^{u}}d_{\mathrm{E}}^{ii^{\prime}}\right)$ is the maximal GRU input size over all agents. On the other hand, an advanced VDAC algorithm with direct information exchange among neighboring agents, denoted as IE-VDAC, has a complexity of $\mathcal{O}\left(3\left[S_{\mathrm{in}}^{\mathrm{IE-VDAC}}+S_{\mathrm{GRU}}\right]S_{\mathrm{GRU}}\right)$, where $S_{\mathrm{in}}^{\mathrm{IE-VDAC}}=\max_{i\in\mathcal{I}^{u},u\in\{0,1\}}\left(d_{\mathrm{N}}^{u}+\sum_{i^{\prime}\in\mathcal{N}_{i+}^{u}}d_{\mathrm{E}}^{ii^{\prime}}+\sum_{i^{\prime}\in\mathcal{N}_{i-}^{u}}d_{\mathrm{E}}^{i^{\prime}i}\right)$.
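As a back-of-the-envelope comparison under the settings of Section V ($S_{\mathrm{GRU}}=64$, $S_{\mathrm{Z}}^{(N)}=48$), the dominant GRU parameter counts $3(S_{\mathrm{in}}+S_{\mathrm{GRU}})S_{\mathrm{GRU}}$ of the three variants can be computed as follows; the node/edge feature sizes and the degree are purely illustrative assumptions.

```python
S_GRU, S_Z_N, d_N = 64, 48, 128     # S_GRU, S_Z^{(N)} from Table II; d_N assumed
d_E, deg = 256, 3                   # assumed edge feature size and node degree

S_in = {
    "GE-VDAC": S_Z_N + d_N,           # low-dimensional embedded features
    "VDAC":    d_N + deg * d_E,       # local observation incl. outbound edge features
    "IE-VDAC": d_N + 2 * deg * d_E,   # plus directly exchanged inbound features
}
for name, s in S_in.items():
    print(f"{name}: GRU params = {3 * (s + S_GRU) * S_GRU}")
```

Under these assumed sizes, GE-VDAC's GRU is several times smaller than those of VDAC and IE-VDAC, which is consistent with the shorter training times observed in Section V.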

V Numerical Results

In this section, we present numerical results to demonstrate the effectiveness of the proposed smart reconfigurable massive MIMO-NOMA networks. We consider a network with $M=3$ APs uniformly deployed on the ceiling of an $8\times 5$ m$^2$ indoor room. Furthermore, we consider $J=\{1,2,4\}$ RISs mounted on the walls. When $J=1$, the RIS is deployed in the middle of one wall; when $J\in\{2,4\}$, the RISs are symmetrically and uniformly spaced on the walls. Each AP $m$ is equipped with $N_{\mathrm{R}}=4$ RF chains to serve $K_{\mathrm{S}}=4$ SE users and $K_{\mathrm{U}}=K^{m}-K_{\mathrm{S}}$ IoT users. For simplicity, we assume that the number of users associated with each AP is the same. The minimum rate requirements of SE users and IoT users are $2$ Gbit/s and $0.1$ Gbit/s, respectively. We further assume that the traffic of both SE and IoT users arrives at the network following Poisson distributions with densities of $10$ Gbit/s and $0.2$ Gbit/s, respectively. Moreover, the maximum queue lengths of SE users and IoT users are set to $25$ Gbit/s and $10$ Gbit/s, and the violation probability is limited to $0.1$. On the other hand, each RIS is equipped with $L=20$ cost-effective RIS elements with quantization bits $B^{\mathrm{R}}=\{1,2\}$. The LoS probabilities of the AP-device and RIS-device links are calculated according to the human blockage model in [7]. Moreover, the reflection coefficient of the NLoS path is given by $\lambda=\rho^{\mathrm{F}}\rho^{\mathrm{R}}$ [19], where $\rho^{\mathrm{F}}=\frac{\cos\varphi_{\mathrm{in}}-\sqrt{n_{\mathrm{r}}^{2}-\sin^{2}\varphi_{\mathrm{in}}}}{\cos\varphi_{\mathrm{in}}+\sqrt{n_{\mathrm{r}}^{2}-\sin^{2}\varphi_{\mathrm{in}}}}$ denotes the Fresnel reflection coefficient, and $\rho^{\mathrm{R}}=e^{-\frac{1}{2}\left(\frac{4\pi f\sigma_{\mathrm{s}}\cos\varphi_{\mathrm{in}}}{c}\right)^{2}}$ is the Rayleigh roughness factor. Here, $\varphi_{\mathrm{in}}$ represents the angle of incidence and reflection, $n_{\mathrm{r}}=1.922+0.0057j$ is the refractive index, and $\sigma_{\mathrm{s}}=0.05\times 10^{-3}$ m denotes the standard deviation of the surface roughness of the material. Furthermore, the AWGN power spectral density is $\sigma^{2}=-174$ dBm/Hz. The simulation parameters and the neural network (NN) settings are summarized in Table II. For GE-VDAC, we empirically use $N=2$ MPGNN layers with local embedding size $S_{\mathrm{E}}=32$ and output sizes $S_{\mathrm{Z}}^{(1)}=S_{\mathrm{Z}}^{(2)}=48$.
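For reproducibility, the NLoS reflection coefficient $\lambda=\rho^{\mathrm{F}}\rho^{\mathrm{R}}$ can be evaluated as in the short NumPy sketch below; the incidence angle is an assumed example value, since it depends on the geometry of each reflection.

```python
import numpy as np

f, c = 0.3e12, 3e8                     # carrier frequency (Hz), speed of light (m/s)
n_r = 1.922 + 0.0057j                  # refractive index
sigma_s = 0.05e-3                      # surface roughness std (m)
phi_in = np.deg2rad(45)                # assumed incidence angle (not specified above)

root = np.sqrt(n_r ** 2 - np.sin(phi_in) ** 2)
rho_F = (np.cos(phi_in) - root) / (np.cos(phi_in) + root)                    # Fresnel
rho_R = np.exp(-0.5 * (4 * np.pi * f * sigma_s * np.cos(phi_in) / c) ** 2)   # Rayleigh
print(abs(rho_F * rho_R))              # |lambda| of the NLoS reflection
```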

TABLE II: Simulation Parameters

| Parameter | Value | NN parameter | Value |
| --- | --- | --- | --- |
| Frequency | $f=0.3$ THz | GNN layers | $N=2$ |
| Absorption coeff. | $k(f)=0.0033$ m$^{-1}$ | Local embedding size | $S_{\mathrm{E}}=32$ |
| Bandwidth | $10$ GHz | MLP neurons | $S_{\psi}=S_{\Psi}=48$ |
| Antenna gain | $\varUpsilon=20$ dBi | Aggregation function | $\phi_{u}=\mathrm{mean}(\cdot)$ |
| Time slots | $T=40$ | MPGNN output sizes | $S_{\mathrm{Z}}^{(1)}=S_{\mathrm{Z}}^{(2)}=48$ |
| Inefficiency | $\xi=1/0.38$ | 1st FC layer | $\left(S_{\mathrm{in}}^{u},S_{\mathrm{GRU}}\right)$ |
| Maximum power | $P_{\mathrm{AP}}^{\max}=5$ W | GRU neurons | $S_{\mathrm{GRU}}=64$ |
| Baseband power | $P_{\mathrm{BB}}=0.2$ W | Last FC layer | $\left(S_{\mathrm{GRU}},\text{action size}\right)$ |
| RF chain power | $P_{\mathrm{RF}}=0.16$ W | Learning rates $\kappa_{\mu},\kappa_{\pi},\kappa_{\mathrm{V}}$ | $0.0005$ |
| Phase shifter power | $P_{\mathrm{PS}}=0.03$ W | | |
| Amplifier power | $P_{\mathrm{PA}}=0.02$ W | | |
| RIS element power | $P_{\mathrm{RIS}}=B^{\mathrm{R}}\times 0.01$ W | | |
Figure 4: Convergence performance comparisons among different algorithms: (a) convergence of the test reward; (b) convergence of the energy efficiency.

Figure 5: Energy efficiency and reliability performance under different algorithms: (a) energy efficiency; (b) reliability of SE users.

We compare the proposed GE-VDAC algorithm with the following three baselines:

  • Central critic: the multi-agent centralized-critic distributed-actor algorithm, which extends the traditional multi-agent actor-critic algorithm [29] to the hybrid discrete and continuous action space.

  • VDAC: the fully distributed VDAC algorithm [33] without information exchange among agents.

  • IE-VDAC: the VDAC algorithm with direct Information Exchange among neighboring agents.

The convergence performance of the different algorithms is compared in Fig. 4, where all agents are trained for $150000$ time steps, and we set $K=24$, $N_{\mathrm{A}}=64$, and $b=1$. We can see that the value-decomposition based algorithms (VDAC, IE-VDAC, GE-VDAC) outperform the central critic algorithm, since the coordination among agents is enhanced through difference rewards. The fully distributed VDAC algorithm requires no information exchange overhead and takes the least training time, but attains a lower learning speed and expected reward than both IE-VDAC and GE-VDAC. With less information exchange overhead than IE-VDAC, GE-VDAC achieves the highest expected reward and learning speed, and the performance gap increases with $J$, demonstrating its ability to improve generalization and enhance the agents' coordination. Moreover, since GE-VDAC utilizes dimension-reduced embedded features, it has lower complexity than IE-VDAC and takes less training time, and the time gap widens as $J$ increases.

Figure 6: The relationship between sum rate and power allocation.

In Fig. 5, we show the energy efficiency and reliability performance of the smart reconfigurable THz MIMO-NOMA networks under different numbers of associated users $K$. Here, we further consider the conventional THz massive MIMO-NOMA network without the aid of RISs as a benchmark, which utilizes equal power allocation under channel-based/QoS-based user clustering and NOMA decoding. We can see that the CSI-based MIMO-NOMA scheme yields the lowest reliability for SE users, and the performance gap grows dramatically as the number of associated users increases. Without the aid of RISs, the QoS-based MIMO-NOMA scheme provides higher reliability than the CSI-based scheme, but leads to the lowest energy efficiency. In contrast, the proposed algorithms achieve the highest global energy efficiency while guaranteeing the reliability of SE users. Even when the number of IoT users increases, the reliability of SE users remains above $0.94$, which verifies that the proposed reconfigurable THz MIMO-NOMA network can provide on-demand QoS for heterogeneous users with efficient spectral utilization.

Fig. 6 and Fig. 7 present the performance of the proposed networks under different numbers of AP antennas $N_{\mathrm{A}}$ and RIS quantization bits $b$. As shown in Fig. 6, when $N_{\mathrm{A}}$ and $J$ increase, the power consumption increases due to the higher hardware power dissipation. Meanwhile, the sum rate of the users increases with $N_{\mathrm{A}}$ but decreases with $J$, since the spatial stream interference becomes more severe as $J$ grows. In Fig. 7(a), the energy efficiency of the proposed algorithms declines with $N_{\mathrm{A}}$ and $J$ due to the higher circuit power dissipation. Compared with $1$-bit RISs, $2$-bit RISs incur higher power dissipation to achieve a competitive data rate. Nevertheless, $2$-bit RISs can achieve higher energy efficiency when $J=4$ and the number of antennas is large, since they have higher flexibility to suppress the severe spatial interference. Furthermore, Fig. 7(b) shows that the reliability of SE users increases with $N_{\mathrm{A}}$, $J$, and $b$, which demonstrates the effectiveness of the massive MIMO structure and the cooperative RISs in avoiding blockage.

Figure 7: System performance under different numbers of AP antennas: (a) energy efficiency; (b) reliability of SE users.

VI Conclusion

In this work, we proposed a novel smart reconfigurable THz MIMO-NOMA framework, which empowers customizable and intelligent indoor communications. We first proposed specific schemes for customized user clustering, NOMA decoding, and hybrid beamforming. Thereafter, we formulated a long-term Dec-POMDP problem under a joint dynamic RIS element selection, coordinated discrete phase-shift control, and power allocation strategy, which maximizes the network energy efficiency while ensuring heterogeneous data rates and reliability for users on demand. To efficiently solve the intractable and highly complex MINLP problem, we further developed a decentralized GE-VDAC based MADRL algorithm. We theoretically demonstrated that the GE-VDAC algorithm achieves locally optimal convergence and better generalization. Numerical results show that the proposed algorithm achieves higher energy efficiency and faster learning speed than traditional MADRL methods, while attaining highly customized communications through efficient resource utilization.

Appendix A Proof of Lemma 1

Considering the definition of $q_{nk}^{m}$ and the fact that $\left(\left[x\right]^{+}\right)^{2}\leq x^{2}$, we have

$\frac{1}{2}\left(q_{nk}^{m}(t+1)\right)^{2}\leq\frac{1}{2}\left[q_{nk}^{m}(t)-R_{nk}^{m}(t)+A_{nk}^{m}(t)\right]^{2}\leq\frac{1}{2}\left(q_{nk}^{m}(t)\right)^{2}+C_{nk}^{m,\mathrm{q}}+q_{nk}^{m}(t)\left(A_{nk}^{m}(t)-R_{nk}^{m}(t)\right).$ (58)

Similarly, considering the definition of $Y_{nk}^{m}$, we have

$\frac{1}{2}\left(Y_{nk}^{m}(t+1)^{2}-Y_{nk}^{m}(t)^{2}\right)\overset{(a)}{\leq}\frac{1}{2}\left(q_{nk}^{m}(t+1)\right)^{2}+C_{nk}^{m,\mathrm{Y}}+Y_{nk}^{m}(t)q_{nk}^{m}(t+1),$ (59)

where $(a)$ holds because $q_{nk}^{m,\max}\epsilon_{nk}^{m}$ and $q_{nk}^{m}(t)$ are non-negative. Substituting (58) into (59) and using the fact that $q_{nk}^{m}(t+1)=q_{nk}^{m}(t)-R_{nk}^{m}(t)+A_{nk}^{m}(t)$ due to the nonempty traffic queue, we obtain

$\frac{1}{2}\left(Y_{nk}^{m}(t+1)^{2}-Y_{nk}^{m}(t)^{2}\right)\leq C_{nk}^{m,\mathrm{q}}+C_{nk}^{m,\mathrm{Y}}-\left(q_{nk}^{m}(t)+Y_{nk}^{m}(t)\right)R_{nk}^{m}(t)+B_{nk}^{m}(t).$ (60)

Substituting (58) and (60) into $\Delta\mathcal{L}_{nk}^{m}(t)=\frac{1}{2}\left[q_{nk}^{m}(t+1)^{2}-q_{nk}^{m}(t)^{2}\right]+\frac{1}{2}\left[Y_{nk}^{m}(t+1)^{2}-Y_{nk}^{m}(t)^{2}\right]$ concludes the proof.

Appendix B Proof of Theorem 1

For notational simplicity, we omit the time slot index $t$ in the following proof. The joint policy can be written as the product of the graph embedding sub-policy and the independent action generation sub-policies:

$\pi\left(\mathbf{a}|\mathbf{s}\right)=\prod_{u}\prod_{i}\pi_{\theta_{\mathrm{G}}}\left(\widetilde{\mathbf{z}}_{i}^{u}|\mathbf{O},\mathbf{D}\right)\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}|\widetilde{\mathbf{z}}_{i}^{u}\right).$ (61)

Therefore, the policy gradient in (43) can be rewritten as

$\mathbf{g}_{\pi}=\mathbb{E}_{\pi}\left[\sum_{u}\sum_{i}\nabla_{\bm{\theta}_{\pi}}\log\left(\pi_{\theta_{\mathrm{G}}}\left(\widetilde{\mathbf{z}}_{i}^{u}|\mathbf{O},\mathbf{D}\right)\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}|\widetilde{\mathbf{z}}_{i}^{u}\right)\right)A\left(\mathbf{a},\mathbf{s}\right)\right]=\mathbb{E}_{\pi}\left[\nabla_{\bm{\theta}_{\pi}}\log\left(\prod_{u}\prod_{i}\pi_{\theta_{\mathrm{G}}}\left(\widetilde{\mathbf{z}}_{i}^{u}|\mathbf{O},\mathbf{D}\right)\pi_{\theta_{\mathrm{A}}}^{u}\left(\mathbf{a}_{i}^{u}|\widetilde{\mathbf{z}}_{i}^{u}\right)\right)A\left(\mathbf{a},\mathbf{s}\right)\right]\overset{(61)}{=}\mathbb{E}_{\pi}\left[\nabla_{\bm{\theta}_{\pi}}\log\pi\left(\mathbf{a}|\mathbf{s}\right)\left(Q\left(\mathbf{a},\mathbf{s}\right)-V_{\mathrm{tot}}\left(\mathbf{s}\right)\right)\right],$ (62)

where $\bm{\theta}_{\pi}=\left[\left\{\bm{\theta}_{\mathrm{G}}^{u}\right\},\left\{\bm{\theta}_{\mathrm{A}}^{u}\right\}\right]$ is the joint policy parameter vector.

Denoting $d^{\pi}(\mathbf{s})$ as the discounted ergodic state distribution, we can obtain [33], [40]

$\nabla_{\bm{\theta}_{\pi}}\log\pi\left(\mathbf{a}|\mathbf{s}\right)V_{\mathrm{tot}}\left(\mathbf{s}\right)=\sum_{\mathbf{s}}d^{\pi}(\mathbf{s})V_{\mathrm{tot}}\left(\mathbf{s}\right)\nabla_{\bm{\theta}_{\pi}}\sum_{\mathbf{a}}\pi\left(\mathbf{a}|\mathbf{s}\right)=\sum_{\mathbf{s}}d^{\pi}(\mathbf{s})V_{\mathrm{tot}}\left(\mathbf{s}\right)\nabla_{\bm{\theta}_{\pi}}1=\mathbf{0}.$ (63)

Substituting (63) into (62), we have

$\mathbf{g}_{\pi}=\mathbb{E}_{\pi}\left[\nabla_{\bm{\theta}_{\pi}}\log\pi\left(\mathbf{a}|\mathbf{s}\right)Q\left(\mathbf{s},\mathbf{a}\right)\right],$ (64)

which is the standard policy gradient of a single-agent actor-critic algorithm. It is proven in [41] that an actor-critic algorithm based on such a gradient converges to a local maximum of the expected return, which satisfies (55). This ends the proof.

References

  • [1] M. Giordani, M. Polese, M. Mezzavilla, S. Rangan, and M. Zorzi, “Toward 6G networks: use cases and technologies,” IEEE Commun. Mag., vol. 58, no. 3, pp. 55-61, Mar. 2020.
  • [2] L. Yan, C. Han, and J. Yuan, “A dynamic array-of-subarrays architecture and hybrid precoding algorithms for terahertz wireless communications,” IEEE J. Sel. Areas Commun., vol. 38, no. 9, pp. 2041-2056, Sept. 2020.
  • [3] Y. Liu, Z. Qin, M. Elkashlan, Z. Ding, A. Nallanathan, and L. Hanzo, “Nonorthogonal multiple access for 5G and beyond,” Proc. IEEE, vol. 105, no. 12, pp. 2347-2381, Dec. 2017.
  • [4] Y. Liu, H. Xing, C. Pan, A. Nallanathan, M. Elkashlan, and L. Hanzo, “Multiple-antenna-assisted non-orthogonal multiple access,” IEEE Wireless Commun., vol. 25, no. 2, pp. 17-23, Apr. 2018.
  • [5] M. Zeng, A. Yadav, O. A. Dobre, G. I. Tsiropoulos, and H. V. Poor, “Capacity comparison between MIMO-NOMA and MIMO-OMA with multiple users in a cluster,” IEEE J. Sel. Areas Commun., vol. 35, no. 10, pp. 2413-2424, Oct. 2017.
  • [6] H. Zhang, H. Zhang, W. Liu, K. Long, J. Dong, and V. C. M. Leung, “Energy efficient user clustering, hybrid precoding and power optimization in terahertz MIMO-NOMA systems,” IEEE J. Sel. Areas Commun., vol. 38, no. 9, pp. 2074-2085, Sept. 2020.
  • [7] Y. Wu, J. Kokkoniemi, C. Han, and M. Juntti, “Interference and coverage analysis for terahertz networks with indoor blockage effects and line-of-sight access point association,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1472-1486, Mar. 2021.
  • [8] S. Gong, X. Lu, D. T. Hoang, D. Niyato, L. Shu, D. I. Kim, and Y. C. Liang, “Toward smart wireless communications via intelligent reflecting surfaces: A contemporary survey,” IEEE Commun. Surveys Tuts., vol. 22, no. 4, pp. 2283-2314, 4th Quart. 2020.
  • [9] M. D. Renzo, A. Zappone, M. Debbah, M. Alouini, C. Yuen, J. Rosny, and S. Tretyakov, “Smart radio environments empowered by reconfigurable intelligent surfaces: How it works, state of research, and the road ahead,” IEEE J. Sel. Areas Commun., vol. 38, no. 11, pp. 2450-2525, Nov. 2020.
  • [10] Y. Liu, X. Liu, X. Mu, T. Hou, J. Xu, M. D. Renzo, and N. Al-Dhahir, “Reconfigurable intelligent surfaces: principles and opportunities,” IEEE Commun. Surveys Tuts., vol. 23, no. 3, pp. 1546-1577, 3rd Quart. 2021.
  • [11] C. Chaccour, M. Naderi Soorki, W. Saad, M. Bennis, and P. Popovski, “Risk-based optimization of virtual reality over terahertz reconfigurable intelligent surfaces,” in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2020, pp. 1-6.
  • [12] Y. Liu, X. Mu, X. Liu, M. D. Renzo, Z. Ding, and R. Schober, “Reconfigurable intelligent surface (RIS) aided multi-user networks: Interplay between NOMA and RIS,” 2020, arXiv:2011.13336. [Online]. Available: https://arxiv.org/abs/2011.13336.
  • [13] X. Mu, Y. Liu, L. Guo, J. Lin, and N. Al-Dhahir, “Capacity and optimal resource allocation for IRS-assisted multi-user communication systems,” IEEE Trans. on Commun., vol. 69, no. 6, pp. 3771-3786, Jun. 2021.
  • [14] Y. Li, M. Jiang, Q. Zhang, and J. Qin, “Joint beamforming design in multi-cluster MISO NOMA intelligent reflecting surface-aided downlink communication networks,” 2019, arXiv:1909.06972. [Online]. Available: http://arxiv.org/abs/1909.06972.
  • [15] X. Mu, Y. Liu, L. Guo, J. Lin, and N. Al-Dhahir, “Exploiting intelligent reflecting surfaces in NOMA networks: Joint beamforming optimization,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6884-6898, Oct. 2020.
  • [16] X. Liu, Y. Liu, Y. Chen, and H. V. Poor, “RIS enhanced massive non-orthogonal multiple access networks: Deployment and passive beamforming design,” IEEE J. Sel. Areas Commun., vol. 39, no. 4, pp. 1057-1071, Apr. 2021.
  • [17] X. Zhu, Z. Wang, L. Dai, and Q. Wang, “Adaptive hybrid precoding for multiuser massive MIMO,” IEEE Commun. Lett., vol. 20, no. 4, pp. 776-779, Apr. 2016.
  • [18] Q. Wu, S. Zhang, B. Zheng, C. You, and R. Zhang, “Intelligent reflecting surface-aided wireless communications: A tutorial,” IEEE Trans. on Commun., vol. 69, no. 5, pp. 3313-3351, May 2021.
  • [19] R. Piesiewicz, C. Jansen, D. Mittleman, T. Kleine-Ostmann, M. Koch, and T. Kurner, “Scattering analysis for the modeling of THz communication systems,” IEEE Trans. Antennas Propag., vol. 55, no. 11, pp. 3002-3009, Nov. 2007.
  • [20] J. M. Jornet and I. F. Akyildiz, “Channel modeling and capacity analysis for electromagnetic wireless nanonetworks in the terahertz band,” IEEE Trans. Wireless Commun., vol. 10, no. 10, pp. 3211-3221, Oct. 2011.
  • [21] L. Dai, B. Wang, M. Peng, and S. Chen, “Hybrid precoding-based millimeter-wave massive MIMO-NOMA with simultaneous wireless information and power transfer,” IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 131-141, Jan. 2019.
  • [22] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Distributed federated learning for ultra-reliable low-latency vehicular communications,” IEEE Trans. Commun., vol. 68, no. 2, pp. 1146-1159, Nov. 2019.
  • [23] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, “Reconfigurable intelligent surfaces for energy efficiency in wireless communication,” IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157-4170, Aug. 2019.
  • [24] S. Khairy, P. Balaprakash, L. X. Cai, and Y. Cheng, “Constrained deep reinforcement learning for energy sustainable multi-UAV based random access IoT networks with NOMA,” IEEE J. Sel. Areas Commun., vol. 39, no. 4, pp. 1101-1115, Apr. 2021.
  • [25] W. Wu, P. Yang, W. Zhang, C. Zhou, and S. Shen, “Accuracy-guaranteed collaborative DNN inference in industrial IoT via deep reinforcement learning,” IEEE Trans. Ind. Inform., vol. 17, no. 7, pp. 4988-4998, Jul. 2021.
  • [26] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1-211, 2010.
  • [27] B. K. Ghosh, “Probability inequalities related to Markov’s theorem,” Amer. Statist., vol. 56, no. 3, pp. 186-190, Aug. 2002.
  • [28] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826-3839, Sept. 2020.
  • [29] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Proc. Adv. Neural Inf. Process. Syst., pp. 6379-6390, 2017.
  • [30] D. H. Wolpert and K. Tumer, “Optimal payoff functions for members of collectives,” Adv. Complex Syst., vol. 4, pp. 265-279, 2001.
  • [31] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” 2018, arXiv:1803.11485. [Online]. Available: http://arxiv.org/abs/1803.11485.
  • [32] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” 2019, arXiv:1905.05408. [Online]. Available: http://arxiv.org/abs/1905.05408.
  • [33] J. Su, S. Adams, and P. A. Beling, “Value-decomposition multi-agent actor-critics,” Proc. 35th AAAI Conf. Artif. Intell., pp. 11352-11360, 2021.
  • [34] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” Proc. Int. Conf. Learn. Represent., May 2019. [Online]. Available: https://openreview.net/forum?id=ryGs6iA5Km.
  • [35] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “Graph neural networks for scalable radio resource management: Architecture design and theoretical analysis,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 101-115, Jan. 2021.
  • [36] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” 2014, arXiv:1406.1078. [Online]. Available: http://arxiv.org/abs/1406.1078.
  • [37] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014, arXiv:1412.3555. [Online]. Available: http://arxiv.org/abs/1412.3555.
  • [38] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” 2016, arXiv:1609.09106. [Online]. Available: http://arxiv.org/abs/1609.09106.
  • [39] J. Xie, J. Fang, C. Liu, and X. Li, “Deep learning-based spectrum sensing in cognitive radio: A CNN-LSTM approach”, IEEE Commun. Lett., vol. 24, no. 10, pp. 2196-2200, Oct. 2020.
  • [40] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances Neural Inform. Processing Syst., Vol. 99, pp. 1057-1063, Nov. 2000.
  • [41] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” Advances Neural Inform. Processing Syst., pp. 1008-1014, 2000.