
Mini-Batch Gradient-Based MCMC for Decentralized Massive MIMO Detection

Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, and Shi Jin

This paper was accepted in part for presentation at the International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Lucca, Italy, in September 2024 [1].
X. Zhou, J. Zhang, and S. Jin are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: xy_zhou@seu.edu.cn; jingzhang@seu.edu.cn; jinshi@seu.edu.cn).
L. Liang is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China, and also with Purple Mountain Laboratories, Nanjing 211111, China (e-mail: lliang@seu.edu.cn).
C.-K. Wen is with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (e-mail: chaokai.wen@mail.nsysu.edu.tw).
Abstract

Massive multiple-input multiple-output (MIMO) technology has significantly enhanced spectral and power efficiency in cellular communications and is expected to further evolve towards extra-large-scale MIMO. However, centralized processing for massive MIMO faces practical obstacles, including excessive computational complexity and a substantial volume of baseband data to be exchanged. To address these challenges, decentralized baseband processing has emerged as a promising solution. This approach involves partitioning the antenna array into clusters with dedicated computing hardware for parallel processing. In this paper, we investigate the gradient-based Markov chain Monte Carlo (MCMC) method—an advanced MIMO detection technique known for its near-optimal performance in centralized implementation—within the context of a decentralized baseband processing architecture. This decentralized design mitigates the computation burden at a single processing unit by utilizing computational resources in a distributed and parallel manner. Additionally, we integrate the mini-batch stochastic gradient descent method into the proposed decentralized detector, achieving remarkable performance with high efficiency. Simulation results demonstrate substantial performance gains of the proposed method over existing decentralized detectors across various scenarios. Moreover, complexity analysis reveals the advantages of the proposed decentralized strategy in terms of computation delay and interconnection bandwidth when compared to conventional centralized detectors.

Index Terms:
Massive MIMO, decentralized baseband processing, data detection, Markov chain Monte Carlo, mini-batch stochastic gradient descent.

I Introduction

MASSIVE multiple-input multiple-output (MIMO) technology has emerged as a promising solution to address the increasing demand for high data rates and improved spectral efficiency in wireless communication systems of the fifth generation and beyond [2]. By deploying a large number of antennas at the base station (BS), it becomes feasible to achieve a substantial enhancement in network capacity and throughput. However, the utilization of an extremely large antenna array introduces high-dimensional signal processing tasks, particularly challenging in MIMO uplinks, where accurate detection of signals from the users is required to fulfill the potential of massive MIMO [3, 4].

Many existing MIMO detection schemes, such as the optimal maximum likelihood (ML) detector and various linear and nonlinear detectors [3], are designed according to the conventional centralized baseband processing architecture. However, this architecture faces several challenges when applied to extremely large-scale MIMO systems. In this setup, a central unit (CU) assumes responsibility for the entire symbol detection process, imposing stringent demands on the computational capability, baseband interconnection bandwidth, and storage capacity of a single unit [5, 6, 7]. Notably, existing MIMO detectors inherently involve matrix operations, such as matrix multiplications or inversions, which become impractical on a single computing device as the antenna count increases due to substantial computational complexity. Additionally, for centralized MIMO detection, the transfer of global channel state information and raw baseband data from all BS antennas to the CU is required. This process demands significant interconnection bandwidth, often surpassing data rate limits of existing interconnection standards or chip I/O interfaces [8, 6, 5].

To address these challenges, decentralized baseband processing (DBP) has been proposed and gained significant attention recently [6]. The concept behind DBP is to partition the large-scale antenna array into a few smaller disjoint antenna clusters, each with local radio frequency (RF) chains and independent baseband processing modules located in distributed units (DUs). The DUs handle local baseband data received from RF chains within their respective clusters. This DBP approach distributes the demanding signal processing tasks across parallel procedures on multiple computing units, like field-programmable gate arrays [9] or graphics processing units [10], substantially reducing computation demands at the CU and overall processing latency. Moreover, the requirement for raw baseband data transfer between each baseband unit and the corresponding RF circuitry is considerably diminished compared to the centralized architecture, as only a subset of BS antennas needs to transmit their data to the relevant processing module.

Despite the potential advantages of the DBP architecture, research into customized decentralized MIMO detection schemes within this framework remains incomplete. In particular, the exploration of high-performance decentralized detectors is as important as it is currently lacking. The pioneering work [6] decentralized the conjugate gradient algorithm and the alternating direction method of multipliers (ADMM), resulting in iterative consensus-sharing schemes that approach the performance of the centralized linear minimum mean square error (LMMSE) method; the latter, however, is widely known to be suboptimal. Similar endeavors were pursued in [9] and [11] to decentralize the Newton and coordinate descent methods, respectively, aiming to bypass the intricate matrix inversions involved in centralized linear detectors. Furthermore, the authors of [12] developed decentralized feedforward architectures [5, 13] for the classical maximum ratio combining, zero-forcing, and LMMSE detectors, considerably reducing system latency. Nonetheless, these linear detectors are markedly suboptimal, and the fully decentralized scheme suffers substantial performance degradation as the number of clusters increases.

Instead of decentralizing linear detectors, an alternative research direction explores the decentralized design of nonlinear message passing-based detection methods. Specifically, the authors of [12] also applied the large-MIMO approximate message passing (LAMA) algorithm within partially and fully decentralized feedforward architectures. However, LAMA becomes unstable when the channel fading coefficients deviate from the independent and identically distributed (IID) Gaussian assumption. In [14], a decentralized detector based on expectation propagation (EP) was proposed, demonstrating performance advantages over various decentralized linear detectors [6, 10, 11]. Nonetheless, the decentralized EP method requires iterative matrix inversions at each DU, posing challenges for computational efficiency as the number of users grows. Other message passing detector variants tailored to the decentralized architecture have been explored [15, 16, 17]. Regrettably, these endeavors have yet to strike a desirable balance between performance and complexity.

As evident from the discussion above, existing decentralized MIMO detectors often struggle with suboptimal trade-offs between performance and complexity, particularly in scenarios with high user density or strongly correlated channels. These limitations underscore the need to explore advanced detection techniques for formulating more competitive decentralized detectors. Recently, Markov chain Monte Carlo (MCMC) has demonstrated its potential to overcome the limitations of deterministic Bayesian inference approaches (e.g., EP) [18, 19]. MCMC-based MIMO detectors are favored for their near-optimal performance [20, 21] and hardware-friendly architecture [22]. Furthermore, the combination of gradient descent (GD) optimization and MCMC offers a framework to efficiently solve the optimum search problem of MIMO detection [23, 24, 25, 26]. This scheme employs GD to explore important regions of the discrete search space and utilizes the random walk and acceptance test of MCMC to generate samples conforming to the posterior distribution of the transmit symbols [27]. Thus, the gradient-based MCMC method holds the potential to achieve near-ML performance, unattainable by LMMSE or message passing-based detectors, while maintaining acceptable complexity. Notably, recent work [27] combines MCMC with Nesterov's accelerated gradient (NAG) method [28] to derive a NAG-aided MCMC (NAG-MCMC) detector with near-optimal performance, high efficiency, and excellent scalability. However, existing literature has not explored the decentralized design of potent MCMC-based MIMO detectors and their gradient-based variations.

Motivated by the potential benefits of the DBP architecture and the limited exploration of decentralized detectors within this framework, this study seeks to establish a decentralized gradient-based MCMC detector operating within the DBP framework. The central goal is to mitigate computational and interconnection bandwidth limitations associated with conventional centralized processing, all while upholding near-optimal detection performance. The contributions of this paper are summarized as follows:

  • Decentralized Gradient-Based MCMC Detector for DBP Architecture: This study pioneers the decentralization of the gradient-based MCMC method for MIMO detection within the DBP architecture, employing both star and daisy-chain topologies. The proposed decentralized detector inherits the remarkable detection performance and scalability inherent in the original gradient-based MCMC algorithm. Decentralization enables distributed and parallel computations, efficiently redistributing the computation burden from the CU to multiple DUs. Consequently, our proposed detector exhibits high computational efficiency in terms of latency.

  • Mini-Batch-Based Gradient Updating for Efficiency Enhancement: We introduce a mini-batch-based gradient updating rule for the decentralized gradient-based MCMC detector, resulting in the Mini-NAG-MCMC algorithm. This innovation significantly reduces computation and interconnection demands by selecting only a mini-batch of DUs for gradient computations per iteration. The integration of stochastic gradients within the updating rule synergizes with the MCMC technique, facilitating escape from local minima and achieving near-optimal performance. Additionally, we leverage the asymptotically favorable propagation property of massive MIMO systems, incorporating an approximation into the proposed detector to further reduce overhead.

  • Comprehensive Analysis and Performance Validation: We conduct a comprehensive analysis of the proposed method’s computational complexity and interconnection bandwidth, showcasing its superiority over centralized schemes. Furthermore, extensive simulations validate the method’s performance superiority over existing decentralized detectors across diverse scenarios, including IID Rayleigh fading MIMO channels and realistic 3rd Generation Partnership Project (3GPP) channel models. These results affirm the proposed method’s effective balance between performance and complexity.

Figure 1: Block diagram of the uplink massive MIMO system with the DBP architecture. The solid lines depict the connections between CU and DUs in the star DBP topology, while the dashed lines depict the counterparts in the daisy-chain DBP topology.

Notations: Boldface letters denote column vectors or matrices. $\mathbf{0}$ denotes a zero vector, and $\mathbf{I}$ denotes an identity matrix. $\mathbf{A}^{T}$, $\mathbf{A}^{\rm H}$, and $\mathbf{A}^{-1}$ denote the transpose, conjugate transpose, and inverse of the matrix $\mathbf{A}$, respectively. ${\rm diag}(\mathbf{A})$ generates a diagonal matrix by assigning zero values to the off-diagonal elements of $\mathbf{A}$. $\|\cdot\|$ is the $l_{2}$ norm, and $\|\cdot\|_{F}$ is the Frobenius norm. $\mathbb{E}[\cdot]$ represents the expectation of random variables. $\mathcal{CN}(\mu,\sigma^{2})$ denotes a circularly symmetric complex Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. $\mathcal{U}(a,b)$ indicates a uniform distribution on $[a,b]$. $\mathbb{R}$ and $\mathbb{C}$ are the sets of real and complex numbers, respectively. The set $[K]=\{1,2,\ldots,K\}$ contains all positive integers up to $K$.

II System Model

In this section, we first present the uplink MIMO system model and the associated MIMO detection problem. We then elaborate on the DBP architecture for the MIMO uplink and discuss fundamental aspects of designing decentralized MIMO detectors.

II-A Uplink MIMO System Model

Consider an uplink massive MIMO system with $U$ single-antenna users and a BS equipped with $B\geq U$ receiving antennas, as shown in Fig. 1. Each user independently encodes its information bit stream, modulates the coded bits using the quadrature amplitude modulation (QAM) scheme, and transmits the modulated symbols over the wireless channel to the BS. (We consider a flat-fading channel in the system model; however, the proposed method can be readily generalized to handle frequency selectivity by using orthogonal frequency division multiplexing (OFDM), in which case MIMO detection is conducted independently on each flat-fading subcarrier.) The input-output relationship of this uplink transmission is given by

\mathbf{y}=\mathbf{Hx}+\mathbf{n}, (1)

where $\mathbf{y}\in\mathbb{C}^{B\times 1}$ is the received signal at the BS, $\mathbf{H}\in\mathbb{C}^{B\times U}$ is the uplink channel matrix, $\mathbf{x}\in\mathcal{A}^{U\times 1}$ consists of the transmit symbols from all users, with $\mathcal{A}$ representing the QAM lattice set of cardinality $M$ and unit symbol energy, and $\mathbf{n}\in\mathbb{C}^{B\times 1}$ represents the IID circular complex Gaussian noise with zero-mean, variance-$\sigma^{2}$ entries. The signal-to-noise ratio (SNR) is given by $\mathrm{SNR}=\mathbb{E}[\|\mathbf{Hx}\|^{2}]/\mathbb{E}[\|\mathbf{n}\|^{2}]$. The objective of uplink MIMO detection is to estimate the transmit vector $\mathbf{x}$ given the received signal $\mathbf{y}$ and knowledge of the uplink channel matrix $\mathbf{H}$. The optimal ML solution is given by

\mathbf{x}^{\ast}=\underset{\mathbf{x}\in\mathcal{A}^{U\times 1}}{\arg\min}\;\|\mathbf{y}-\mathbf{Hx}\|^{2}. (2)

Note that the finite-alphabet constraint on $\mathbf{x}$ makes optimal MIMO detection an integer least-squares problem, known to be non-deterministic polynomial-time hard (NP-hard) [3].
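For a toy-sized system, the ML search in (2) can be carried out by exhaustive enumeration, which also makes the exponential $M^{U}$ cost of the problem concrete. The sketch below is illustrative only: the system sizes, random seed, noise level, and QPSK alphabet are assumptions, not values from the paper.

```python
import itertools
import numpy as np

# Toy brute-force ML detector for Eq. (2); U = 2 users, B = 4 antennas,
# unit-energy QPSK. All sizes and the noise level are illustrative.
rng = np.random.default_rng(0)
U, B = 2, 4
A = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # QPSK lattice

H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.choice(A, size=U)
n = 0.05 * (rng.standard_normal(B) + 1j * rng.standard_normal(B))
y = H @ x_true + n

# Exhaustive search over all M^U = 16 candidate vectors.
candidates = [np.array(c) for c in itertools.product(A, repeat=U)]
x_ml = min(candidates, key=lambda x: np.linalg.norm(y - H @ x) ** 2)
```

Since $x_{\rm true}$ itself is among the candidates, the ML residual is never worse than the residual at the true vector; at 16-QAM with $U=8$ users the same loop would already need $16^{8}\approx 4\times 10^{9}$ evaluations, which is why (2) is intractable at scale.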

II-B Decentralized Baseband Processing

To deal with the heavy interconnection and computation burdens due to the use of extra-large-scale antenna arrays, DBP has been proposed as a promising alternative to traditional centralized processing [6]. Throughout the paper, we assume that the massive MIMO BS possesses the DBP architecture shown in Fig. 1. The BS antenna array is partitioned into $C\geq 1$ antenna clusters, where the $c$-th cluster is associated with $B_{c}$ antennas ($c\in[C]$), and $\sum_{c=1}^{C}B_{c}=B$. Without loss of generality, we assume that each cluster is of equal size, i.e., $B_{c}=B/C$. Each cluster is equipped with local RF chains that are connected to dedicated hardware referred to as the DU.

We consider two widely adopted DBP topologies, namely the star and daisy-chain topologies [6, 7, 9, 29, 30], both depicted in Fig. 1. We assume the data links among the CU and DUs are bidirectional for both topologies. In the star topology, all DUs are directly connected to the CU (solid lines), so the information exchange between the CU and each DU proceeds in parallel. In the daisy-chain topology, the DUs are connected sequentially [7] (dashed lines), with a single connection to the CU at the end; communication is therefore limited to adjacent units, and the information exchange is performed sequentially. Each topology has distinct strengths in the trade-off among throughput, latency, and interconnection bandwidth [9]. The star topology is well recognized for its low data-transfer latency owing to the simultaneous information exchange between the CU and each DU, while the daisy-chain topology offers a constant interconnection bandwidth regardless of the number of DUs, as will be shown in Sec. IV-B.

Under the DBP architecture, the uplink channel matrix, received signal, and noise vector are partitioned row-wise as $\mathbf{H}=[\mathbf{H}_{1}^{T},\ldots,\mathbf{H}_{C}^{T}]^{T}$, $\mathbf{y}=[\mathbf{y}_{1}^{T},\ldots,\mathbf{y}_{C}^{T}]^{T}$, and $\mathbf{n}=[\mathbf{n}_{1}^{T},\ldots,\mathbf{n}_{C}^{T}]^{T}$, respectively. Hence, the signal model at the $c$-th antenna cluster can be written as

\mathbf{y}_{c}=\mathbf{H}_{c}\mathbf{x}+\mathbf{n}_{c},\;c\in[C], (3)

where $\mathbf{H}_{c}\in\mathbb{C}^{B_{c}\times U}$ is the channel matrix between the $U$ users and the $c$-th antenna cluster, and $\mathbf{y}_{c}\in\mathbb{C}^{B_{c}\times 1}$ and $\mathbf{n}_{c}\in\mathbb{C}^{B_{c}\times 1}$ represent the local received signal and noise vector of the $c$-th antenna cluster, respectively.

Each DU $c$ exclusively utilizes its local channel matrix $\mathbf{H}_{c}$ and received signal $\mathbf{y}_{c}$ to compute intermediate detection results. (We assume that the local channel matrix $\mathbf{H}_{c}$ is perfectly known at each DU since we focus on MIMO detector design in this paper; in realistic scenarios, this knowledge can be obtained by performing local channel estimation.) The CU works as a coordinator: it periodically aggregates the intermediate information from the DUs and forms the global consensus [6], which is then broadcast to the DUs for subsequent updates, and it finally produces the estimate. In this decentralized scheme, the heavy computation burden at the CU can be offloaded to the DUs and performed in a parallel and distributed fashion, significantly reducing the overall computation latency. Meanwhile, the DUs can process the locally stored data to obtain low-dimensional intermediate information instead of directly transmitting full-dimensional data, thus saving interconnection bandwidth.
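The row-wise partitioning of the DBP model (3) can be sketched numerically as below. The sizes and seed are illustrative assumptions, and `np.split` stands in for assigning antenna clusters to DUs; stacking the local models recovers the global model (1).

```python
import numpy as np

# Illustrative sizes: B = 8 BS antennas, U = 2 users, C = 4 equal clusters.
rng = np.random.default_rng(1)
B, U, C = 8, 2, 4
Bc = B // C  # equal-size clusters, B_c = B / C

H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)
n = rng.standard_normal(B) + 1j * rng.standard_normal(B)
y = H @ x + n

# Each DU c holds only its local slice (H_c, y_c), Eq. (3).
H_locals = np.split(H, C, axis=0)
y_locals = np.split(y, C)

# Concatenating the C local models y_c = H_c x + n_c recovers y = Hx + n.
y_rebuilt = np.concatenate([Hc @ x for Hc in H_locals]) + n
```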

III Mini-Batch Gradient-Based MCMC

In this section, we first outline the basics of the gradient-based MCMC method from [27]. Then, we detail the development of the decentralized gradient-based MCMC detector, Mini-NAG-MCMC, customized for the DBP architecture.

III-A Basics of Gradient-Based MCMC

The MCMC method samples from the target posterior distribution using a Markov chain $\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{t},\ldots,\mathbf{x}_{S}$, where $S$ is the number of sampling iterations. The state transition $\mathbf{x}_{t-1}\to\mathbf{x}_{t}$ in this chain is regulated by a transition kernel $T(\mathbf{x}|\mathbf{x}_{t-1})$, specifically designed to align the chain's stationary distribution with the target distribution. Hence, the realization of this chain can be utilized to (approximately) generate samples from the target distribution, facilitating statistical inference. Notably, the integration of MCMC and the GD method has recently emerged as a promising approach for resolving nonconvex optimization challenges [23]. The underlying concept of this gradient-based MCMC approach is to leverage gradients and the Metropolis-Hastings (MH) criterion [31, 25] in the design of the transition kernel, expediting the Markov chain's convergence to the target distribution. When using gradient-based MCMC for MIMO detection, each state of the Markov chain corresponds to a sample of $\mathbf{x}$ over the discrete space $\mathcal{A}^{U\times 1}$. The initial sample $\mathbf{x}_{0}$ is either randomly generated or transferred from another detector. The transition from $\mathbf{x}_{t-1}$ to $\mathbf{x}_{t}$, i.e., the generation of the $t$-th sample $\mathbf{x}_{t}$, includes the following steps [25, 27]:

III-A1 GD

The constraint $\mathbf{x}\in\mathcal{A}^{U\times 1}$ on the transmit vector can be relaxed to the continuous complex space $\mathbb{C}^{U\times 1}$ to allow the use of gradients. With this relaxation, the GD method can be performed to provide efficient navigation to important regions of the discrete search space. Initializing $\mathbf{z}_{0}=\mathbf{x}_{t-1}$, where $\mathbf{x}_{t-1}$ is the previous sample, the GD iteration takes the basic form

\mathbf{z}_{k}=\mathbf{z}_{k-1}-\tau\nabla f(\mathbf{z}_{k-1}),\;k\in[N_{\rm g}], (4)

where $k$ is the index of the GD iteration, $N_{\rm g}$ is the total number of GD iterations, $\tau$ is the learning rate, $f(\cdot)$ is the objective function of MIMO detection, generally chosen as the squared residual norm of the estimate, $f(\mathbf{x})\propto\|\mathbf{y}-\mathbf{Hx}\|^{2}$, and $\nabla f(\cdot)$ is the gradient of the objective function.
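A minimal sketch of this relaxed GD recursion, assuming the quadratic objective $f(\mathbf{x})=\frac{1}{2}\|\mathbf{y}-\mathbf{Hx}\|^{2}$ with gradient $-\mathbf{H}^{\rm H}(\mathbf{y}-\mathbf{Hx})$; the sizes, seed, noiseless setup, and Frobenius-norm learning rate are illustrative assumptions.

```python
import numpy as np

# Relaxed GD iteration of Eq. (4) on f(x) = ||y - Hx||^2 / 2.
rng = np.random.default_rng(2)
B, U, Ng = 16, 4, 20
H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.standard_normal(U) + 1j * rng.standard_normal(U)
y = H @ x_true  # noiseless for clarity

grad = lambda v: -H.conj().T @ (y - H @ v)          # gradient of f
tau = 1.0 / np.linalg.norm(H.conj().T @ H, "fro")   # conservative step, tau <= 1/L

z = np.zeros(U, dtype=complex)  # z_0 (here an all-zero start instead of x_{t-1})
for _ in range(Ng):
    z = z - tau * grad(z)       # Eq. (4)
```

Because $\tau\leq 1/\lambda_{\max}(\mathbf{H}^{\rm H}\mathbf{H})$, each step decreases the residual, steering the iterate toward the important region of the search space before the discrete sampling steps below.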

III-A2 Random Walk and QAM Mapping

After a preset number of GD steps, we generate the MCMC candidate sample $\mathbf{x}^{[\rm cand]}$ as follows:

\mathbf{x}^{[\rm cand]}=Q(\mathbf{z}_{N_{\rm g}}+\mathbf{d}), (5)

where $\mathbf{z}_{N_{\rm g}}$ is the output of GD, $\mathbf{d}$ is a random Gaussian vector that represents the perturbation introduced by the random walk, and the function $Q(\cdot)$ maps each element of its input to the closest QAM lattice point. The random walk prevents the algorithm from getting trapped in local minima along the GD direction, and the QAM mapping discretizes the continuous update to obtain valid candidate samples.
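The mapping $Q(\cdot)$ and the random-walk perturbation of (5) can be sketched as follows; the 16-QAM levels, perturbation scale, seed, and input values are illustrative assumptions.

```python
import numpy as np

# Per-dimension levels of a unit-energy 16-QAM constellation (illustrative).
LEVELS = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(10)

def Q(z):
    """Element-wise nearest-neighbor mapping onto the QAM lattice, Eq. (5)."""
    snap = lambda v: LEVELS[np.argmin(np.abs(LEVELS[None, :] - v[:, None]), axis=1)]
    return snap(z.real) + 1j * snap(z.imag)

rng = np.random.default_rng(3)
z_Ng = np.array([0.2 + 0.9j, -0.8 - 0.1j])  # GD output (illustrative values)
d = 0.1 * (rng.standard_normal(2) + 1j * rng.standard_normal(2))  # random walk
x_cand = Q(z_Ng + d)  # valid candidate on the QAM lattice
```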

III-A3 Acceptance Test

The MH criterion [25] is applied to accept the candidate sample with a probability given by

\alpha=\min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{[\rm cand]}\|^{2})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}_{t-1}\|^{2})}\right\}, (6)

which approximately aligns the Markov chain's stationary distribution with the target posterior distribution. This acceptance probability of $\mathbf{x}^{[\rm cand]}$ can be realized by the following rule:

\mathbf{x}_{t}=\begin{cases}\mathbf{x}^{[\rm cand]}, & \text{if }\alpha\geq\nu\text{ (accept)},\\ \mathbf{x}_{t-1}, & \text{otherwise (reject)},\end{cases} (7)

where $\nu$ is drawn from the uniform distribution $\mathcal{U}(0,1)$.
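A sketch of the acceptance test (6)-(7); the function name and toy inputs are illustrative. Computing $\alpha$ through the difference of squared residuals avoids overflow in the exponentials of (6).

```python
import numpy as np

def mh_accept(y, H, x_cand, x_prev, rng):
    """Metropolis-Hastings accept/reject step, Eqs. (6)-(7)."""
    delta = np.linalg.norm(y - H @ x_cand) ** 2 - np.linalg.norm(y - H @ x_prev) ** 2
    alpha = min(1.0, np.exp(-delta))          # acceptance probability (6)
    nu = rng.uniform()                         # nu ~ U(0, 1)
    return x_cand if alpha >= nu else x_prev   # rule (7)

rng = np.random.default_rng(4)
H = np.eye(2)
y = np.array([1.0, 1.0])
x_prev = np.array([0.0, 0.0])
x_cand = np.array([1.0, 1.0])  # strictly lower residual => alpha = 1, always accepted
x_t = mh_accept(y, H, x_cand, x_prev, rng)
```

Note that a candidate with a lower residual is accepted with probability one, while a worse candidate is still accepted occasionally, which is what lets the chain escape local minima.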

Running the gradient-based MCMC sampling for several iterations yields a sample list $\mathcal{X}=\{\mathbf{x}_{t}\}_{t=0}^{S}$ containing important samples for inference. This sample list can be used to generate the hard decision by choosing the best sample according to

\hat{\mathbf{x}}=\underset{\mathbf{x}\in\mathcal{X}}{\arg\min}\;\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}. (8)
Figure 2: Mini-batch gradient-based MCMC detector implemented using the star DBP topology. The implementation using the daisy-chain topology is similar and thus omitted.

III-B Mini-Batch Gradient-Based MCMC for Decentralized MIMO Detection

Next, we present the decentralized gradient-based MCMC algorithm tailored for the DBP architecture. The central concept revolves around distributing the resource-intensive gradient computations across multiple DUs, thereby diminishing computation latency and bolstering overall efficiency.

We first provide some useful notations for ease of illustration. The local objective function at each DU $c\in[C]$ is defined as

f_{c}(\mathbf{x})=\frac{1}{2}\|\mathbf{y}_{c}-\mathbf{H}_{c}\mathbf{x}\|^{2}. (9)

Therefore, the local gradient is given by

\nabla f_{c}(\mathbf{x})=-\mathbf{H}_{c}^{\rm H}(\mathbf{y}_{c}-\mathbf{H}_{c}\mathbf{x}). (10)

The optimal MIMO detection problem in (2) can be rewritten as

\mathbf{x}^{\ast}=\underset{\mathbf{x}\in\mathcal{A}^{U\times 1}}{\arg\min}\;f(\mathbf{x}), (11)

where the global objective function $f(\mathbf{x})$ is given by

f(\mathbf{x})\triangleq\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}=\sum_{c=1}^{C}f_{c}(\mathbf{x}). (12)

The decomposition of the global objective function into multiple local objective functions allows the gradients to be calculated on each DU as shown in (10). These local gradients can then be collected at the CU to derive the full gradient $\nabla f(\mathbf{x})=\sum_{c=1}^{C}\nabla f_{c}(\mathbf{x})$, which enables the GD-based optimization in (4). The distributed computations alleviate the computation burden of directly calculating the full gradient at the CU.
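The additivity of the local gradients (10) can be checked numerically: summing the per-DU gradients reproduces the full gradient of (12) exactly. Sizes and seed below are illustrative.

```python
import numpy as np

# Check: sum of local gradients (10) equals the full gradient -H^H (y - Hx).
rng = np.random.default_rng(5)
B, U, C = 12, 3, 4
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
y = rng.standard_normal(B) + 1j * rng.standard_normal(B)
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)

H_locals, y_locals = np.split(H, C, axis=0), np.split(y, C)
local_grads = [-Hc.conj().T @ (yc - Hc @ x) for Hc, yc in zip(H_locals, y_locals)]

full_grad = -H.conj().T @ (y - H @ x)
```

Each local gradient is a length-$U$ vector, so the per-iteration uplink traffic of this scheme is $U$ complex values per participating DU rather than the $B_{c}$-dimensional raw data.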

However, the decentralization of the standard full-batch GD in (4) does not reduce the overall computation cost. Moreover, this decentralized full-batch GD requires information exchange between the CU and all DUs in each iteration, resulting in high interconnection costs. To further mitigate the computation and interconnection requirements associated with the full-batch GD, we turn to the mini-batch stochastic GD method [32], which shows promise in efficiently optimizing objective functions in the finite-sum form of (12). This method serves as an efficient alternative to the full-batch GD in (4), offering the potential for rapid convergence and overhead reduction. Specifically, we introduce a mini-batch-based gradient updating rule tailored to the DBP architecture, as detailed below.

In each GD iteration $k$, the estimate is not updated using the full gradient $\nabla f(\cdot)$. Instead, a subset of DUs, termed a mini-batch and denoted as $\mathcal{I}$, is randomly selected for the update. The mini-batch contains $m=|\mathcal{I}|$ elements, where $m$ is the batch size, and we assume that the number of antenna clusters $C$ is divisible by $m$. The mini-batch-based GD iteration operates as

\mathbf{z}_{k}=\mathbf{z}_{k-1}-\tau\nabla f_{\mathcal{I}}(\mathbf{z}_{k-1}), (13)

where $\nabla f_{\mathcal{I}}(\cdot)$ denotes the estimated gradient based on the mini-batch $\mathcal{I}$, expressed as

\nabla f_{\mathcal{I}}(\mathbf{z}_{k-1})=\frac{C}{m}\sum_{c\in\mathcal{I}}\nabla f_{c}(\mathbf{z}_{k-1}), (14)

where the scaling factor $C/m$ ensures the unbiasedness of the mini-batch gradient as an estimator of the full gradient [32], that is,

\mathbb{E}[\nabla f_{\mathcal{I}}(\mathbf{x})]=\nabla f(\mathbf{x}), (15)

with the expectation taken over all possible selections of the mini-batch $\mathcal{I}$. This guarantees the asymptotic convergence of the mini-batch GD method to the minimum of the continuously relaxed version of problem (11).
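The unbiasedness (15) can be verified numerically by averaging the scaled mini-batch gradient (14) over all $\binom{C}{m}$ equally likely mini-batches, which recovers the full gradient exactly. Sizes and seed are illustrative.

```python
import numpy as np
from itertools import combinations

# Numerical check of Eq. (15): E[grad_I] = full gradient.
rng = np.random.default_rng(6)
B, U, C, m = 8, 2, 4, 2
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
y = rng.standard_normal(B) + 1j * rng.standard_normal(B)
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)

H_locals, y_locals = np.split(H, C, axis=0), np.split(y, C)
g = [-Hc.conj().T @ (yc - Hc @ x) for Hc, yc in zip(H_locals, y_locals)]  # (10)

batches = list(combinations(range(C), m))                     # all size-m subsets
mb_grads = [(C / m) * sum(g[c] for c in batch) for batch in batches]  # (14)
avg_grad = sum(mb_grads) / len(batches)                       # E over uniform I
full_grad = -H.conj().T @ (y - H @ x)
```

The identity holds because each DU index appears in a fraction $m/C$ of the subsets, which the scaling factor $C/m$ cancels exactly.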

Remark 1

As shown in (14), only a portion of all DUs is needed for gradient computations and information exchange with the CU per mini-batch GD iteration, thus providing significant design flexibility for balancing performance and complexity by adjusting the mini-batch size. Moreover, simulation results reveal that employing mini-batch-based gradients with a sufficiently large mini-batch size (e.g., $m=C/2$) leads to virtually identical performance compared to using full gradients. Hence, the proposed approach enables substantial savings in computational resources and interconnection bandwidth.

With the proposed mini-batch-based gradient updating rule in mind, we next introduce the Mini-NAG-MCMC algorithm for decentralized MIMO detection. For ease of presentation, in the remainder of this paper we consider the star topology displayed in Fig. 2 unless noted otherwise. We assume that a total of $S$ samples are generated to perform the final inference for MIMO detection. The generation of each sample $\mathbf{x}_{t}$ is divided into two stages, detailed in the following.

III-B1 Mini-Batch GD Stage

In the mini-batch GD stage, we select NAG as the GD component owing to its convergence rate, which is optimal among first-order gradient-based optimization algorithms [28]. NAG incorporates momentum in the update rule to expedite and stabilize convergence. To integrate the NAG method into the DBP architecture, in each GD iteration $k$, the momentum is incorporated at the CU to derive the intermediate point $\boldsymbol{p}_{k}$ as

\boldsymbol{p}_{k}=\mathbf{z}_{k-1}+\rho_{k}\Delta\mathbf{z}_{k-1}, (16)

where $\Delta\mathbf{z}_{k-1}=\mathbf{z}_{k-1}-\mathbf{z}_{k-2}$ is the previous update stored at the CU ($\Delta\mathbf{z}_{0}=\mathbf{0}$), and the momentum factor $\rho_{k}\in[0,1]$ determines the proportion of the previous update that is added to the current update.

The intermediate point $\boldsymbol{p}_{k}$ is then broadcast to the DUs belonging to the randomly chosen mini-batch $\mathcal{I}$. Upon receiving the broadcast, each DU $c\in\mathcal{I}$ computes the local gradient $\nabla f_{c}(\boldsymbol{p}_{k})$ in parallel based on (10). These local gradients are then uploaded and aggregated at the CU to perform the descent from $\boldsymbol{p}_{k}$ and update the momentum:

\mathbf{z}_{k}=\boldsymbol{p}_{k}-\tau\frac{C}{m}\sum_{c\in\mathcal{I}}\nabla f_{c}(\boldsymbol{p}_{k}), (17)
\Delta\mathbf{z}_{k}=\mathbf{z}_{k}-\mathbf{z}_{k-1}. (18)

The NAG procedure is iterated until a preset number of iterations $N_{\rm g}$ is reached. The output of NAG, $\mathbf{z}_{N_{\rm g}}$, is then used as the input for the subsequent MCMC sampling.
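The recursion (16)-(18) at the CU can be sketched as below. For clarity the full gradient is used (a mini-batch would replace `grad(p)` by the scaled sum over $\mathcal{I}$); the sizes, seed, noiseless setup, and the particular momentum schedule are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

# NAG recursion (16)-(18) on f(x) = ||y - Hx||^2 / 2.
rng = np.random.default_rng(7)
B, U, Ng = 16, 4, 30
H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.standard_normal(U) + 1j * rng.standard_normal(U)
y = H @ x_true

grad = lambda v: -H.conj().T @ (y - H @ v)
tau = 1.0 / np.linalg.norm(H.conj().T @ H, "fro")  # learning rate as in (21)

z = np.zeros(U, dtype=complex)
dz = np.zeros(U, dtype=complex)  # Delta z_0 = 0
for k in range(1, Ng + 1):
    rho = (k - 1) / (k + 2)      # a standard Nesterov momentum schedule (illustrative)
    p = z + rho * dz             # (16): intermediate (look-ahead) point
    z_new = p - tau * grad(p)    # (17): descent from p
    dz = z_new - z               # (18): momentum update
    z = z_new
```

In the decentralized algorithm, (16) and (18) run at the CU while the gradient evaluation inside (17) is farmed out to the selected DUs.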

One critical parameter that requires careful selection in the NAG is the learning rate $\tau$. From [28], a general learning rate choice for an $L$-Lipschitz smooth objective function $f(\cdot)$ is

0<\tau\leq\frac{1}{L}, (19)

where $L$ is the Lipschitz constant, and choosing the upper bound $\tau=\frac{1}{L}$ is beneficial for convergence. For the objective function $f(\cdot)$ in (12), we have

L=\lambda_{\max}(\mathbf{G}), (20)

where $\mathbf{G}=\mathbf{H}^{\rm H}\mathbf{H}$ is the global Gram matrix and $\lambda_{\max}(\cdot)$ takes the largest eigenvalue of $\mathbf{G}$ [27, 33]. To avoid the computationally expensive eigenvalue decomposition, it is recommended in [27] to replace $\lambda_{\max}(\mathbf{G})$ with its upper bound $\|\mathbf{G}\|_{F}$ and set the learning rate as

\tau=\frac{1}{\|\mathbf{G}\|_{F}}. (21)

A straightforward way to derive $\tau$ in (21) under the DBP architecture is to accumulate the local Gram matrices from all DUs, forming the global Gram matrix at the CU for calculating (21). Specifically, during the preprocessing period, each DU $c$ computes the local Gram matrix

\mathbf{G}_{c}=\mathbf{H}_{c}^{\rm H}\mathbf{H}_{c},\;c\in[C], (22)

which is then collected at the CU to derive the global Gram matrix $\mathbf{G}=\sum_{c=1}^{C}\mathbf{G}_{c}$.

Under the so-called favorable propagation in massive MIMO systems [34], the computation of $\tau$ can be further simplified. In such cases, we have

\frac{1}{B}\mathbf{h}_{i}^{\rm H}\mathbf{h}_{j}\to 0,\;i\neq j,\;i,j\in[U],\quad B\to\infty, (23)

where $\mathbf{h}_{i}$ denotes the $i$-th column vector of $\mathbf{H}$. This equation indicates that, for the MIMO uplink with a large number of BS antennas, the Gram matrix $\mathbf{G}$ is diagonally dominant and can be approximated by its diagonal counterpart $\mathbf{D}$ given as

\mathbf{D}={\rm diag}(\mathbf{G})={\rm diag}\left(\sum_{c=1}^{C}\mathbf{G}_{c}\right)=\sum_{c=1}^{C}\mathbf{D}_{c}, (24)

where 𝐃cU×U\mathbf{D}_{c}\in\mathbb{R}^{U\times U} is the diagonal counterpart of the local Gram matrix 𝐆c\mathbf{G}_{c} and can be calculated using the column vector 𝐡u,c,u[U]\mathbf{h}_{u,c},u\in[U] of 𝐇c\mathbf{H}_{c}:

(𝐃c)ij={𝐡u,c2, when i=j=u0, otherwise .(\mathbf{D}_{c})_{ij}=\left\{\begin{array}[]{ll}\|\mathbf{h}_{u,c}\|^{2},&\text{ when }i=j=u\\ 0,&\text{ otherwise }\end{array}\right.. (25)

Finally, the learning rate τ\tau in (21) can be approximated as

τ1𝐃F=1i=1U(𝐃)ii2.\tau\approx\frac{1}{\|\mathbf{D}\|_{F}}=\frac{1}{\sqrt{\sum_{i=1}^{U}(\mathbf{D})_{ii}^{2}}}. (26)
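The decentralized learning-rate computation in (24)-(26) can be sketched as follows; this is a minimal NumPy illustration with arbitrary dimensions, in which each DU forms only the diagonal of its local Gram matrix at O(B_cU) cost and the CU accumulates the diagonals. The exact rule (21) is computed alongside for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
B, U, C = 128, 8, 8
B_c = B // C
# Local channel matrices H_c at each DU (IID Rayleigh entries with variance 1/B).
H_local = [(rng.standard_normal((B_c, U)) + 1j * rng.standard_normal((B_c, U)))
           / np.sqrt(2 * B) for _ in range(C)]

# Each DU c computes only the diagonal of its local Gram matrix, eq. (25):
# (D_c)_uu = ||h_{u,c}||^2, avoiding the full O(B_c U^2) product in (22).
D_local = [np.sum(np.abs(Hc) ** 2, axis=0) for Hc in H_local]  # length-U real vectors

# The CU accumulates the diagonals, eq. (24), and sets the learning rate, eq. (26).
D = sum(D_local)
tau_approx = 1.0 / np.sqrt(np.sum(D ** 2))

# For comparison: the exact rule (21) built from the full global Gram matrix.
H = np.vstack(H_local)
G = H.conj().T @ H
tau_exact = 1.0 / np.linalg.norm(G, "fro")
```

Since ‖D‖_F ≤ ‖G‖_F always holds, the approximation never understates the learning rate, and under favorable propagation (large B) the two values are close.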
Remark 2

With the approximation 𝐆𝐃\mathbf{G}\approx\mathbf{D}, the 𝒪(BcU2)\mathcal{O}(B_{c}U^{2}) complexity of calculating (22) at each DU is reduced to 𝒪(BcU)\mathcal{O}(B_{c}U) for calculating (25). Meanwhile, only the UU real-valued diagonal elements of 𝐃c\mathbf{D}_{c} need to be exchanged and accumulated, thereby alleviating the high interconnection bandwidth otherwise required to convey the entire Gram matrix.

III-B2 MCMC Sampling Stage

After the GD procedure has sufficiently refined the estimate, the candidate sample is generated at the CU by adding a random perturbation to 𝐳Ng{\mathbf{z}}_{N_{\rm g}} and mapping the perturbed estimate onto the QAM lattice using (5). The random Gaussian vector 𝐝\mathbf{d} in (5) is given by

𝐝=γ𝐌c𝐰,\mathbf{d}=\gamma\mathbf{M}_{\rm c}\mathbf{w}, (27)

where γ\gamma is the magnitude of the random perturbation, i.e., step size of the random walk, 𝐌cU×U\mathbf{M}_{\rm c}\in\mathbb{C}^{U\times U} is the Cholesky factor specifying the covariance of the random perturbation as 𝐌c𝐌cH\mathbf{M}_{\rm c}\mathbf{M}_{\rm c}^{\rm H}, and 𝐰\mathbf{w} follows 𝒞𝒩(𝟎,𝐈)\mathcal{CN}(\mathbf{0},\mathbf{I}). The step size γ\gamma is selected to enable sufficient ability to escape local minima while maintaining a reasonable acceptance probability [35]. Moreover, according to MCMC literature [36, 37], 𝐌c\mathbf{M}_{\rm c} can be set to specify a covariance proportional to the inverse of the Hessian matrix of the objective function. However, this setting is computationally expensive. Experimental observations indicate that using an identity matrix 𝐈\mathbf{I} for 𝐌c\mathbf{M}_{\rm c} causes almost no performance degradation compared to the former setting, while substantially reducing the computational cost.
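For concreteness, the candidate generation via (5) and (27) with 𝐌_c = 𝐈 can be sketched as below. The per-coordinate nearest-neighbor quantization is our assumed form of the QAM mapping in (5), and the normalized 16-QAM levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
U, gamma = 8, 0.05
# Per-dimension 16-QAM amplitude levels, normalized to unit average symbol power.
pam = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(10)

def nearest_qam(v):
    # Assumed form of the mapping in (5): quantize real and imaginary parts
    # of each coordinate to the nearest PAM level.
    re = pam[np.argmin(np.abs(v.real[:, None] - pam[None, :]), axis=1)]
    im = pam[np.argmin(np.abs(v.imag[:, None] - pam[None, :]), axis=1)]
    return re + 1j * im

z = rng.standard_normal(U) + 1j * rng.standard_normal(U)   # stands in for z_{Ng}
w = (rng.standard_normal(U) + 1j * rng.standard_normal(U)) / np.sqrt(2)  # w ~ CN(0, I)
d = gamma * w                                              # eq. (27) with M_c = I
x_cand = nearest_qam(z + d)                                # candidate on the QAM lattice
```

The step size γ controls how far the perturbed point can stray from the gradient-refined estimate before being snapped back onto the constellation.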

Subsequently, the CU performs the MH acceptance test, where the candidate sample 𝐱[cand]\mathbf{x}^{[\rm cand]} is accepted with a probability computed according to (6). This probability can be alternatively expressed as

α\displaystyle\alpha =min{1,exp(2f(𝐱t1)2f(𝐱[cand]))}.\displaystyle=\min\left\{1,\exp\left(2f(\mathbf{x}_{t-1})-2f(\mathbf{x}^{[\mathrm{cand}]})\right)\right\}. (28)

Evaluation of the global objective function f()f(\cdot) requires knowledge of 𝐇\mathbf{H} and 𝐲\mathbf{y}, which are not directly accessible to the CU. Instead, we evaluate the local objective functions at the DUs and then aggregate them at the CU. Specifically, for the acceptance test of the tt-th sampling iteration, the CU first broadcasts the candidate sample 𝐱[cand]\mathbf{x}^{\rm[cand]} to all the DUs. Then, each DU independently evaluates the local objective function in (9) and uploads the value fc(𝐱[cand])f_{c}(\mathbf{x}^{\rm[cand]}) to the CU for aggregation to derive

f(𝐱[cand])=c=1Cfc(𝐱[cand]).f(\mathbf{x}^{\rm[cand]})=\sum_{c=1}^{C}f_{c}(\mathbf{x}^{\rm[cand]}). (29)

This global objective function value is subsequently used to compute the acceptance probability α\alpha in (28). The value f(𝐱[cand])f(\mathbf{x}^{\rm[cand]}) is then stored or discarded according to whether the candidate sample is accepted or rejected:

f(𝐱t)={f(𝐱[cand]), when 𝐱[cand] is acceptedf(𝐱t1), otherwise .f(\mathbf{x}_{t})=\left\{\begin{array}[]{ll}f(\mathbf{x}^{\rm[cand]}),&\text{ when }\mathbf{x}^{\rm[cand]}\text{ is accepted}\\ f(\mathbf{x}_{t-1}),&\text{ otherwise }\end{array}\right.. (30)

Finally, these global objective function values {f(𝐱t)}t=0S\{f(\mathbf{x}_{t})\}_{t=0}^{S} for the drawn samples are used for generating the final estimate.
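The decentralized acceptance test of (28)-(30) can be sketched as follows; a minimal NumPy illustration on a toy noiseless system, assuming the local objective (9) has the least-squares form f_c(x) = ½‖y_c − H_c x‖² and using an illustrative QPSK alphabet.

```python
import numpy as np

rng = np.random.default_rng(2)
B, U, C = 32, 4, 4
B_c = B // C
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # illustrative alphabet

H = [rng.standard_normal((B_c, U)) + 1j * rng.standard_normal((B_c, U)) for _ in range(C)]
x_true = rng.choice(qpsk, size=U)
y = [Hc @ x_true for Hc in H]              # noiseless received signal at each DU

def local_obj(c, x):
    # Assumed form of (9): local objective f_c(x) = 0.5 * ||y_c - H_c x||^2 at DU c.
    return 0.5 * np.linalg.norm(y[c] - H[c] @ x) ** 2

def accept_prob(f_prev, f_cand):
    # MH acceptance probability, eq. (28). Clipping the exponent at 0 is
    # equivalent to min{1, e^d} and avoids overflow when f_cand < f_prev.
    return float(np.exp(min(0.0, 2.0 * f_prev - 2.0 * f_cand)))

x_prev = rng.choice(qpsk, size=U)          # current Markov-chain state
x_cand = x_true                            # candidate broadcast by the CU

# Each DU uploads its local value; the CU aggregates them, eq. (29).
f_prev = sum(local_obj(c, x_prev) for c in range(C))
f_cand = sum(local_obj(c, x_cand) for c in range(C))

alpha = accept_prob(f_prev, f_cand)
nu = rng.uniform()
x_t, f_t = (x_cand, f_cand) if alpha >= nu else (x_prev, f_prev)  # eq. (30)
```

Because the candidate here attains a lower objective value than the previous state, the acceptance probability is exactly one and the candidate is always accepted.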

Input: 𝐇c\mathbf{H}_{c}, 𝐲c\mathbf{y}_{c}, c[C]c\in[C], mini-batch size mm, momentum factor ρk\rho_{k}, random walk step size γ\gamma and covariance 𝐌c𝐌cH\mathbf{M}_{\rm c}\mathbf{M}_{\rm c}^{\rm H}, number of NAG iterations NgN_{\rm g}, number of sampling iterations SS.
Preprocessing:
Each DU c[C]c\in[C]:
  Calculate 𝐃c\mathbf{D}_{c} via (25) and upload its diagonal to the CU.
CU:
  Initial sample 𝐱0\mathbf{x}_{0} and f(𝐱0)f(\mathbf{x}_{0}). Calculate the learning rate τ\tau via
  (24) and (26).
Core: for t=1t=1 to SS do
       Mini-batch GD stage:
       // Perform NgN_{\rm g} NAG iterations with 𝐳0=𝐱t1{\mathbf{z}}_{0}=\mathbf{x}_{t-1}.
       for k=1k=1 to NgN_{\rm g} do
             CU:
              Add momentum to derive 𝒑k\boldsymbol{p}_{k} via (16);
              Randomly select a mini-batch \mathcal{I} and broadcast 𝒑k\boldsymbol{p}_{k} to the
              DUs within \mathcal{I}.
             Each DU cc\in\mathcal{I}:
              Compute local gradients fc(𝒑k)\nabla f_{c}(\boldsymbol{p}_{k}) via (10) and upload it
              to the CU.
             CU:
              Aggregate fc(𝒑k)\nabla f_{c}(\boldsymbol{p}_{k}) to update 𝐳k\mathbf{z}_{k} and Δ𝐳k\Delta\mathbf{z}_{k} via (17) and
              (18).
      
      MCMC sampling stage:
             CU:
              Derive 𝐱[cand]\mathbf{x}^{[\rm cand]} via (5) and (27);
              Broadcast 𝐱[cand]\mathbf{x}^{[\rm cand]} to all the DUs.
             Each DU c[C]c\in[C]:
              Calculate fc(𝐱[cand])f_{c}(\mathbf{x}^{[\rm cand]}) via (9) and upload it to the CU.
             CU:
              Aggregate fc(𝐱[cand])f_{c}(\mathbf{x}^{[\rm cand]}) to derive f(𝐱[cand])f(\mathbf{x}^{[\rm cand]}) via (29);
              Calculate α\alpha via (28) and ν𝒰(0,1)\nu\sim\mathcal{U}(0,1);
              If αν\alpha\geq\nu, set 𝐱t=𝐱[cand]\mathbf{x}_{t}=\mathbf{x}^{[\rm cand]}; otherwise, set 𝐱t=𝐱t1\mathbf{x}_{t}=\mathbf{x}_{t-1};
              If αν\alpha\geq\nu, set f(𝐱t)=f(𝐱[cand])f(\mathbf{x}_{t})=f(\mathbf{x}^{[\rm cand]}); otherwise, set
            f(𝐱t)=f(𝐱t1)f(\mathbf{x}_{t})=f(\mathbf{x}_{t-1}).
      
Output: 𝐱^=argmin𝐱𝒳f(𝐱)\hat{\mathbf{x}}=\underset{\mathbf{x}\in\mathcal{X}}{\arg\min}\;f(\mathbf{x}), where 𝒳={𝐱t}t=0S\mathcal{X}=\{\mathbf{x}_{t}\}_{t=0}^{S}.
Algorithm 1 Mini-NAG-MCMC

The above two stages are alternately executed SS times for the Markov chain to converge to the steady state and to draw a sufficient number of samples that follow the posterior distribution, forming the sample list 𝒳\mathcal{X} for statistical inference at the CU. Moreover, parallel samplers can be deployed to reduce sampling latency and correlation. For hard decisions, the final estimate 𝐱^\hat{\mathbf{x}} is obtained by selecting the sample from 𝒳\mathcal{X} that minimizes the objective function f()f(\cdot). Details of the Mini-NAG-MCMC algorithm are summarized in Algorithm 1. In the Appendix, we provide a proof of convergence of the proposed algorithm in a simplified setting.

Remark 3

It is worth mentioning that the mini-batch gradient-based MCMC method is also applicable to the daisy-chain topology. The difference lies in the exchange of information among processing units. Without loss of generality, we assume DU CC is within the selected mini-batch in both the star and daisy-chain topologies. For the daisy-chain topology, we assume the DUs are connected in the same order as shown in Fig. 1. Taking the information upload stage as an example, DU CC in the star topology directly uploads its local gradient to the CU, as shown in Fig. 2. By contrast, DU CC in the daisy-chain topology must accumulate its own gradient with those from the preceding DUs within the mini-batch to derive f()cfc()\nabla f_{\mathcal{I}}(\cdot)\propto\sum_{c\in\mathcal{I}}\nabla f_{c}(\cdot) and then upload the result to the CU. This sequential information exchange results in increased data transfer latency.

Remark 4

Note that soft outputs can also be generated based on the sample list, as shown in [20, 21, 27]. The detailed derivations are beyond the scope of this paper and thus omitted.

Remark 5

Although the use of multiple computing units in decentralized detection inevitably results in additional energy consumption, the proposed decentralized detector allows for energy-efficient implementation by significantly reducing the computation of each DU. This enables DUs to operate in a more energy-efficient region with a lower computational density compared to the CU in the centralized scheme. Moreover, the mini-batch-based gradient updating rule can further increase energy efficiency by involving only a subset of DUs in gradient computations per iteration while keeping others inactive.

IV Complexity Analysis

In this section, we analyze the complexity of the proposed Mini-NAG-MCMC from two aspects: computational complexity and interconnection bandwidth.

TABLE I: Computational Complexity of Centralized and Decentralized Detection Algorithms
Decentralized algorithms (complexity at each DU / at the CU):
  Mini-NAG-MCMC: 𝒪(BcU(NgS+M))\mathcal{O}\big{(}B_{c}U(N_{\rm g}S+\sqrt{M})\big{)} / 𝒪(SU(Ng+M))\mathcal{O}\big{(}SU(N_{\rm g}+\sqrt{M})\big{)}
  DeADMM (BcUB_{c}\geq U): 𝒪(TU2+U3+BcU2)\mathcal{O}\big{(}TU^{2}+U^{3}+B_{c}U^{2}\big{)} / 𝒪(TU)\mathcal{O}(TU)
  DeADMM (Bc<UB_{c}<U): 𝒪(T(Bc2+BcU)+Bc3)\mathcal{O}\big{(}T(B_{c}^{2}+B_{c}U)+B_{c}^{3}\big{)} / 𝒪(TU)\mathcal{O}(TU)
  DeLAMA: 𝒪(T(U2+MU)+BcU2)\mathcal{O}\big{(}T(U^{2}+\sqrt{M}U)+B_{c}U^{2}\big{)} / 𝒪(CU)\mathcal{O}(CU)
  DeEP: 𝒪(TU3+BcU2)\mathcal{O}(TU^{3}+B_{c}U^{2}) / 𝒪(TU(M+C))\mathcal{O}\big{(}TU(\sqrt{M}+C)\big{)}
  DeNewton: 𝒪(TBcU+BcU2)\mathcal{O}(TB_{c}U+B_{c}U^{2}) / 𝒪(TU2+U3)\mathcal{O}(TU^{2}+U^{3})
Centralized algorithms (total complexity):
  LMMSE: 𝒪(U3+BU2)\mathcal{O}(U^{3}+BU^{2})
  KK-best: 𝒪(BU2+KMU2)\mathcal{O}(BU^{2}+K\sqrt{M}U^{2})
  NAG-MCMC: 𝒪(S(NgU2+MU+B)+BU2+MBU)\mathcal{O}\big{(}S(N_{\rm g}U^{2}+\sqrt{M}U+B)+BU^{2}+\sqrt{M}BU\big{)}
  • Note: TT is the number of iterations of the iterative detection schemes; KK is the list size of the KK-best detector; 𝐌c=𝐈\mathbf{M}_{\rm c}=\mathbf{I} is used for NAG-MCMC and Mini-NAG-MCMC.

IV-A Computational Complexity

Table I presents the computational complexity of Mini-NAG-MCMC compared with various centralized and decentralized detection algorithms. The decentralized baselines include the decentralized ADMM (DeADMM) [6], LAMA (DeLAMA) [12] (we consider the fully decentralized version of the LAMA detector proposed in [12]), EP (DeEP) [14], and Newton (DeNewton) [9] detectors. Centralized LMMSE [38], KK-best sphere decoding [40], and NAG-MCMC [27] algorithms are also compared. For the decentralized schemes, operations at the CU and at each DU are counted separately. We present the count for only a single DU to reflect the parallel processing of the DBP architecture. We remark that although this count is lower than the total computational cost across all local processing units, it reflects the computation latency of the algorithm.

In the comparison, the complexity of a U×UU\times U matrix inversion is 𝒪(U3)\mathcal{O}(U^{3}), where 𝒪()\mathcal{O}(\cdot) denotes the standard asymptotic notation [39, Ch. 3.1] to indicate the order of complexity. The complexity of Mini-NAG-MCMC is dominated by gradient computations and objective function evaluation, costing 𝒪(NgSBcU)\mathcal{O}(N_{\rm g}SB_{c}U) and 𝒪(MBcU)\mathcal{O}(\sqrt{M}B_{c}U) at each DU, respectively. The other operations in the MCMC sampling stage, such as the random walk, QAM mapping, and acceptance test, incur a modest cost of 𝒪(SU(Ng+M))\mathcal{O}\big{(}SU(N_{\rm g}+\sqrt{M})\big{)} multiplications at the CU. The comparison in Table I shows that Mini-NAG-MCMC is highly efficient due to avoiding high-complexity matrix multiplications and inversions.

Table I also reveals that the proposed Mini-NAG-MCMC effectively relieves the computation burden at the CU compared to the centralized algorithms by shifting the computationally intensive gradient computations to the DUs. As a result, the computational complexity at each processing unit is independent of the number of BS antennas BB, which is beneficial for reducing computation latency. Furthermore, the proposed Mini-NAG-MCMC has the advantage that only a portion of the DUs is required to perform computations in each mini-batch GD iteration of the algorithm. Thus, the total computational cost is substantially reduced compared with other decentralized schemes.

Note that the number of iterations also has a significant impact on the complexity of the proposed method and other iterative detection schemes in Table I. However, it is practically difficult to provide an analytical estimate of the number of iterations required to reach a certain performance level. Hence, we resort to empirically evaluating the computational complexity of the detectors listed in Table I and meanwhile taking into account the achieved performance. Detailed results are shown in Sec. V.

IV-B Interconnection Bandwidth

The interconnection bandwidth is measured by the number of bits transferred from and to the CU within a symbol period. Denote by ω\omega the bit-width of each real value (real/imaginary parts of a complex scalar number) in the implementation. Note that the representation of each QAM symbol in the derived samples only requires log2M\log_{2}M bits, which can be significantly smaller than ω\omega. For centralized schemes, the entities required to be uploaded to the CU include the full channel matrix 𝐇\mathbf{H} and received signal 𝐲\mathbf{y}, resulting in the bandwidth

Ccentralized=2(BU+B)ω.C_{\text{centralized}}=2(BU+B)\omega. (31)

For the star topology-based Mini-NAG-MCMC, the CU receives UCUC and 2NgSUm+(S+1)C2N_{\rm g}SUm+(S+1)C real values during preprocessing and sampling iterations, respectively. The CU also broadcasts 2NgSUm2N_{\rm g}SUm continuous-valued real numbers and (S+1)UC(S+1)UC QAM symbols during the sampling. For the daisy-chain topology-based Mini-NAG-MCMC, the amount of data to be transferred is identical to that in the star topology. However, the CU communicates with only one directly connected DU, thereby significantly reducing the interconnection cost. Hence, the interconnection bandwidths of Mini-NAG-MCMC with the star and daisy-chain topologies are given by

CMinistar=\displaystyle C_{\text{Mini}}^{\text{star}}= 4NgSUωm+(S+1)Ulog2MC\displaystyle 4N_{\rm g}SU\omega\cdot m+(S+1)U\log_{2}M\cdot C
+(S+1+U)ωC,\displaystyle+(S+1+U)\omega\cdot C, (32)
CMinichain=\displaystyle C_{\text{Mini}}^{\text{chain}}= 4NgSUω+(S+1)Ulog2M+(S+1+U)ω.\displaystyle 4N_{\rm g}SU\omega+(S+1)U\log_{2}M+(S+1+U)\omega. (33)

The results of both topologies do not depend on the BS antenna count BB, a notable advantage over centralized schemes.
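The bandwidth expressions (31)-(33) are straightforward to evaluate numerically. The sketch below uses the settings of Fig. 9 (B = 256, U = 8, C = 8, m = 2, S = 4, ω = 16, 16-QAM) together with N_g = 4 as an assumed iteration count, and also verifies that the decentralized costs do not depend on B.

```python
import math

def bw_centralized(B, U, w):
    # Eq. (31): full H (B*U complex) and y (B complex) uploaded, 2 reals each.
    return 2 * (B * U + B) * w

def bw_mini_star(U, C, m, S, Ng, w, M):
    # Eq. (32): star topology. Note the expression is independent of B.
    return (4 * Ng * S * U * w * m
            + (S + 1) * U * math.log2(M) * C
            + (S + 1 + U) * w * C)

def bw_mini_chain(U, C, m, S, Ng, w, M):
    # Eq. (33): daisy-chain topology, counting only the single link to the CU.
    return (4 * Ng * S * U * w
            + (S + 1) * U * math.log2(M)
            + (S + 1 + U) * w)

U, C, m, S, Ng, w, M = 8, 8, 2, 4, 4, 16, 16
cent = bw_centralized(256, U, w)
star = bw_mini_star(U, C, m, S, Ng, w, M)
chain = bw_mini_chain(U, C, m, S, Ng, w, M)
```

Under these settings the daisy-chain cost is the lowest, the star cost sits in between, and the centralized cost, which grows linearly in B, is the largest.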

We note that if the interconnection cost on all links in the daisy-chain topology is considered, the total interconnection bandwidth would be comparable to that of the star topology. However, the daisy-chain architecture can uniformly allocate the bandwidth demand among all links, resulting in a significant reduction in the individual requirement on each link [7]. Nevertheless, the price to pay is an extended processing latency resulting from the sequential processing architecture.

V Simulation Results

In this section, we present numerical results to evaluate the proposed Mini-NAG-MCMC in terms of error rate performance, computational complexity, and interconnection cost.

V-A Simulation Setup

Both IID Rayleigh fading and realistic 3GPP channel models are considered in the simulation. The Rayleigh fading MIMO channel has elements that follow the Gaussian distribution 𝒞𝒩(0,1/B)\mathcal{CN}(0,1/B). For realistic channels, the 3D channel model outlined in the 3GPP 36.873 technical report [41] is utilized. This model has been widely adopted for its accurate representation of real-world channel environments. We consider both uncoded and coded MIMO systems. For uncoded systems, we evaluate the bit error rate (BER) performance until 5×1075\times 10^{7} bits are transmitted or the number of bit errors exceeds 1000. An independent channel realization is drawn for each symbol transmission. For coded systems, we employ a rate-3/43/4 low-density parity-check (LDPC) code and the OFDM scheme, wherein each user’s coded bits are converted to QAM symbols and mapped onto 216 subcarriers for data transmission. The detector calculates soft outputs using the max-log approximation [20]. Belief propagation decoding is then utilized with a maximum of 10 iterations. The block error rate (BLER) is evaluated until 10510^{5} coded blocks are transmitted with independent channel realizations.

We compare our method with the centralized LMMSE and NAG-MCMC [27] schemes and representative decentralized detector baselines as listed in Table I. The optimal ML [3] and near-optimal KK-best [40] detectors are used as performance benchmarks. For parameter settings of NAG-MCMC and the proposed Mini-NAG-MCMC, the number of NAG iterations is set to Ng=4N_{\rm g}=4. The choice of momentum factor ρk\rho_{k} follows the recommended setting in [28]:

μ0=1,μk=1+1+4μk122,ρk=μk11μk,k[Ng].\mu_{0}=1,\;\mu_{k}=\frac{1+\sqrt{1+4\mu_{k-1}^{2}}}{2},\;\rho_{k}=\frac{\mu_{k-1}-1}{\mu_{k}},\;k\in[N_{\rm g}]. (34)
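The schedule in (34) is simple to generate; the helper below is a minimal sketch returning ρ_1, …, ρ_{N_g} from the stated recursion.

```python
import math

def momentum_schedule(n_g):
    """Momentum factors rho_k, k = 1..n_g, from the recursion in eq. (34)."""
    mu_prev, rhos = 1.0, []
    for _ in range(n_g):
        mu = (1.0 + math.sqrt(1.0 + 4.0 * mu_prev ** 2)) / 2.0
        rhos.append((mu_prev - 1.0) / mu)
        mu_prev = mu
    return rhos

rhos = momentum_schedule(4)
```

With N_g = 4 this yields ρ_1 = 0 followed by monotonically increasing factors below 1, so the momentum ramps up over the NAG iterations.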

The random walk step size is empirically set to γ=0.05\gamma=0.05. To ensure computational efficiency, we set 𝐌c=𝐈\mathbf{M}_{\rm c}=\mathbf{I} in (27). The initial sample 𝐱0\mathbf{x}_{0} for both NAG-MCMC and Mini-NAG-MCMC is randomly selected from 𝒜U×1\mathcal{A}^{U\times 1}.

V-B Convergence Analysis

Refer to caption
(a) Under different mini-batch sizes mm (C=32C=32)
Refer to caption
(b) Under different number of clusters CC (m=C/2m=C/2)
Figure 3: Convergence performance for a MIMO system configuration of B=128B=128 and U=8U=8 or 3232 with 16-QAM modulation and SNR=5dB{\text{SNR}=\text{5}\;\text{dB}} under IID Rayleigh fading channels.

Fig. 3 presents the convergence performance of Mini-NAG-MCMC in terms of the BER versus the number of sampling iterations. (A rigorous theoretical convergence analysis is challenging for gradient-based MCMC methods in discrete search spaces [22], [26]; we therefore evaluate the convergence property empirically via numerical simulations to provide insights.) The MIMO system is configured with B=128B=128 and U=8U=8 or 3232, employs 16-QAM modulation, and operates under IID Rayleigh fading channels.

Fig. 3(a) shows the convergence behavior under different choices of mini-batch sizes. We also provide the convergence performance of the centralized NAG-MCMC algorithm as a reference. The results indicate that Mini-NAG-MCMC converges faster as the mini-batch size mm increases in both U=8U=8 and U=32U=32 cases. This result is in accordance with expectation because the variance of the mini-batch gradient approximation in (14) is a decreasing function of mm [42]. Meanwhile, the convergence speed is virtually unaffected by the number of users, revealing the robustness of Mini-NAG-MCMC to varying user loads.

Remarkably, the Mini-NAG-MCMC with m=16m=16 achieves virtually the same convergence performance as the centralized NAG-MCMC that uses full gradients. This phenomenon can be attributed to the accuracy of the mini-batch approximation when mm is sufficiently large (e.g., m=C/2m=C/2 in this simulation) and to the synergistic effect of stochastic gradients and MCMC in escaping local minima. Consequently, the proposed method achieves performance equivalent to the centralized counterpart that computes full gradients, while offering the benefit of reduced computational cost since only half of the DUs are utilized for computations per iteration.

Fig. 3(b) compares the convergence performance under different numbers of clusters, with the mini-batch size fixed at m=C/2m=C/2. The figure reveals that the convergence speed shows minimal variation with respect to the number of clusters, which is a notable advantage of the proposed Mini-NAG-MCMC. Hence, the proposed method can accommodate different levels of decentralization, allowing for trade-offs between the computational complexity at each DU and the number of interconnection interfaces (as well as bandwidth) by flexibly adjusting the system parameter CC according to different requirements.

V-C Uncoded Detection Performance

In this subsection, we investigate the proposed method under Rayleigh fading MIMO channels in uncoded systems. Fig. 4 provides the BER performance and computational complexity for a massive MIMO system with B=128B=128 BS antennas, U=32U=32 users, C=32C=32 clusters, and 16-QAM modulation. The mini-batch size for Mini-NAG-MCMC is m=16m=16. We compare the proposed method with centralized LMMSE [38], KK-best [40], and NAG-MCMC [27] detectors and two representative decentralized baselines, namely the DeADMM [6] and DeEP [14] detectors. (For the implementation of the DeADMM scheme, we use the code released at https://github.com/VIP-Group/DBP.) The list size for the KK-best detector is set to K=4K=4, serving as a near-optimal baseline. The number of iterations for DeADMM and DeEP is set to T=50T=50 and T=5T=5, respectively, as required for their convergence in the considered system setup.

Refer to caption
(a) BER performance
Refer to caption
(b) Computational complexity
Figure 4: BER performance and computational complexity for a MIMO system configuration of B=128B=128, U=32U=32, and C=32C=32 with 16-QAM modulation under Rayleigh fading channels. The bar chart counts the number of real-valued multiplications at the CU and a single DU for decentralized schemes.
Refer to caption
(a) BER performance
Refer to caption
(b) Computational complexity
Figure 5: BER performance and computational complexity for a MIMO system configuration of B=128B=128, U=48U=48, and C=32C=32 with 16-QAM modulation under Rayleigh fading channels.

Fig. 4(a) shows that the LMMSE and DeADMM detectors have a significant performance gap to the ML detector. In contrast, Mini-NAG-MCMC achieves a noticeable performance gain over LMMSE and DeADMM when the number of sampling iterations is S=6S=6 and further approaches the ML performance with a slight increase in SS. Furthermore, Mini-NAG-MCMC obtains comparable BER performance to NAG-MCMC when both methods are equipped with the same SS.

Fig. 4(b) illustrates the complexity bar chart of different detectors in terms of the number of real-valued multiplications required per detection. The number of multiplications at the CU for Mini-NAG-MCMC is much lower than that of the centralized LMMSE, KK-best, and NAG-MCMC detectors. We remark that although the total computational cost of Mini-NAG-MCMC is comparable to that of the centralized LMMSE and NAG-MCMC, the proposed decentralized algorithm offers two advantages: first, the complexity at each DU is notably low, effectively leading to a reduced computation latency; second, the distribution of complexity among many DUs allows the use of inexpensive computing hardware within each processing unit, in contrast to the centralized algorithms that place the entire complexity burden on the CU. Furthermore, the computational cost of Mini-NAG-MCMC is similar to that of DeADMM and substantially lower than that of DeEP.

Considering its near-ML performance and low complexity at each unit, as shown in Fig. 4, our proposed Mini-NAG-MCMC demonstrates high promise in achieving an attractive trade-off between performance and complexity.

Fig. 5 shows the results when the number of users is increased to U=48U=48, with the other system settings remaining unchanged compared to Fig. 4. Fig. 5(a) indicates that the performance of LMMSE, DeADMM, and DeEP further degrades relative to the near-optimal KK-best detector as the user number increases. However, Mini-NAG-MCMC demonstrates significant performance improvement over LMMSE when S=6S=6 and achieves comparable performance to the KK-best when S=12S=12, consistent with the results presented in Fig. 4(a). This finding reflects that the proposed Mini-NAG-MCMC and the associated learning rate approximation are robust to the increased user load. Moreover, upon comparing Fig. 5(b) to Fig. 4(b), it is observed that the complexity increase of Mini-NAG-MCMC is modest compared with the LMMSE, KK-best, and DeEP detectors. Specifically, the complexity increase is 50% for Mini-NAG-MCMC, whereas the corresponding values for LMMSE, KK-best, and DeEP are 165%, 125%, and 230%, respectively, demonstrating the outstanding scalability of Mini-NAG-MCMC as revealed in Sec. IV-A.

Refer to caption
(a) 64-QAM
Refer to caption
(b) 256-QAM
Figure 6: BER performance for a MIMO system configuration of B=128B=128, U=48U=48, and C=32C=32 with 64-QAM/256-QAM modulation under Rayleigh fading channels.

Fig. 6 presents the BER performance under high-order 64-QAM/256-QAM modulation schemes. The other system configurations are the same as in Fig. 5. In this experimental setup, the KK-best detector is configured with K=8K=8 to provide a near-ML baseline. The DeADMM and DeEP detectors execute for T=50T=50 and T=10T=10 iterations, respectively. The proposed Mini-NAG-MCMC is set up with S=24S=24 iterations and a mini-batch size of m=16m=16. The figure illustrates that the proposed method achieves a performance gain of more than 1.5 dB when compared with the LMMSE detector and the decentralized baselines, and approaches the performance of the near-optimal KK-best. These results demonstrate that the proposed decentralized detector scales well to high-order modulations.

Refer to caption
(a) 16-QAM
Refer to caption
(b) 64-QAM
Figure 7: BLER performance for a MIMO system configuration of B=128B=128, U=8U=8, and C=32C=32 with 16-QAM/64-QAM modulation and a rate-3/43/4 LDPC code under 3GPP MIMO channels.

V-D Coded Detection Performance

We further investigate the proposed method under the more realistic 3GPP 3D MIMO channel model [41] with coded transmissions. We examine the urban macrocell non-line-of-sight scenario. For the simulation of this scenario, the QuaDRiGa simulator [43] is employed. In the considered scenario, the BS is equipped with B=128B=128 antennas, which are uniformly partitioned into C=32C=32 clusters, to serve U=8U=8 users. The BS adopts a uniform planar antenna array with half-wavelength antenna spacing. The users are evenly dropped within a 120120^{\circ} cell sector centered around the BS, spanning a radius ranging from 10 m to 500 m. The carrier frequency is set as 2.53 GHz, and the channel spans over 20 MHz bandwidth. To cope with the complicated channel environments, the KK-best detector is set up with K=8K=8. The number of iterations for DeADMM and DeEP is selected as T=50T=50 and T=10T=10, respectively. The number of iterations and mini-batch size for Mini-NAG-MCMC are S=64S=64 and m=16m=16, respectively.

As shown in Fig. 7, Mini-NAG-MCMC demonstrates stable performance, exhibiting significant robustness to the realistic channel model. Specifically, Mini-NAG-MCMC outperforms LMMSE and other decentralized detectors, achieving an approximate 1 dB gain under both 16-QAM and 64-QAM modulation. Furthermore, the gap between Mini-NAG-MCMC and the near-optimal performance achieved under centralized schemes by the KK-best detector is a mere 0.4 dB. Considering the distribution of complexity across DUs, achieving near-optimal performance in this context is indeed promising.

Refer to caption
Figure 8: Performance gap versus interconnection bandwidth of Mini-NAG-MCMC (m=2m=2 and S[2,4,6,8]S\in[2,4,6,8]) for a MIMO system configuration of B=256B=256, U=8U=8, and C=8C=8 with 16-QAM modulation under Rayleigh fading channels.
Refer to caption
Figure 9: Percentage of interconnection bandwidth of the proposed Mini-NAG-MCMC (m=2m=2 and S=4S=4) relative to the centralized algorithm as a function of the number of BS antennas BB. The MIMO system configuration is U=8U=8 and C=8C=8 with 16-QAM modulation under Rayleigh fading channels.

V-E Interconnection Bandwidth Comparison

We evaluate the interconnection cost of the proposed Mini-NAG-MCMC in this subsection. The bit-width is selected as ω=16\omega=16 in the evaluation. The BER performance versus interconnection bandwidth of Mini-NAG-MCMC with star and daisy-chain topologies for a MIMO system with B=256B=256, U=8U=8, and C=8C=8 is presented in Fig. 8. The mini-batch size for Mini-NAG-MCMC is set to m=2m=2. In the figure, the xx-axis represents the percentage of interconnection bandwidth of Mini-NAG-MCMC relative to the centralized algorithm, calculated according to (31)-(33) by setting the number of sampling iterations SS to 2, 4, 6, and 8. The yy-axis represents the performance gap between Mini-NAG-MCMC and the centralized ML algorithm in terms of the SNR required to achieve a target BER=103\text{BER}=10^{-3}. From the figure, we have the following observations:

  • Mini-NAG-MCMC achieves a flexible trade-off between performance and interconnection bandwidth.

  • The daisy-chain topology provides lower interconnection bandwidth than the star topology and is preferable in interconnection bandwidth-constrained scenarios, despite a larger processing delay.

Furthermore, Fig. 9 presents the interconnection bandwidth of Mini-NAG-MCMC compared with the centralized algorithm as a function of the number of BS antennas BB when U=8U=8 and C=8C=8. The mini-batch size and the number of sampling iterations for Mini-NAG-MCMC are m=2m=2 and S=4S=4, respectively. The figure shows that the proposed method with both star and daisy-chain topologies exhibits significantly reduced interconnection cost compared to the centralized processing as BB increases. This result confirms the scalability of the proposed method to massive MIMO systems with a large number of BS antennas.

V-F Trade-off between Performance and Complexity

Refer to caption
Figure 10: Performance-complexity trade-off for a MIMO system configuration of B=128B=128, U=32U=32, and C=4C=4 with 16-QAM modulation under Rayleigh fading channels. For decentralized detectors, the xx-axis represents the cumulative count of real-valued multiplications performed at the CU along with a single DU. The size of the circles represents the interconnection bandwidth. For the compared decentralized baselines, the interconnection bandwidth is calculated by counting the entities that need to be transferred in a similar way as the analysis in Sec. IV-B.

Fig. 10 presents the performance-complexity trade-off of various detectors for a massive MIMO system with B=128B=128, U=32U=32, C=4C=4, 16-QAM modulation, and Rayleigh fading channels. All the detectors are implemented using the star DBP topology. The figure reveals that the proposed Mini-NAG-MCMC (m=2m=2) achieves the most promising results among the considered detectors in the trade-off between BER performance, computational complexity, and interconnection bandwidth. Despite the higher interconnection bandwidth of Mini-NAG-MCMC compared to DeLAMA and DeEP, the gap can be narrowed by deploying the proposed detector using the daisy-chain topology. This result highlights the potential of the proposed method as a highly implementable and desirable solution for massive MIMO detectors under DBP architectures.

VI Conclusion

In this paper, the gradient-based MCMC method has been utilized for developing a novel decentralized detector for massive MIMO systems operating within the DBP architecture. Our approach involves decentralizing the gradient-based MCMC technique, distributing the computationally demanding gradient computations among multiple DUs for parallel processing. These DUs are coordinated by a CU that engages in MCMC sampling to derive a list of important samples for MIMO detection. This design enables a balanced allocation and efficient use of computational resources. We have also incorporated the concept of mini-batch GD in the design of the decentralized detector, leading to reduced computation and interconnection demands across DUs while preserving high performance. Through extensive evaluation, we have demonstrated the clear superiority of the proposed Mini-NAG-MCMC over existing decentralized detectors, notably in terms of BER. Furthermore, a complexity analysis solidifies the proposed method’s edge in effectively balancing performance, computational overhead, and interconnection bandwidth.

Appendix: Convergence of the Proposed Algorithm

We illustrate the proposed algorithm’s convergence in a simplified setting where the algorithm uses $\mathbf{M}_{\rm c}=\mathbf{I}$, as employed in the experiments. Specifically, at each sampling iteration, a candidate state $\mathbf{x}^{\prime}$ is generated as follows:

\mathbf{x}^{\prime}=Q\left(\mathbf{x}-\tau\nabla f_{\mathcal{I}}(\mathbf{x})+\gamma\mathbf{w}\right), \quad (35)
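The candidate draw in (35) can be sketched in a few lines. The snippet below is a minimal illustration for a hypothetical QPSK setting, assuming $Q(\cdot)$ is elementwise nearest-point quantization onto the alphabet $\mathcal{A}$ and the mini-batch gradient uses only the rows of $(\mathbf{y},\mathbf{H})$ indexed by $\mathcal{I}$; the names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # QPSK alphabet

def minibatch_grad(x, y, H, idx):
    """Gradient of f_I(x) = 0.5 * ||y_I - H_I x||^2 over the rows in idx."""
    H_I, y_I = H[idx], y[idx]
    return H_I.conj().T @ (H_I @ x - y_I)

def quantize(v):
    """Q(.): elementwise projection onto the discrete alphabet A."""
    return A[np.argmin(np.abs(v[:, None] - A[None, :]), axis=1)]

def candidate(x, y, H, idx, tau, gamma):
    """One draw of (35): gradient step, Gaussian perturbation, quantization."""
    w = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2)
    return quantize(x - tau * minibatch_grad(x, y, H, idx) + gamma * w)

# Toy usage: B = 16 antennas, U = 4 users, a mini-batch of 8 rows.
H = (rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))) / np.sqrt(2)
x = A[rng.integers(0, 4, size=4)]
y = H @ x + 0.1 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))
idx = rng.choice(16, size=8, replace=False)
x_new = candidate(x, y, H, idx, tau=0.1, gamma=0.3)
print(x_new)
```

By construction, every entry of the candidate lies in $\mathcal{A}$, which is what makes the discrete proposal probability (38) well defined.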

where $\mathbf{x}$ denotes the previous state. By setting the global objective function as $f(\mathbf{x})\triangleq\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}$, we target the tempered posterior distribution given by

\pi(\mathbf{x}) \propto p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})^{1/T_{\rm p}} \propto p(\mathbf{x})\exp\left(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}\right), \quad (36)

where $T_{\rm p}=1/\sigma^{2}$ is the temperature parameter [20, 44]. Assuming that the prior distribution $p(\mathbf{x})$ is equiprobable, the standard MH criterion [31] calculates the acceptance probability of $\mathbf{x}^{\prime}$ as

A(\mathbf{x}^{\prime}|\mathbf{x}) = \min\left\{1,\frac{\pi(\mathbf{x}^{\prime})q(\mathbf{x}|\mathbf{x}^{\prime})}{\pi(\mathbf{x})q(\mathbf{x}^{\prime}|\mathbf{x})}\right\} = \min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{\prime}\|^{2})\,q(\mathbf{x}|\mathbf{x}^{\prime})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2})\,q(\mathbf{x}^{\prime}|\mathbf{x})}\right\}. \quad (37)

However, the proposal probability $q(\cdot|\cdot)$ in (37), which corresponds to (35), is computationally expensive. Specifically, considering the discrete space $\mathcal{A}^{U\times 1}$, the proposal probability $q(\mathbf{x}^{\prime}|\mathbf{x})$ can be expressed as [45]

q(\mathbf{x}^{\prime}|\mathbf{x})=\frac{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right)}{Z_{\mathcal{A}}(\mathbf{x})}\prod_{u=1}^{U}\mathbb{I}_{x_{u}^{\prime}\in\mathcal{A}}, \quad (38)

where $x_{u}^{\prime}$ is the $u$-th element of $\mathbf{x}^{\prime}$, $\mathbb{I}_{x_{u}^{\prime}\in\mathcal{A}}$ denotes an indicator function that returns one if $x_{u}^{\prime}\in\mathcal{A}$ and zero otherwise, and $Z_{\mathcal{A}}(\mathbf{x})$ denotes a normalization constant given by

Z_{\mathcal{A}}(\mathbf{x})=\sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}}\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right), \quad (39)

which is generally intractable because it requires a traversal of the entire space of size $M^{U}$.
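To make this intractability concrete: even a moderate system blows up the sum in (39), while brute force remains possible only at toy scale. The snippet below is purely illustrative; the mean vector `mu` and parameter values are hypothetical stand-ins for $\mathbf{x}-\tau\nabla f_{\mathcal{I}}(\mathbf{x})$ and $\gamma$.

```python
import numpy as np
from itertools import product

# For 16-QAM (M = 16) with U = 32 users, (39) sums over M**U terms:
M, U = 16, 32
print(M ** U)  # 2**128 terms, roughly 3.4e38 -- traversal is infeasible

# Brute-force evaluation of Z_A is only possible at toy scale,
# e.g. BPSK (A = {-1, +1}) with U = 3:
A = np.array([-1.0, 1.0])
mu = np.array([0.2, -0.4, 0.9])   # hypothetical x - tau * grad
gamma = 1.0
Z = sum(np.exp(-np.linalg.norm(np.array(xp) - mu) ** 2 / gamma ** 2)
        for xp in product(A, repeat=3))
print(Z)
```

This is precisely why the proposition below justifies dropping the proposal-probability ratio rather than computing it.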

To address this challenge and simplify computation, we omit the proposal probability ratio $\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}$ in (37), noting that $\nabla f_{\mathcal{I}}(\mathbf{x})\to\mathbf{0}$ in the later iterations of the algorithm as the chain reaches the high-probability region of the state space. Under mild assumptions from [46], this strategy is supported by the following proposition.

Proposition 1

Assume the following:

  1. $\nabla f_{\mathcal{I}}(\mathbf{x})=\mathbf{0}$.

  2. The learning rate $\tau=o(1)$ for large $U$, where $o(\cdot)$ denotes the standard asymptotic notation for higher-order infinitesimals [39, Ch. 3.1].

  3. There exists a $\lambda>0$ such that $\sup_{(\mathbf{x},\mathcal{I})}\|H_{\mathcal{I}}(\mathbf{x})\boldsymbol{z}\|\leq\lambda\|\boldsymbol{z}\|$ for all $\boldsymbol{z}\in\mathbb{C}^{U\times 1}$, where $\sup_{(\mathbf{x},\mathcal{I})}$ denotes the supremum over all possible selections of $(\mathbf{x},\mathcal{I})$, and $H_{\mathcal{I}}(\mathbf{x})$ denotes the Hessian matrix on mini-batch $\mathcal{I}$.

Then, we have

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}=1+o(1)\approx 1. \quad (40)
Proof:

Defining $\mathbf{x}^{\prime}-\mathbf{x}\triangleq\boldsymbol{z}$ and using the assumption $\nabla f_{\mathcal{I}}(\mathbf{x})=\mathbf{0}$, we have

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})} = \frac{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}-\mathbf{x}^{\prime}+\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})} = \frac{\exp\left(-\frac{1}{\gamma^{2}}\|\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})-\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})}. \quad (41)

Note that

\|\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})-\boldsymbol{z}\|^{2} = \tau^{2}\|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2}+\|\boldsymbol{z}\|^{2}-\tau\left(\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})^{\rm H}\boldsymbol{z}+\boldsymbol{z}^{\rm H}\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\right). \quad (42)

Using the Taylor series expansion $\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})=\nabla f_{\mathcal{I}}(\mathbf{x})+H_{\mathcal{I}}(\mathbf{x}_{0})(\mathbf{x}^{\prime}-\mathbf{x})=H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}$ for some $\mathbf{x}_{0}$ between $\mathbf{x}$ and $\mathbf{x}^{\prime}$, and by the assumption $\sup_{(\mathbf{x},\mathcal{I})}\|H_{\mathcal{I}}(\mathbf{x})\boldsymbol{z}\|\leq\lambda\|\boldsymbol{z}\|$, we have

\|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2} = \|H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}\|^{2}\leq\lambda^{2}\|\boldsymbol{z}\|^{2}, \quad (43)

|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})^{\rm H}\boldsymbol{z}| = |\boldsymbol{z}^{\rm H}\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})| \leq \|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|\|\boldsymbol{z}\| = \|H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}\|\|\boldsymbol{z}\| \leq \lambda\|\boldsymbol{z}\|^{2}, \quad (44)

where the first inequality in (44) follows from the Cauchy-Schwarz inequality. Then, by the assumption $\tau=o(1)$ for large $U$, which is approximately satisfied by the learning rate selection in (26), it follows that (42) is of the same order as $(1+\mathcal{O}(\tau))\|\boldsymbol{z}\|^{2}=(1+o(1))\|\boldsymbol{z}\|^{2}$. Therefore, the proposal probability ratio in (41) becomes

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}=\frac{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})}\left(1+o(1)\right)\approx 1, \quad (45)

completing the proof. ∎

After omitting the proposal probability ratio, we derive the modified acceptance probability as follows

A(\mathbf{x}^{\prime}|\mathbf{x})=\min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{\prime}\|^{2})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2})}\right\}, \quad (46)
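The simplified acceptance rule is cheap to evaluate: it needs only the two residual norms, which the CU already computes. A minimal sketch (toy real-valued numbers, names hypothetical), written to avoid overflow by branching before exponentiating:

```python
import numpy as np

def accept_prob(x_cand, x_prev, y, H):
    """Simplified MH acceptance (46): the proposal ratio is omitted, so only
    the residual norms ||y - H x||^2 of the two states are compared."""
    d_cand = np.linalg.norm(y - H @ x_cand) ** 2
    d_prev = np.linalg.norm(y - H @ x_prev) ** 2
    if d_cand <= d_prev:                     # candidate fits at least as well
        return 1.0
    return float(np.exp(d_prev - d_cand))    # worse candidate: prob in (0, 1)

# Toy example: a candidate that lowers the residual is always accepted.
rng = np.random.default_rng(2)
H = rng.standard_normal((8, 2))
x_true = np.array([1.0, -1.0])
y = H @ x_true + 0.1 * rng.standard_normal(8)
print(accept_prob(x_true, -x_true, y, H))    # 1.0: strictly better candidate
```

Accepting worse candidates with nonzero probability is what lets the chain escape local minima while still concentrating on the high-probability region.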

which we used in (28) for our proposed algorithm. By accepting $\mathbf{x}^{\prime}$ as the new state with this probability, the Markov chain approximately satisfies detailed balance, a sufficient condition for aligning the stationary distribution of the Markov chain with the target posterior distribution $\pi$ [18, 47].

Moreover, for any two states $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}$, the transition probability of the Markov chain is given by

T(\mathbf{x}^{\prime}|\mathbf{x})=\left\{\begin{array}{ll} q(\mathbf{x}^{\prime}|\mathbf{x})A(\mathbf{x}^{\prime}|\mathbf{x}), & \text{if } \mathbf{x}^{\prime}\neq\mathbf{x},\\ q(\mathbf{x}|\mathbf{x})+\sum_{\mathbf{u}\neq\mathbf{x}}q(\mathbf{u}|\mathbf{x})\big(1-A(\mathbf{u}|\mathbf{x})\big), & \text{otherwise}, \end{array}\right. \quad (49)

where the second branch holds because the chain remains in state $\mathbf{x}$ either when the candidate is $\mathbf{x}$ itself or when the candidate is some $\mathbf{u}\neq\mathbf{x}$ that gets rejected. With this transition probability, it is easy to show that

T\big(\mathbf{x}_{t}=\mathbf{x}^{\prime}\,|\,\mathbf{x}_{t-1}=\mathbf{x}\big)>0,\;\forall\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}, \quad (50)

implying the chain’s irreducibility by definition [48], and

\gcd\left\{m\in\mathbb{Z}^{+}:T\big(\mathbf{x}_{t+m}=\mathbf{x}\,|\,\mathbf{x}_{t}=\mathbf{x}\big)>0\right\}=1, \quad (51)

where $\mathbb{Z}^{+}$ denotes the set of positive integers and $\gcd$ represents the greatest common divisor, verifying the chain’s aperiodicity by definition [48]. Using these properties, it can be shown that $\pi$ is the unique stationary distribution of the proposed algorithm’s underlying Markov chain [49, Corollary 1.17]. The chain thus approximately converges to the target posterior distribution $\pi$ by the convergence theorem of MCMC [49, Th. 4.9], verifying the convergence of the proposed algorithm.
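The transition-probability construction (49) can be sanity-checked numerically on a toy system. The sketch below (not part of the proof) builds the full transition matrix for BPSK with $U=2$ users using the proposal (38) and the *exact* MH ratio (37), for which detailed balance holds exactly and $\pi$ is verifiably stationary; the proposed detector replaces (37) with the proposal-ratio-free rule (46), which makes this balance only approximate, as argued in Proposition 1.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
A = np.array([-1.0, 1.0])                       # BPSK, U = 2 users
states = [np.array(s) for s in product(A, repeat=2)]
H = rng.standard_normal((4, 2))                 # toy 4 x 2 real channel
y = H @ states[3] + 0.1 * rng.standard_normal(4)
tau, gamma = 0.05, 1.0

grad = lambda x: H.T @ (H @ x - y)              # full-batch gradient of f
pi = np.array([np.exp(-np.linalg.norm(y - H @ x) ** 2) for x in states])
pi /= pi.sum()                                  # target distribution (36)

def q_row(x):
    """Proposal (38), normalized over the whole discrete state space."""
    mu = x - tau * grad(x)
    w = np.array([np.exp(-np.linalg.norm(xp - mu) ** 2 / gamma ** 2)
                  for xp in states])
    return w / w.sum()

n = len(states)
T = np.zeros((n, n))
for i, x in enumerate(states):
    q = q_row(x)
    for j, xp in enumerate(states):
        if i != j:                              # first branch of (49)
            acc = min(1.0, pi[j] * q_row(xp)[i] / (pi[i] * q[j]))
            T[i, j] = q[j] * acc
    T[i, i] = 1.0 - T[i].sum()                  # second branch of (49)

print(np.allclose(pi @ T, pi))                  # True: pi is stationary
```

Every entry of $T$ is positive, matching (50), and the positive self-transitions make the aperiodicity condition (51) immediate for this toy chain.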

References

  • [1] X. Zhou, L. Liang, J. Zhang, C.-K. Wen, and S. Jin, “Decentralized massive MIMO detection using mini-batch gradient-based MCMC,” in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Lucca, Italy, Sep. 2024, pp. 1–5.
  • [2] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
  • [3] S. Yang and L. Hanzo, “Fifty years of MIMO detection: The road to large-scale MIMOs,” IEEE Commun. Surv. Tuts., vol. 17, no. 4, pp. 1941–1988, 2015.
  • [4] E. Björnson, L. Sanguinetti, H. Wymeersch, J. Hoydis, and T. L. Marzetta, “Massive MIMO is a reality—What is next?: Five promising research directions for antenna arrays,” Digit. Signal Process., vol. 94, pp. 3–20, Nov. 2019.
  • [5] A. Puglielli et al., “Design of energy- and cost-efficient massive MIMO arrays,” Proc. IEEE, vol. 104, no. 3, pp. 586–606, Mar. 2016.
  • [6] K. Li, R. R. Sharan, Y. Chen, T. Goldstein, J. R. Cavallaro, and C. Studer, “Decentralized baseband processing for massive MU-MIMO systems,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 7, no. 4, pp. 491–507, Dec. 2017.
  • [7] J. Rodríguez Sánchez, F. Rusek, O. Edfors, M. Sarajlić, and L. Liu, “Decentralized massive MIMO processing exploring daisy-chain architecture and recursive algorithms,” IEEE Trans. Signal Process., vol. 68, pp. 687–700, Jan. 2020.
  • [8] CPRI, “Common public radio interface,” 2019. [Online]. Available: http://www.cpri.info
  • [9] A. Kulkarni, M. A. Ouameur, and D. Massicotte, “Hardware topologies for decentralized large-scale MIMO detection using Newton method,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 9, pp. 3732–3745, Sep. 2021.
  • [10] K. Li, Y. Chen, R. Sharan, T. Goldstein, J. R. Cavallaro, and C. Studer, “Decentralized data detection for massive MU-MIMO on a Xeon Phi cluster,” in Proc. Asilomar Conf. Signals, Syst. Comput., Nov. 2016, pp. 468–472.
  • [11] K. Li, O. Castañeda, C. Jeon, J. R. Cavallaro, and C. Studer, “Decentralized coordinate-descent data detection and precoding for massive MU-MIMO,” in Proc. IEEE Int. Symp. Circuits Syst., May 2019, pp. 1–5.
  • [12] C. Jeon, K. Li, J. R. Cavallaro, and C. Studer, “Decentralized equalization with feedforward architectures for massive MU-MIMO,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4418–4432, Sep. 2019.
  • [13] E. Bertilsson, O. Gustafsson, and E. G. Larsson, “A scalable architecture for massive MIMO base stations using distributed processing,” in Proc. Asilomar Conf. Signals, Syst., Comput., Nov. 2016, pp. 864–868.
  • [14] H. Wang, A. Kosasih, C.-K. Wen, S. Jin, and W. Hardjawana, “Expectation propagation detector for extra-large scale massive MIMO,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2036–2051, Mar. 2020.
  • [15] Z. Zhang, H. Li, Y. Dong, X. Wang, and X. Dai, “Decentralized signal detection via expectation propagation algorithm for uplink massive MIMO systems,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 11 233–11 240, Oct. 2020.
  • [16] Y. Dong, H. Li, C. Gong, X. Wang, and X. Dai, “An enhanced fully decentralized detector for the uplink M-MIMO system,” IEEE Trans. Veh. Technol., vol. 71, no. 12, pp. 13 030–13 042, Dec. 2022.
  • [17] Z. Zhang, Y. Dong, K. Long, X. Wang, and X. Dai, “Decentralized baseband processing with Gaussian message passing detection for uplink massive MU-MIMO systems,” IEEE Trans. Veh. Technol., vol. 71, no. 2, pp. 2152–2157, Feb. 2022.
  • [18] C. M. Bishop, Pattern recognition and machine learning.   New York, NY, USA: Springer, 2006.
  • [19] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov chain Monte Carlo.    Boca Raton, FL, USA: Chapman & Hall, 2011.
  • [20] B. Farhang-Boroujeny, H. Zhu, and Z. Shi, “Markov chain Monte Carlo algorithms for CDMA and MIMO communication systems,” IEEE Trans. Signal Process., vol. 54, no. 5, pp. 1896–1909, May 2006.
  • [21] J. C. Hedstrom, C. H. Yuen, R.-R. Chen, and B. Farhang-Boroujeny, “Achieving near MAP performance with an excited Markov chain Monte Carlo MIMO detector,” IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 7718–7732, Dec. 2017.
  • [22] S. A. Laraway and B. Farhang-Boroujeny, “Implementation of a Markov chain Monte Carlo based multiuser/MIMO detector,” IEEE Trans. Circuits and Syst. I: Reg. Papers, vol. 56, no. 1, pp. 246–255, Jan. 2009.
  • [23] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, “Sampling can be faster than optimization,” Proc. Nat. Acad. Sci., vol. 116, no. 42, pp. 20 881–20 885, Sep. 2019.
  • [24] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 681–688.
  • [25] N. M. Gowda, S. Krishnamurthy, and A. Belogolovy, “Metropolis-Hastings random walk along the gradient descent direction for MIMO detection,” in Proc. IEEE Int. Conf. Commun. (ICC), Montreal, QC, Canada, Jun. 2021, pp. 1–7.
  • [26] N. Zilberstein, C. Dick, R. Doost-Mohammady, A. Sabharwal, and S. Segarra, “Annealed Langevin dynamics for massive MIMO detection,” IEEE Trans. Wireless Commun., vol. 22, no. 6, pp. 3762–3776, Jun. 2023.
  • [27] X. Zhou, L. Liang, J. Zhang, C.-K. Wen, and S. Jin, “Gradient-based Markov chain Monte Carlo for MIMO detection,” IEEE Trans. Wireless Commun., vol. 23, no. 7, pp. 7566–7581, Jul. 2024.
  • [28] Y. Nesterov, Introductory lectures on convex optimization.   Springer Science+Business Media, 2004.
  • [29] H. Li, Y. Dong, C. Gong, X. Wang, and X. Dai, “Decentralized groupwise expectation propagation detector for uplink massive MU-MIMO systems,” IEEE Internet Things J., vol. 10, no. 6, pp. 5393–5405, Mar. 2023.
  • [30] X. Zhao, M. Li, B. Wang, E. Song, T.-H. Chang, and Q. Shi, “Decentralized equalization for massive MIMO systems with colored noise samples,” May 2023. [Online] Available: http://arxiv.org/abs/2305.12805.
  • [31] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr. 1970.
  • [32] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Rev., vol. 60, no. 2, pp. 223–311, Jan. 2018.
  • [33] B. O’Donoghue and E. Candès, “Adaptive restart for accelerated gradient schemes,” Found. Comput. Math., vol. 15, no. 3, pp. 715–732, Jun. 2015.
  • [34] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, “Aspects of favorable propagation in massive MIMO,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Sep. 2014, pp. 76–80.
  • [35] A. Barbu and S.-C. Zhu, “Hamiltonian and Langevin Monte Carlo,” in Monte Carlo methods, A. Barbu and S.-C. Zhu, Eds.   Singapore: Springer, 2020, pp. 281–325.
  • [36] J. Martin, C. L. Wilcox, C. Burstedde, and O. Ghattas, “A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion,” SIAM J. Sci. Comput., vol. 34, no. 3, pp. 1460–1487, Jan. 2012.
  • [37] K. P. Murphy, Probabilistic machine learning: Advanced topics.   Cambridge, MA, USA: MIT Press, 2023.
  • [38] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, “V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,” in Proc. Int. Symp. Signals Syst. Electron. (ISSSE), Oct. 1998, pp. 295–300.
  • [39] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed.   Cambridge, Massachusetts: MIT Press, 2001.
  • [40] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
  • [41] 3GPP TR 36.873, “Study on 3D channel model for LTE (Release 12),” Tech. Rep., Jun. 2017. [Online]. Available: https://www.3gpp.org/ftp/Specs/archive/36_series/36.873/.
  • [42] X. Qian and D. Klabjan, “The impact of the mini-batch size on the variance of gradients in stochastic gradient descent,” Apr. 2020. [Online] Available: http://arxiv.org/abs/2004.13146.
  • [43] S. Jaeckel, L. Raschkowski, K. Börner, and L. Thiele, “QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials,” IEEE Trans. Antennas Propag., vol. 62, no. 6, pp. 3242–3256, Jun. 2014.
  • [44] B. Hassibi, M. Hansen, A. G. Dimakis, H. A. J. Alshamary, and W. Xu, “Optimized Markov chain Monte Carlo for signal detection in MIMO systems: An analysis of the stationary distribution and mixing time,” IEEE Trans. Signal Process., vol. 62, no. 17, pp. 4436–4450, Sep. 2014.
  • [45] R. Zhang, X. Liu, and Q. Liu, “A Langevin-like sampler for discrete distributions,” in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2022, pp. 26 375–26 396.
  • [46] T.-Y. Wu, Y. X. R. Wang, and W. H. Wong, “Mini-batch Metropolis-Hastings with reversible SGLD proposal,” J. Amer. Stat. Assoc., vol. 117, no. 537, pp. 386–394, Jan. 2022.
  • [47] A. Doucet and X. Wang, “Monte Carlo methods for signal processing: A review in the statistical signal processing context,” IEEE Signal Process. Mag., vol. 22, no. 6, pp. 152–170, Nov. 2005.
  • [48] J. R. Norris, Markov Chains.   Cambridge, U.K.: Cambridge Univ. Press, 1997.
  • [49] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times.   Providence, RI, USA: Amer. Math. Soc., 2008.