
Mini-Batch Gradient-Based MCMC for Decentralized Massive MIMO Detection

Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, and Shi Jin

This paper was accepted in part for presentation at the International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Lucca, Italy, in September 2024 [1].
X. Zhou, J. Zhang, and S. Jin are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: xy_zhou@seu.edu.cn; jingzhang@seu.edu.cn; jinshi@seu.edu.cn).
L. Liang is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China, and also with Purple Mountain Laboratories, Nanjing 211111, China (e-mail: lliang@seu.edu.cn).
C.-K. Wen is with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (e-mail: chaokai.wen@mail.nsysu.edu.tw).
Abstract

Massive multiple-input multiple-output (MIMO) technology has significantly enhanced spectral and power efficiency in cellular communications and is expected to further evolve towards extra-large-scale MIMO. However, centralized processing for massive MIMO faces practical obstacles, including excessive computational complexity and a substantial volume of baseband data to be exchanged. To address these challenges, decentralized baseband processing has emerged as a promising solution. This approach involves partitioning the antenna array into clusters with dedicated computing hardware for parallel processing. In this paper, we investigate the gradient-based Markov chain Monte Carlo (MCMC) method—an advanced MIMO detection technique known for its near-optimal performance in centralized implementation—within the context of a decentralized baseband processing architecture. This decentralized design mitigates the computation burden at a single processing unit by utilizing computational resources in a distributed and parallel manner. Additionally, we integrate the mini-batch stochastic gradient descent method into the proposed decentralized detector, achieving remarkable performance with high efficiency. Simulation results demonstrate substantial performance gains of the proposed method over existing decentralized detectors across various scenarios. Moreover, complexity analysis reveals the advantages of the proposed decentralized strategy in terms of computation delay and interconnection bandwidth when compared to conventional centralized detectors.

Index Terms:
Massive MIMO, decentralized baseband processing, data detection, Markov chain Monte Carlo, mini-batch stochastic gradient descent.

I Introduction

MASSIVE multiple-input multiple-output (MIMO) technology has emerged as a promising solution to address the increasing demand for high data rates and improved spectral efficiency in wireless communication systems of the fifth generation and beyond [2]. By deploying a large number of antennas at the base station (BS), it becomes feasible to achieve a substantial enhancement in network capacity and throughput. However, the utilization of an extremely large antenna array introduces high-dimensional signal processing tasks, particularly challenging in MIMO uplinks, where accurate detection of signals from the users is required to fulfill the potential of massive MIMO [3, 4].

Many existing MIMO detection schemes, such as the optimal maximum likelihood (ML) detector and various linear and nonlinear detectors [3], are designed according to the conventional centralized baseband processing architecture. However, this architecture faces several challenges when applied to extremely large-scale MIMO systems. In this setup, a central unit (CU) assumes responsibility for the entire symbol detection process, imposing stringent demands on the computational capability, baseband interconnection bandwidth, and storage capacity of a single unit [5, 6, 7]. Notably, existing MIMO detectors inherently involve matrix operations, such as matrix multiplications or inversions, which become impractical on a single computing device as the antenna count increases due to substantial computational complexity. Additionally, for centralized MIMO detection, the transfer of global channel state information and raw baseband data from all BS antennas to the CU is required. This process demands significant interconnection bandwidth, often surpassing data rate limits of existing interconnection standards or chip I/O interfaces [8, 6, 5].

To address these challenges, decentralized baseband processing (DBP) has been proposed and gained significant attention recently [6]. The concept behind DBP is to partition the large-scale antenna array into a few smaller disjoint antenna clusters, each with local radio frequency (RF) chains and independent baseband processing modules located in distributed units (DUs). The DUs handle local baseband data received from RF chains within their respective clusters. This DBP approach distributes the demanding signal processing tasks across parallel procedures on multiple computing units, like field-programmable gate arrays [9] or graphics processing units [10], substantially reducing computation demands at the CU and overall processing latency. Moreover, the requirement for raw baseband data transfer between each baseband unit and the corresponding RF circuitry is considerably diminished compared to the centralized architecture, as only a subset of BS antennas needs to transmit their data to the relevant processing module.

Despite the potential advantages of the DBP architecture, research into customized decentralized MIMO detection schemes within this framework remains incomplete. In particular, the exploration of high-performance decentralized detectors is as important as it is currently lacking. The pioneering work [6] decentralized the conjugate gradient algorithm and the alternating direction method of multipliers (ADMM), resulting in iterative consensus-sharing schemes that approach the performance of the centralized linear minimum mean square error (LMMSE) method; the latter, however, is widely known to be suboptimal. Similar endeavors were pursued in [9] and [11] to decentralize the Newton and coordinate descent methods, respectively, aiming to bypass the intricate matrix inversions involved in centralized linear detectors. Furthermore, the authors of [12] developed decentralized feedforward architectures [5, 13] for the classical maximum ratio combining, zero-forcing, and LMMSE detectors, considerably reducing system latency. Nonetheless, these linear detectors are markedly suboptimal, and the fully decentralized scheme suffers substantial performance degradation as the number of clusters increases.

Instead of decentralizing linear detectors, an alternative research direction explores the decentralized design of nonlinear message passing-based detection methods. Specifically, the authors of [12] also applied the large-MIMO approximate message passing (LAMA) algorithm within partially and fully decentralized feedforward architectures. However, LAMA becomes unstable when the channel fading coefficients deviate from the independent and identically distributed (IID) Gaussian assumption. In [14], a decentralized detector based on expectation propagation (EP) was proposed, demonstrating performance advantages over various decentralized linear detectors [6, 10, 11]. Nonetheless, the decentralized EP method requires iterative matrix inversions at each DU, posing challenges for computational efficiency as the number of users grows. Other message passing detector variants tailored to the decentralized architecture have been explored [15, 16, 17]. Regrettably, these endeavors have yet to strike a desirable balance between performance and complexity.

As evident from the discussion above, existing decentralized MIMO detectors often struggle with suboptimal trade-offs between performance and complexity, particularly in scenarios with high user density or strongly correlated channels. These limitations underscore the need to explore advanced detection techniques for formulating more competitive decentralized detectors. Recently, Markov chain Monte Carlo (MCMC) has demonstrated its potential to overcome the limitations of deterministic Bayesian inference approaches (e.g., EP) [18, 19]. MCMC-based MIMO detectors are favored for their near-optimal performance [20, 21] and hardware-friendly architecture [22]. Furthermore, the combination of gradient descent (GD) optimization and MCMC offers a framework to efficiently solve the optimum search problem of MIMO detection [23, 24, 25, 26]. This scheme employs GD to explore important regions of the discrete search space and utilizes the random walk and acceptance test of MCMC to generate samples conforming to the posterior distribution of the transmit symbols [27]. Thus, the gradient-based MCMC method holds the potential to achieve near-ML performance, unattainable by LMMSE or message passing-based detectors, while maintaining acceptable complexity. Notably, recent work [27] combines MCMC with Nesterov's accelerated gradient (NAG) method [28] to derive a NAG-aided MCMC (NAG-MCMC) detector with near-optimal performance, high efficiency, and excellent scalability. However, existing literature has not explored the decentralized design of potent MCMC-based MIMO detectors and their gradient-based variations.

Motivated by the potential benefits of the DBP architecture and the limited exploration of decentralized detectors within this framework, this study seeks to establish a decentralized gradient-based MCMC detector operating within the DBP framework. The central goal is to mitigate computational and interconnection bandwidth limitations associated with conventional centralized processing, all while upholding near-optimal detection performance. The contributions of this paper are summarized as follows:

  • Decentralized Gradient-Based MCMC Detector for DBP Architecture: This study pioneers the decentralization of the gradient-based MCMC method for MIMO detection within the DBP architecture, employing both star and daisy-chain topologies. The proposed decentralized detector inherits the remarkable detection performance and scalability inherent in the original gradient-based MCMC algorithm. Decentralization enables distributed and parallel computations, efficiently redistributing the computation burden from the CU to multiple DUs. Consequently, our proposed detector exhibits high computational efficiency in terms of latency.

  • Mini-Batch-Based Gradient Updating for Efficiency Enhancement: We introduce a mini-batch-based gradient updating rule for the decentralized gradient-based MCMC detector, resulting in the Mini-NAG-MCMC algorithm. This innovation significantly reduces computation and interconnection demands by selecting only a mini-batch of DUs for gradient computations per iteration. The integration of stochastic gradients within the updating rule synergizes with the MCMC technique, facilitating escape from local minima and achieving near-optimal performance. Additionally, we leverage the asymptotically favorable propagation property of massive MIMO systems, incorporating an approximation into the proposed detector to further reduce overhead.

  • Comprehensive Analysis and Performance Validation: We conduct a comprehensive analysis of the proposed method’s computational complexity and interconnection bandwidth, showcasing its superiority over centralized schemes. Furthermore, extensive simulations validate the method’s performance superiority over existing decentralized detectors across diverse scenarios, including IID Rayleigh fading MIMO channels and realistic 3rd Generation Partnership Project (3GPP) channel models. These results affirm the proposed method’s effective balance between performance and complexity.

Figure 1: Block diagram of the uplink massive MIMO system with the DBP architecture. The solid lines depict the connections between CU and DUs in the star DBP topology, while the dashed lines depict the counterparts in the daisy-chain DBP topology.

Notations: Boldface letters denote column vectors or matrices. $\mathbf{0}$ denotes a zero vector, and $\mathbf{I}$ denotes an identity matrix. $\mathbf{A}^{T}$, $\mathbf{A}^{\rm H}$, and $\mathbf{A}^{-1}$ denote the transpose, conjugate transpose, and inverse of the matrix $\mathbf{A}$, respectively. ${\rm diag}(\mathbf{A})$ generates a diagonal matrix by assigning zero values to the off-diagonal elements of $\mathbf{A}$. $\|\cdot\|$ is the $l_{2}$ norm, and $\|\cdot\|_{F}$ is the Frobenius norm. $\mathbb{E}[\cdot]$ represents the expectation of random variables. $\mathcal{CN}(\mu,\sigma^{2})$ denotes a circularly symmetric complex Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. $\mathcal{U}(a,b)$ indicates a uniform distribution on $[a,b]$. $\mathbb{R}$ and $\mathbb{C}$ are the sets of real and complex numbers, respectively. The set $[K]=\{1,2,\ldots,K\}$ contains all positive integers up to $K$.

II System Model

In this section, we first present the uplink MIMO system model and the associated MIMO detection problem. We then elaborate on the DBP architecture for the MIMO uplink and discuss fundamental aspects of designing decentralized MIMO detectors.

II-A Uplink MIMO System Model

Consider an uplink massive MIMO system with $U$ single-antenna users and a BS equipped with $B\geq U$ receiving antennas, as shown in Fig. 1. Each user independently encodes its information bit stream, modulates the coded bits using the quadrature amplitude modulation (QAM) scheme, and transmits the modulated symbols over the wireless channel to the BS. (We consider a flat-fading channel in the system model; however, the proposed method can be readily generalized to handle frequency selectivity by using orthogonal frequency division multiplexing (OFDM), in which case MIMO detection is conducted independently on each flat-fading subcarrier.) The input-output relationship of this uplink transmission is given by

\mathbf{y}=\mathbf{Hx}+\mathbf{n}, (1)

where $\mathbf{y}\in\mathbb{C}^{B\times 1}$ is the received signal at the BS, $\mathbf{H}\in\mathbb{C}^{B\times U}$ is the uplink channel matrix, $\mathbf{x}\in\mathcal{A}^{U\times 1}$ consists of the transmit symbols from all users, with $\mathcal{A}$ representing the QAM lattice set of cardinality $M$ and unit symbol energy, and $\mathbf{n}\in\mathbb{C}^{B\times 1}$ represents the IID circular complex Gaussian noise with zero-mean, variance-$\sigma^{2}$ entries. The signal-to-noise ratio (SNR) is given by $\mathrm{SNR}=\mathbb{E}[\|\mathbf{Hx}\|^{2}]/\mathbb{E}[\|\mathbf{n}\|^{2}]$. The objective of uplink MIMO detection is to estimate the transmit vector $\mathbf{x}$ given the received signal $\mathbf{y}$ and knowledge of the uplink channel matrix $\mathbf{H}$. The optimal ML solution is given by

\mathbf{x}^{\ast}=\underset{\mathbf{x}\in\mathcal{A}^{U\times 1}}{\arg\min}\;\|\mathbf{y}-\mathbf{Hx}\|^{2}. (2)

Note that the finite-alphabet constraint on $\mathbf{x}$ makes optimal MIMO detection an integer least-squares problem, known to be non-deterministic polynomial-time hard (NP-hard) [3].
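For a toy-sized system, the ML search in (2) can be carried out by exhaustive enumeration, which also makes the exponential $M^{U}$ cost of the problem concrete. The sketch below is illustrative only: the system sizes, random seed, noise level, and QPSK alphabet are assumptions, not values from the paper.

```python
import itertools
import numpy as np

# Toy brute-force ML detector for Eq. (2); U = 2 users, B = 4 antennas,
# unit-energy QPSK. All sizes and the noise level are illustrative.
rng = np.random.default_rng(0)
U, B = 2, 4
A = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # QPSK lattice

H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.choice(A, size=U)
n = 0.05 * (rng.standard_normal(B) + 1j * rng.standard_normal(B))
y = H @ x_true + n

# Exhaustive search over all M^U = 16 candidate vectors.
candidates = [np.array(c) for c in itertools.product(A, repeat=U)]
x_ml = min(candidates, key=lambda x: np.linalg.norm(y - H @ x) ** 2)
```

Since $x_{\rm true}$ itself is among the candidates, the ML residual is never worse than the residual at the true vector; at 16-QAM with $U=8$ users the same loop would already need $16^{8}\approx 4\times 10^{9}$ evaluations, which is why (2) is intractable at scale.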

II-B Decentralized Baseband Processing

To deal with the heavy interconnection and computation burdens due to the use of extra-large-scale antenna arrays, DBP has been proposed as a promising alternative to traditional centralized processing [6]. Throughout the paper, we assume that the massive MIMO BS possesses the DBP architecture shown in Fig. 1. The BS antenna array is partitioned into $C\geq 1$ antenna clusters, where the $c$-th cluster is associated with $B_{c}$ antennas ($c\in[C]$), and $\sum_{c=1}^{C}B_{c}=B$. Without loss of generality, we assume that each cluster is of equal size, i.e., $B_{c}=B/C$. Each cluster is equipped with local RF chains that are connected to dedicated hardware referred to as the DU.

We consider two widely adopted DBP topologies, namely the star and daisy-chain topologies [6, 7, 9, 29, 30], both depicted in Fig. 1. We assume the data links among the CU and DUs are bidirectional for both topologies. In the star topology, all DUs are directly connected to the CU (solid lines), so the information exchange between the CU and each DU proceeds in parallel. In the daisy-chain topology, the DUs are connected sequentially [7] (dashed lines), with a single connection to the CU at the end; communication is therefore limited to adjacent units, and the information exchange is performed sequentially. Each topology has distinct strengths in the trade-off among throughput, latency, and interconnection bandwidth [9]. The star topology is well recognized for its low data-transfer latency owing to the simultaneous information exchange between the CU and each DU, while the daisy-chain topology offers a constant interconnection bandwidth regardless of the number of DUs, as will be shown in Sec. IV-B.

Under the DBP architecture, the uplink channel matrix, received signal, and noise vector are partitioned row-wise as $\mathbf{H}=[\mathbf{H}_{1}^{T},\ldots,\mathbf{H}_{C}^{T}]^{T}$, $\mathbf{y}=[\mathbf{y}_{1}^{T},\ldots,\mathbf{y}_{C}^{T}]^{T}$, and $\mathbf{n}=[\mathbf{n}_{1}^{T},\ldots,\mathbf{n}_{C}^{T}]^{T}$, respectively. Hence, the signal model at the $c$-th antenna cluster can be written as

\mathbf{y}_{c}=\mathbf{H}_{c}\mathbf{x}+\mathbf{n}_{c},\;c\in[C], (3)

where $\mathbf{H}_{c}\in\mathbb{C}^{B_{c}\times U}$ is the channel matrix between the $U$ users and the $c$-th antenna cluster, and $\mathbf{y}_{c}\in\mathbb{C}^{B_{c}\times 1}$ and $\mathbf{n}_{c}\in\mathbb{C}^{B_{c}\times 1}$ represent the local received signal and noise vector of the $c$-th antenna cluster, respectively.

Each DU $c$ exclusively utilizes its local channel matrix $\mathbf{H}_{c}$ and received signal $\mathbf{y}_{c}$ to compute intermediate detection results. (We assume that the local channel matrix $\mathbf{H}_{c}$ is perfectly known at each DU since we focus on MIMO detector design in this paper; in realistic scenarios, this knowledge can be obtained by performing local channel estimation.) The CU works as a coordinator: it periodically aggregates the intermediate information from the DUs and forms the global consensus [6], which is then broadcast to the DUs for subsequent updates, and it finally produces the estimate. In this decentralized scheme, the heavy computation burden at the CU can be offloaded to the DUs and performed in a parallel and distributed fashion, significantly reducing the overall computation latency. Meanwhile, the DUs can process the locally stored data to obtain low-dimensional intermediate information instead of directly transmitting full-dimensional data, thus saving interconnection bandwidth.
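The row-wise partitioning of the DBP model (3) can be sketched numerically as below. The sizes and seed are illustrative assumptions, and `np.split` stands in for assigning antenna clusters to DUs; stacking the local models recovers the global model (1).

```python
import numpy as np

# Illustrative sizes: B = 8 BS antennas, U = 2 users, C = 4 equal clusters.
rng = np.random.default_rng(1)
B, U, C = 8, 2, 4
Bc = B // C  # equal-size clusters, B_c = B / C

H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)
n = rng.standard_normal(B) + 1j * rng.standard_normal(B)
y = H @ x + n

# Each DU c holds only its local slice (H_c, y_c), Eq. (3).
H_locals = np.split(H, C, axis=0)
y_locals = np.split(y, C)

# Concatenating the C local models y_c = H_c x + n_c recovers y = Hx + n.
y_rebuilt = np.concatenate([Hc @ x for Hc in H_locals]) + n
```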

III Mini-Batch Gradient-Based MCMC

In this section, we first outline the basics of the gradient-based MCMC method from [27]. Then, we detail the development of the decentralized gradient-based MCMC detector, Mini-NAG-MCMC, customized for the DBP architecture.

III-A Basics of Gradient-Based MCMC

The MCMC method samples from the target posterior distribution using a Markov chain $\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{t},\ldots,\mathbf{x}_{S}$, where $S$ is the number of sampling iterations. The state transition $\mathbf{x}_{t-1}\to\mathbf{x}_{t}$ in this chain is regulated by a transition kernel $T(\mathbf{x}|\mathbf{x}_{t-1})$, specifically designed to align the chain's stationary distribution with the target distribution. Hence, the realization of this chain can be utilized to (approximately) generate samples from the target distribution, facilitating statistical inference. Notably, the integration of MCMC and the GD method has recently emerged as a promising approach for resolving nonconvex optimization challenges [23]. The underlying concept of this gradient-based MCMC approach is to leverage gradients and the Metropolis-Hastings (MH) criterion [31, 25] in the design of the transition kernel, expediting the Markov chain's convergence to the target distribution. When using gradient-based MCMC for MIMO detection, each state of the Markov chain corresponds to a sample of $\mathbf{x}$ over the discrete space $\mathcal{A}^{U\times 1}$. The initial sample $\mathbf{x}_{0}$ is either randomly generated or transferred from another detector. The transition from $\mathbf{x}_{t-1}$ to $\mathbf{x}_{t}$, i.e., the generation of the $t$-th sample $\mathbf{x}_{t}$, includes the following steps [25, 27]:

III-A1 GD

The constraint $\mathbf{x}\in\mathcal{A}^{U\times 1}$ on the transmit vector can be relaxed to the continuous complex space $\mathbb{C}^{U\times 1}$ to allow the use of gradients. With this relaxation, the GD method can be performed to provide efficient navigation to important regions of the discrete search space. Initializing $\mathbf{z}_{0}=\mathbf{x}_{t-1}$, where $\mathbf{x}_{t-1}$ is the previous sample, the GD iteration takes the basic form

\mathbf{z}_{k}=\mathbf{z}_{k-1}-\tau\nabla f(\mathbf{z}_{k-1}),\;k\in[N_{\rm g}], (4)

where $k$ is the index of the GD iteration, $N_{\rm g}$ is the total number of GD iterations, $\tau$ is the learning rate, $f(\cdot)$ is the objective function of MIMO detection, generally chosen as the squared residual norm of the estimate, $f(\mathbf{x})\propto\|\mathbf{y}-\mathbf{Hx}\|^{2}$, and $\nabla f(\cdot)$ is the gradient of the objective function.
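A minimal sketch of this relaxed GD recursion, assuming the quadratic objective $f(\mathbf{x})=\frac{1}{2}\|\mathbf{y}-\mathbf{Hx}\|^{2}$ with gradient $-\mathbf{H}^{\rm H}(\mathbf{y}-\mathbf{Hx})$; the sizes, seed, noiseless setup, and Frobenius-norm learning rate are illustrative assumptions.

```python
import numpy as np

# Relaxed GD iteration of Eq. (4) on f(x) = ||y - Hx||^2 / 2.
rng = np.random.default_rng(2)
B, U, Ng = 16, 4, 20
H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.standard_normal(U) + 1j * rng.standard_normal(U)
y = H @ x_true  # noiseless for clarity

grad = lambda v: -H.conj().T @ (y - H @ v)          # gradient of f
tau = 1.0 / np.linalg.norm(H.conj().T @ H, "fro")   # conservative step, tau <= 1/L

z = np.zeros(U, dtype=complex)  # z_0 (here an all-zero start instead of x_{t-1})
for _ in range(Ng):
    z = z - tau * grad(z)       # Eq. (4)
```

Because $\tau\leq 1/\lambda_{\max}(\mathbf{H}^{\rm H}\mathbf{H})$, each step decreases the residual, steering the iterate toward the important region of the search space before the discrete sampling steps below.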

III-A2 Random Walk and QAM Mapping

After a preset number of GD steps, we generate the MCMC candidate sample $\mathbf{x}^{[\rm cand]}$ as follows:

\mathbf{x}^{[\rm cand]}=Q(\mathbf{z}_{N_{\rm g}}+\mathbf{d}), (5)

where $\mathbf{z}_{N_{\rm g}}$ is the output of GD, $\mathbf{d}$ is a random Gaussian vector that represents the perturbation introduced by the random walk, and the function $Q(\cdot)$ maps each element of its input to the closest QAM lattice point. The random walk prevents the algorithm from getting trapped in local minima along the GD direction, and the QAM mapping discretizes the continuous update to obtain valid candidate samples.
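The mapping $Q(\cdot)$ and the random-walk perturbation of (5) can be sketched as follows; the 16-QAM levels, perturbation scale, seed, and input values are illustrative assumptions.

```python
import numpy as np

# Per-dimension levels of a unit-energy 16-QAM constellation (illustrative).
LEVELS = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(10)

def Q(z):
    """Element-wise nearest-neighbor mapping onto the QAM lattice, Eq. (5)."""
    snap = lambda v: LEVELS[np.argmin(np.abs(LEVELS[None, :] - v[:, None]), axis=1)]
    return snap(z.real) + 1j * snap(z.imag)

rng = np.random.default_rng(3)
z_Ng = np.array([0.2 + 0.9j, -0.8 - 0.1j])  # GD output (illustrative values)
d = 0.1 * (rng.standard_normal(2) + 1j * rng.standard_normal(2))  # random walk
x_cand = Q(z_Ng + d)  # valid candidate on the QAM lattice
```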

III-A3 Acceptance Test

The MH criterion [25] is applied to accept the candidate sample with a probability given by

\alpha=\min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{[\rm cand]}\|^{2})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}_{t-1}\|^{2})}\right\}, (6)

which approximately aligns the Markov chain's stationary distribution with the target posterior distribution. This acceptance probability of $\mathbf{x}^{[\rm cand]}$ can be realized by the following rule:

\mathbf{x}_{t}=\begin{cases}\mathbf{x}^{[\rm cand]}, & \text{if }\alpha\geq\nu\text{ (accept)},\\ \mathbf{x}_{t-1}, & \text{otherwise (reject)},\end{cases} (7)

where $\nu$ is drawn from the uniform distribution $\mathcal{U}(0,1)$.
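A sketch of the acceptance test (6)-(7); the function name and toy inputs are illustrative. Computing $\alpha$ through the difference of squared residuals avoids overflow in the exponentials of (6).

```python
import numpy as np

def mh_accept(y, H, x_cand, x_prev, rng):
    """Metropolis-Hastings accept/reject step, Eqs. (6)-(7)."""
    delta = np.linalg.norm(y - H @ x_cand) ** 2 - np.linalg.norm(y - H @ x_prev) ** 2
    alpha = min(1.0, np.exp(-delta))          # acceptance probability (6)
    nu = rng.uniform()                         # nu ~ U(0, 1)
    return x_cand if alpha >= nu else x_prev   # rule (7)

rng = np.random.default_rng(4)
H = np.eye(2)
y = np.array([1.0, 1.0])
x_prev = np.array([0.0, 0.0])
x_cand = np.array([1.0, 1.0])  # strictly lower residual => alpha = 1, always accepted
x_t = mh_accept(y, H, x_cand, x_prev, rng)
```

Note that a candidate with a lower residual is accepted with probability one, while a worse candidate is still accepted occasionally, which is what lets the chain escape local minima.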

Running the gradient-based MCMC sampling for several iterations yields a sample list $\mathcal{X}=\{\mathbf{x}_{t}\}_{t=0}^{S}$ containing important samples for inference. This sample list can be used to generate the hard decision by choosing the best sample according to

\hat{\mathbf{x}}=\underset{\mathbf{x}\in\mathcal{X}}{\arg\min}\;\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}. (8)
Figure 2: Mini-batch gradient-based MCMC detector implemented using the star DBP topology. The implementation using the daisy-chain topology is similar and thus omitted.

III-B Mini-Batch Gradient-Based MCMC for Decentralized MIMO Detection

Next, we present the decentralized gradient-based MCMC algorithm tailored for the DBP architecture. The central concept revolves around distributing the resource-intensive gradient computations across multiple DUs, thereby diminishing computation latency and bolstering overall efficiency.

We first provide some useful notations for ease of illustration. The local objective function at each DU $c\in[C]$ is defined as

f_{c}(\mathbf{x})=\frac{1}{2}\|\mathbf{y}_{c}-\mathbf{H}_{c}\mathbf{x}\|^{2}. (9)

Therefore, the local gradient is given by

\nabla f_{c}(\mathbf{x})=-\mathbf{H}_{c}^{\rm H}(\mathbf{y}_{c}-\mathbf{H}_{c}\mathbf{x}). (10)

The optimal MIMO detection problem in (2) can be rewritten as

\mathbf{x}^{\ast}=\underset{\mathbf{x}\in\mathcal{A}^{U\times 1}}{\arg\min}\;f(\mathbf{x}), (11)

where the global objective function $f(\mathbf{x})$ is given by

f(\mathbf{x})\triangleq\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}=\sum_{c=1}^{C}f_{c}(\mathbf{x}). (12)

The decomposition of the global objective function into multiple local objective functions allows the gradients to be calculated on each DU as shown in (10). These local gradients can then be collected at the CU to derive the full gradient $\nabla f(\mathbf{x})=\sum_{c=1}^{C}\nabla f_{c}(\mathbf{x})$, which enables the GD-based optimization in (4). The distributed computations alleviate the computation burden of directly calculating the full gradient at the CU.
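The additivity of the local gradients (10) can be checked numerically: summing the per-DU gradients reproduces the full gradient of (12) exactly. Sizes and seed below are illustrative.

```python
import numpy as np

# Check: sum of local gradients (10) equals the full gradient -H^H (y - Hx).
rng = np.random.default_rng(5)
B, U, C = 12, 3, 4
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
y = rng.standard_normal(B) + 1j * rng.standard_normal(B)
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)

H_locals, y_locals = np.split(H, C, axis=0), np.split(y, C)
local_grads = [-Hc.conj().T @ (yc - Hc @ x) for Hc, yc in zip(H_locals, y_locals)]

full_grad = -H.conj().T @ (y - H @ x)
```

Each local gradient is a length-$U$ vector, so the per-iteration uplink traffic of this scheme is $U$ complex values per participating DU rather than the $B_{c}$-dimensional raw data.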

However, the decentralization of the standard full-batch GD in (4) does not reduce the overall computation cost. Moreover, this decentralized full-batch GD requires information exchange between the CU and all DUs in each iteration, resulting in high interconnection costs. To further mitigate the computation and interconnection requirements associated with the full-batch GD, we turn to the mini-batch stochastic GD method [32], which shows promise in efficiently optimizing objective functions in the finite-sum form of (12). This method serves as an efficient alternative to the full-batch GD in (4), offering the potential for rapid convergence and overhead reduction. Specifically, we introduce a mini-batch-based gradient updating rule tailored to the DBP architecture, as detailed below.

In each GD iteration $k$, the estimate is not updated using the full gradient $\nabla f(\cdot)$. Instead, a subset of DUs, termed a mini-batch and denoted as $\mathcal{I}$, is randomly selected for the update. The mini-batch contains $m=|\mathcal{I}|$ elements, where $m$ is the batch size, and we assume that the number of antenna clusters $C$ is divisible by $m$. The mini-batch-based GD iteration operates as

\mathbf{z}_{k}=\mathbf{z}_{k-1}-\tau\nabla f_{\mathcal{I}}(\mathbf{z}_{k-1}), (13)

where $\nabla f_{\mathcal{I}}(\cdot)$ denotes the estimated gradient based on the mini-batch $\mathcal{I}$, expressed as

\nabla f_{\mathcal{I}}(\mathbf{z}_{k-1})=\frac{C}{m}\sum_{c\in\mathcal{I}}\nabla f_{c}(\mathbf{z}_{k-1}), (14)

where the scaling factor $C/m$ ensures the unbiasedness of the mini-batch gradient as an estimator of the full gradient [32], that is,

\mathbb{E}[\nabla f_{\mathcal{I}}(\mathbf{x})]=\nabla f(\mathbf{x}), (15)

with the expectation taken over all possible selections of the mini-batch $\mathcal{I}$. This guarantees the asymptotic convergence of the mini-batch GD method to the minimum of the continuously relaxed version of problem (11).
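The unbiasedness (15) can be verified numerically by averaging the scaled mini-batch gradient (14) over all $\binom{C}{m}$ equally likely mini-batches, which recovers the full gradient exactly. Sizes and seed are illustrative.

```python
import numpy as np
from itertools import combinations

# Numerical check of Eq. (15): E[grad_I] = full gradient.
rng = np.random.default_rng(6)
B, U, C, m = 8, 2, 4, 2
H = rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))
y = rng.standard_normal(B) + 1j * rng.standard_normal(B)
x = rng.standard_normal(U) + 1j * rng.standard_normal(U)

H_locals, y_locals = np.split(H, C, axis=0), np.split(y, C)
g = [-Hc.conj().T @ (yc - Hc @ x) for Hc, yc in zip(H_locals, y_locals)]  # (10)

batches = list(combinations(range(C), m))                     # all size-m subsets
mb_grads = [(C / m) * sum(g[c] for c in batch) for batch in batches]  # (14)
avg_grad = sum(mb_grads) / len(batches)                       # E over uniform I
full_grad = -H.conj().T @ (y - H @ x)
```

The identity holds because each DU index appears in a fraction $m/C$ of the subsets, which the scaling factor $C/m$ cancels exactly.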

Remark 1

As shown in (14), only a portion of all DUs is needed for gradient computations and information exchange with the CU per mini-batch GD iteration, thus providing significant design flexibility for balancing performance and complexity by adjusting the mini-batch size. Moreover, simulation results reveal that employing mini-batch-based gradients with a sufficiently large mini-batch size (e.g., $m=C/2$) leads to virtually identical performance compared to using full gradients. Hence, the proposed approach enables substantial savings in computational resources and interconnection bandwidth.

With the proposed mini-batch-based gradient updating rule in mind, we next introduce the Mini-NAG-MCMC algorithm for decentralized MIMO detection. For ease of presentation, in the remainder of this paper we consider the star topology displayed in Fig. 2 unless noted otherwise. We assume that a total of $S$ samples are generated to perform the final inference for MIMO detection. The generation of each sample $\mathbf{x}_{t}$ is divided into two stages, detailed in the following.

III-B1 Mini-Batch GD Stage

In the mini-batch GD stage, we select NAG as the GD component owing to its convergence rate, which is optimal among first-order gradient-based optimization algorithms [28]. NAG incorporates momentum in the update rule to expedite and stabilize convergence. To integrate the NAG method into the DBP architecture, in each GD iteration $k$, the momentum is incorporated at the CU to derive the intermediate point $\boldsymbol{p}_{k}$ as

\boldsymbol{p}_{k}=\mathbf{z}_{k-1}+\rho_{k}\Delta\mathbf{z}_{k-1}, (16)

where $\Delta\mathbf{z}_{k-1}=\mathbf{z}_{k-1}-\mathbf{z}_{k-2}$ is the previous update stored at the CU ($\Delta\mathbf{z}_{0}=\mathbf{0}$), and the momentum factor $\rho_{k}\in[0,1]$ determines the proportion of the previous update that is added to the current update.

The intermediate point $\boldsymbol{p}_{k}$ is then broadcast to the DUs belonging to the randomly chosen mini-batch $\mathcal{I}$. Upon receiving the broadcast, each DU $c\in\mathcal{I}$ computes the local gradient $\nabla f_{c}(\boldsymbol{p}_{k})$ in parallel based on (10). These local gradients are then uploaded and aggregated at the CU to perform the descent from $\boldsymbol{p}_{k}$ and update the momentum:

\mathbf{z}_{k}=\boldsymbol{p}_{k}-\tau\frac{C}{m}\sum_{c\in\mathcal{I}}\nabla f_{c}(\boldsymbol{p}_{k}), (17)
\Delta\mathbf{z}_{k}=\mathbf{z}_{k}-\mathbf{z}_{k-1}. (18)

The NAG procedure is iterated until a preset number of iterations $N_{\rm g}$ is reached. The output of NAG, $\mathbf{z}_{N_{\rm g}}$, is then used as the input for the subsequent MCMC sampling.
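The recursion (16)-(18) at the CU can be sketched as below. For clarity the full gradient is used (a mini-batch would replace `grad(p)` by the scaled sum over $\mathcal{I}$); the sizes, seed, noiseless setup, and the particular momentum schedule are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

# NAG recursion (16)-(18) on f(x) = ||y - Hx||^2 / 2.
rng = np.random.default_rng(7)
B, U, Ng = 16, 4, 30
H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.standard_normal(U) + 1j * rng.standard_normal(U)
y = H @ x_true

grad = lambda v: -H.conj().T @ (y - H @ v)
tau = 1.0 / np.linalg.norm(H.conj().T @ H, "fro")  # learning rate as in (21)

z = np.zeros(U, dtype=complex)
dz = np.zeros(U, dtype=complex)  # Delta z_0 = 0
for k in range(1, Ng + 1):
    rho = (k - 1) / (k + 2)      # a standard Nesterov momentum schedule (illustrative)
    p = z + rho * dz             # (16): intermediate (look-ahead) point
    z_new = p - tau * grad(p)    # (17): descent from p
    dz = z_new - z               # (18): momentum update
    z = z_new
```

In the decentralized algorithm, (16) and (18) run at the CU while the gradient evaluation inside (17) is farmed out to the selected DUs.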

One critical parameter that requires careful selection in the NAG is the learning rate $\tau$. From [28], a general learning rate choice for an $L$-Lipschitz smooth objective function $f(\cdot)$ is

0<\tau\leq\frac{1}{L}, (19)

where $L$ is the Lipschitz constant, and choosing the upper bound $\tau=\frac{1}{L}$ is beneficial for convergence. For the objective function $f(\cdot)$ in (12), we have

L=\lambda_{\max}(\mathbf{G}), (20)

where $\mathbf{G}=\mathbf{H}^{\rm H}\mathbf{H}$ is the global Gram matrix and $\lambda_{\max}(\cdot)$ takes the largest eigenvalue of $\mathbf{G}$ [27, 33]. To avoid the computationally expensive eigenvalue decomposition, it is recommended in [27] to replace $\lambda_{\max}(\mathbf{G})$ with its upper bound $\|\mathbf{G}\|_{F}$ and set the learning rate as

\tau=\frac{1}{\|\mathbf{G}\|_{F}}. (21)

A straightforward way to derive $\tau$ in (21) under the DBP architecture is to accumulate the local Gram matrices from all DUs, forming the global Gram matrix at the CU for calculating (21). Specifically, during the preprocessing period, each DU $c$ computes the local Gram matrix

\mathbf{G}_{c}=\mathbf{H}_{c}^{\rm H}\mathbf{H}_{c},\;c\in[C], (22)

which is then collected at the CU to derive the global Gram matrix $\mathbf{G}=\sum_{c=1}^{C}\mathbf{G}_{c}$.

Under the so-called favorable propagation in massive MIMO systems [34], the computation of $\tau$ can be further simplified. In such cases, we have

\frac{1}{B}\mathbf{h}_{i}^{\rm H}\mathbf{h}_{j}\to 0,\;i\neq j,\;i,j\in[U],\quad B\to\infty, (23)

where $\mathbf{h}_{i}$ denotes the $i$-th column vector of $\mathbf{H}$. This equation indicates that, for the MIMO uplink with a large number of BS antennas, the Gram matrix $\mathbf{G}$ is diagonally dominant and can be approximated by its diagonal counterpart $\mathbf{D}$ given as

\mathbf{D}={\rm diag}(\mathbf{G})={\rm diag}\left(\sum_{c=1}^{C}\mathbf{G}_{c}\right)=\sum_{c=1}^{C}\mathbf{D}_{c}, (24)

where 𝐃cU×U\mathbf{D}_{c}\in\mathbb{R}^{U\times U} is the diagonal counterpart of the local Gram matrix 𝐆c\mathbf{G}_{c} and can be calculated using the column vector 𝐡u,c,u[U]\mathbf{h}_{u,c},u\in[U] of 𝐇c\mathbf{H}_{c}:

(𝐃c)ij={𝐡u,c2, when i=j=u0, otherwise .(\mathbf{D}_{c})_{ij}=\left\{\begin{array}[]{ll}\|\mathbf{h}_{u,c}\|^{2},&\text{ when }i=j=u\\ 0,&\text{ otherwise }\end{array}\right.. (25)

Finally, the learning rate τ\tau in (21) can be approximated as

τ1𝐃F=1i=1U(𝐃)ii2.\tau\approx\frac{1}{\|\mathbf{D}\|_{F}}=\frac{1}{\sqrt{\sum_{i=1}^{U}(\mathbf{D})_{ii}^{2}}}. (26)
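The decentralized learning-rate computation in (24)-(26) can be sketched as follows; this is a minimal NumPy illustration with arbitrary dimensions, in which each DU forms only the diagonal of its local Gram matrix at O(B_cU) cost and the CU accumulates the diagonals. The exact rule (21) is computed alongside for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
B, U, C = 128, 8, 8
B_c = B // C
# Local channel matrices H_c at each DU (IID Rayleigh entries with variance 1/B).
H_local = [(rng.standard_normal((B_c, U)) + 1j * rng.standard_normal((B_c, U)))
           / np.sqrt(2 * B) for _ in range(C)]

# Each DU c computes only the diagonal of its local Gram matrix, eq. (25):
# (D_c)_uu = ||h_{u,c}||^2, avoiding the full O(B_c U^2) product in (22).
D_local = [np.sum(np.abs(Hc) ** 2, axis=0) for Hc in H_local]  # length-U real vectors

# The CU accumulates the diagonals, eq. (24), and sets the learning rate, eq. (26).
D = sum(D_local)
tau_approx = 1.0 / np.sqrt(np.sum(D ** 2))

# For comparison: the exact rule (21) built from the full global Gram matrix.
H = np.vstack(H_local)
G = H.conj().T @ H
tau_exact = 1.0 / np.linalg.norm(G, "fro")
```

Since ‖D‖_F ≤ ‖G‖_F always holds, the approximation never understates the learning rate, and under favorable propagation (large B) the two values are close.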
Remark 2

With the approximation 𝐆𝐃\mathbf{G}\approx\mathbf{D}, the 𝒪(BcU2)\mathcal{O}(B_{c}U^{2}) complexity of calculating (22) at each DU is reduced to 𝒪(BcU)\mathcal{O}(B_{c}U) for calculating (25). Meanwhile, only the UU real-valued diagonal elements of 𝐃c\mathbf{D}_{c} need to be exchanged and accumulated, thereby alleviating the high interconnection bandwidth otherwise required to convey the entire Gram matrix.

III-B2 MCMC Sampling Stage

After the GD procedure has sufficiently refined the estimate, the candidate sample is generated at the CU by adding a random perturbation to 𝐳Ng{\mathbf{z}}_{N_{\rm g}} and mapping the perturbed estimate onto the QAM lattice using (5). The random Gaussian vector 𝐝\mathbf{d} in (5) is given by

𝐝=γ𝐌c𝐰,\mathbf{d}=\gamma\mathbf{M}_{\rm c}\mathbf{w}, (27)

where γ\gamma is the magnitude of the random perturbation, i.e., step size of the random walk, 𝐌cU×U\mathbf{M}_{\rm c}\in\mathbb{C}^{U\times U} is the Cholesky factor specifying the covariance of the random perturbation as 𝐌c𝐌cH\mathbf{M}_{\rm c}\mathbf{M}_{\rm c}^{\rm H}, and 𝐰\mathbf{w} follows 𝒞𝒩(𝟎,𝐈)\mathcal{CN}(\mathbf{0},\mathbf{I}). The step size γ\gamma is selected to enable sufficient ability to escape local minima while maintaining a reasonable acceptance probability [35]. Moreover, according to MCMC literature [36, 37], 𝐌c\mathbf{M}_{\rm c} can be set to specify a covariance proportional to the inverse of the Hessian matrix of the objective function. However, this setting is computationally expensive. Experimental observations indicate that using an identity matrix 𝐈\mathbf{I} for 𝐌c\mathbf{M}_{\rm c} causes almost no performance degradation compared to the former setting, while substantially reducing the computational cost.
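For concreteness, the candidate generation via (5) and (27) with 𝐌_c = 𝐈 can be sketched as below. The per-coordinate nearest-neighbor quantization is our assumed form of the QAM mapping in (5), and the normalized 16-QAM levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
U, gamma = 8, 0.05
# Per-dimension 16-QAM amplitude levels, normalized to unit average symbol power.
pam = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(10)

def nearest_qam(v):
    # Assumed form of the mapping in (5): quantize real and imaginary parts
    # of each coordinate to the nearest PAM level.
    re = pam[np.argmin(np.abs(v.real[:, None] - pam[None, :]), axis=1)]
    im = pam[np.argmin(np.abs(v.imag[:, None] - pam[None, :]), axis=1)]
    return re + 1j * im

z = rng.standard_normal(U) + 1j * rng.standard_normal(U)   # stands in for z_{Ng}
w = (rng.standard_normal(U) + 1j * rng.standard_normal(U)) / np.sqrt(2)  # w ~ CN(0, I)
d = gamma * w                                              # eq. (27) with M_c = I
x_cand = nearest_qam(z + d)                                # candidate on the QAM lattice
```

The step size γ controls how far the perturbed point can stray from the gradient-refined estimate before being snapped back onto the constellation.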

Subsequently, the CU performs the MH acceptance test, where the candidate sample 𝐱[cand]\mathbf{x}^{[\rm cand]} is accepted with a probability computed according to (6). This probability can be alternatively expressed as

α\displaystyle\alpha =min{1,exp(2f(𝐱t1)2f(𝐱[cand]))}.\displaystyle=\min\left\{1,\exp\left(2f(\mathbf{x}_{t-1})-2f(\mathbf{x}^{[\mathrm{cand}]})\right)\right\}. (28)

Evaluation of the global objective function f()f(\cdot) requires knowledge of 𝐇\mathbf{H} and 𝐲\mathbf{y}, which are not directly accessible to the CU. Instead, we evaluate the local objective functions at the DUs and then aggregate them at the CU. Specifically, for the acceptance test of the tt-th sampling iteration, the CU first broadcasts the candidate sample 𝐱[cand]\mathbf{x}^{\rm[cand]} to all the DUs. Then, each DU independently evaluates the local objective function in (9) and uploads the value fc(𝐱[cand])f_{c}(\mathbf{x}^{\rm[cand]}) to the CU for aggregation to derive

f(𝐱[cand])=c=1Cfc(𝐱[cand]).f(\mathbf{x}^{\rm[cand]})=\sum_{c=1}^{C}f_{c}(\mathbf{x}^{\rm[cand]}). (29)

This global objective function value is subsequently used to compute the acceptance probability α\alpha in (28). The value f(𝐱[cand])f(\mathbf{x}^{\rm[cand]}) is then stored or discarded according to whether the candidate sample is accepted or rejected:

f(𝐱t)={f(𝐱[cand]), when 𝐱[cand] is acceptedf(𝐱t1), otherwise .f(\mathbf{x}_{t})=\left\{\begin{array}[]{ll}f(\mathbf{x}^{\rm[cand]}),&\text{ when }\mathbf{x}^{\rm[cand]}\text{ is accepted}\\ f(\mathbf{x}_{t-1}),&\text{ otherwise }\end{array}\right.. (30)

Finally, these global objective function values {f(𝐱t)}t=0S\{f(\mathbf{x}_{t})\}_{t=0}^{S} for the drawn samples are used for generating the final estimate.
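The decentralized acceptance test of (28)-(30) can be sketched as follows; a minimal NumPy illustration on a toy noiseless system, assuming the local objective (9) has the least-squares form f_c(x) = ½‖y_c − H_c x‖² and using an illustrative QPSK alphabet.

```python
import numpy as np

rng = np.random.default_rng(2)
B, U, C = 32, 4, 4
B_c = B // C
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # illustrative alphabet

H = [rng.standard_normal((B_c, U)) + 1j * rng.standard_normal((B_c, U)) for _ in range(C)]
x_true = rng.choice(qpsk, size=U)
y = [Hc @ x_true for Hc in H]              # noiseless received signal at each DU

def local_obj(c, x):
    # Assumed form of (9): local objective f_c(x) = 0.5 * ||y_c - H_c x||^2 at DU c.
    return 0.5 * np.linalg.norm(y[c] - H[c] @ x) ** 2

def accept_prob(f_prev, f_cand):
    # MH acceptance probability, eq. (28). Clipping the exponent at 0 is
    # equivalent to min{1, e^d} and avoids overflow when f_cand < f_prev.
    return float(np.exp(min(0.0, 2.0 * f_prev - 2.0 * f_cand)))

x_prev = rng.choice(qpsk, size=U)          # current Markov-chain state
x_cand = x_true                            # candidate broadcast by the CU

# Each DU uploads its local value; the CU aggregates them, eq. (29).
f_prev = sum(local_obj(c, x_prev) for c in range(C))
f_cand = sum(local_obj(c, x_cand) for c in range(C))

alpha = accept_prob(f_prev, f_cand)
nu = rng.uniform()
x_t, f_t = (x_cand, f_cand) if alpha >= nu else (x_prev, f_prev)  # eq. (30)
```

Because the candidate here attains a lower objective value than the previous state, the acceptance probability is exactly one and the candidate is always accepted.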

Input: 𝐇c\mathbf{H}_{c}, 𝐲c\mathbf{y}_{c}, c[C]c\in[C], mini-batch size mm, momentum factor ρk\rho_{k}, random walk step size γ\gamma and covariance 𝐌c𝐌cH\mathbf{M}_{\rm c}\mathbf{M}_{\rm c}^{\rm H}, number of NAG iterations NgN_{\rm g}, number of sampling iterations SS.
Preprocessing:
Each DU c[C]c\in[C]:
  Calculate 𝐃c\mathbf{D}_{c} via (25) and upload its diagonal to the CU.
CU:
  Initial sample 𝐱0\mathbf{x}_{0} and f(𝐱0)f(\mathbf{x}_{0}). Calculate the learning rate τ\tau via
  (24) and (26).
Core: for t=1t=1 to SS do
       Mini-batch GD stage:
       // Perform NgN_{\rm g} NAG iterations with 𝐳0=𝐱t1{\mathbf{z}}_{0}=\mathbf{x}_{t-1}.
       for k=1k=1 to NgN_{\rm g} do
             CU:
              Add momentum to derive 𝒑k\boldsymbol{p}_{k} via (16);
              Randomly select a mini-batch \mathcal{I} and broadcast 𝒑k\boldsymbol{p}_{k} to the
              DUs within \mathcal{I}.
             Each DU cc\in\mathcal{I}:
              Compute local gradients fc(𝒑k)\nabla f_{c}(\boldsymbol{p}_{k}) via (10) and upload it
              to the CU.
             CU:
              Aggregate fc(𝒑k)\nabla f_{c}(\boldsymbol{p}_{k}) to update 𝐳k\mathbf{z}_{k} and Δ𝐳k\Delta\mathbf{z}_{k} via (17) and
              (18).
      
      MCMC sampling stage:
             CU:
              Derive 𝐱[cand]\mathbf{x}^{[\rm cand]} via (5) and (27);
              Broadcast 𝐱[cand]\mathbf{x}^{[\rm cand]} to all the DUs.
             Each DU c[C]c\in[C]:
              Calculate fc(𝐱[cand])f_{c}(\mathbf{x}^{[\rm cand]}) via (9) and upload it to the CU.
             CU:
              Aggregate fc(𝐱[cand])f_{c}(\mathbf{x}^{[\rm cand]}) to derive f(𝐱[cand])f(\mathbf{x}^{[\rm cand]}) via (29);
              Calculate α\alpha via (28) and ν𝒰(0,1)\nu\sim\mathcal{U}(0,1);
              If αν\alpha\geq\nu, set 𝐱t=𝐱[cand]\mathbf{x}_{t}=\mathbf{x}^{[\rm cand]}; otherwise, set 𝐱t=𝐱t1\mathbf{x}_{t}=\mathbf{x}_{t-1};
              If αν\alpha\geq\nu, set f(𝐱t)=f(𝐱[cand])f(\mathbf{x}_{t})=f(\mathbf{x}^{[\rm cand]}); otherwise, set
            f(𝐱t)=f(𝐱t1)f(\mathbf{x}_{t})=f(\mathbf{x}_{t-1}).
      
Output: 𝐱^=argmin𝐱𝒳f(𝐱)\hat{\mathbf{x}}=\underset{\mathbf{x}\in\mathcal{X}}{\arg\min}\;f(\mathbf{x}), where 𝒳={𝐱t}t=0S\mathcal{X}=\{\mathbf{x}_{t}\}_{t=0}^{S}.
Algorithm 1 Mini-NAG-MCMC

The above two stages are alternately executed SS times for the Markov chain to converge to the steady state and to draw a sufficient number of samples that follow the posterior distribution, forming the sample list 𝒳\mathcal{X} for statistical inference at the CU. Moreover, parallel samplers can be deployed to reduce sampling latency and correlation. For hard decisions, the final estimate 𝐱^\hat{\mathbf{x}} is obtained by selecting the sample from 𝒳\mathcal{X} that minimizes the objective function f()f(\cdot). Details of the Mini-NAG-MCMC algorithm are summarized in Algorithm 1. In the Appendix, we provide a proof of convergence of the proposed algorithm in a simplified setting.

Remark 3

It is worth mentioning that the mini-batch gradient-based MCMC method is also applicable to the daisy-chain topology. The difference lies in the exchange of information among processing units. Without loss of generality, we assume DU CC is within the selected mini-batch in both the star and daisy-chain topologies. For the daisy-chain topology, we assume the DUs are connected in the same order as shown in Fig. 1. Taking the information upload stage as an example, DU CC in the star topology directly uploads its local gradient to the CU, as shown in Fig. 2. By contrast, DU CC in the daisy-chain topology must accumulate its own gradient with those from the preceding DUs within the mini-batch to derive f()cfc()\nabla f_{\mathcal{I}}(\cdot)\propto\sum_{c\in\mathcal{I}}\nabla f_{c}(\cdot) and then upload the result to the CU. This sequential information exchange results in increased data transfer latency.

Remark 4

Note that soft outputs can also be generated based on the sample list, as shown in [20, 21, 27]. The detailed derivations are beyond the scope of this paper and thus omitted.

Remark 5

Although the use of multiple computing units in decentralized detection inevitably results in additional energy consumption, the proposed decentralized detector allows for energy-efficient implementation by significantly reducing the computation of each DU. This enables DUs to operate in a more energy-efficient region with a lower computational density compared to the CU in the centralized scheme. Moreover, the mini-batch-based gradient updating rule can further increase energy efficiency by involving only a subset of DUs in gradient computations per iteration while keeping others inactive.

IV Complexity Analysis

In this section, we analyze the complexity of the proposed Mini-NAG-MCMC from two aspects: computational complexity and interconnection bandwidth.

TABLE I: Computational Complexity of Centralized and Decentralized Detection Algorithms
Decentralized algorithms (complexity at each DU / at the CU):
  Mini-NAG-MCMC: 𝒪(BcU(NgS+M))\mathcal{O}\big{(}B_{c}U(N_{\rm g}S+\sqrt{M})\big{)} / 𝒪(SU(Ng+M))\mathcal{O}\big{(}SU(N_{\rm g}+\sqrt{M})\big{)}
  DeADMM (BcUB_{c}\geq U): 𝒪(TU2+U3+BcU2)\mathcal{O}\big{(}TU^{2}+U^{3}+B_{c}U^{2}\big{)} / 𝒪(TU)\mathcal{O}(TU)
  DeADMM (Bc<UB_{c}<U): 𝒪(T(Bc2+BcU)+Bc3)\mathcal{O}\big{(}T(B_{c}^{2}+B_{c}U)+B_{c}^{3}\big{)} / 𝒪(TU)\mathcal{O}(TU)
  DeLAMA: 𝒪(T(U2+MU)+BcU2)\mathcal{O}\big{(}T(U^{2}+\sqrt{M}U)+B_{c}U^{2}\big{)} / 𝒪(CU)\mathcal{O}(CU)
  DeEP: 𝒪(TU3+BcU2)\mathcal{O}(TU^{3}+B_{c}U^{2}) / 𝒪(TU(M+C))\mathcal{O}\big{(}TU(\sqrt{M}+C)\big{)}
  DeNewton: 𝒪(TBcU+BcU2)\mathcal{O}(TB_{c}U+B_{c}U^{2}) / 𝒪(TU2+U3)\mathcal{O}(TU^{2}+U^{3})
Centralized algorithms (total complexity):
  LMMSE: 𝒪(U3+BU2)\mathcal{O}(U^{3}+BU^{2})
  KK-best: 𝒪(BU2+KMU2)\mathcal{O}(BU^{2}+K\sqrt{M}U^{2})
  NAG-MCMC: 𝒪(S(NgU2+MU+B)+BU2+MBU)\mathcal{O}\big{(}S(N_{\rm g}U^{2}+\sqrt{M}U+B)+BU^{2}+\sqrt{M}BU\big{)}
  • Note: TT is the number of iterations of the iterative detection schemes; KK is the list size of the KK-best detector; 𝐌c=𝐈\mathbf{M}_{\rm c}=\mathbf{I} is used for NAG-MCMC and Mini-NAG-MCMC.

IV-A Computational Complexity

Table I presents the computational complexity of Mini-NAG-MCMC compared with various centralized and decentralized detection algorithms. The decentralized baselines include the decentralized ADMM (DeADMM) [6], LAMA (DeLAMA) [12] (we consider the fully decentralized version of the LAMA detector proposed in [12]), EP (DeEP) [14], and Newton (DeNewton) [9] detectors. Centralized LMMSE [38], KK-best sphere decoding [40], and NAG-MCMC [27] algorithms are also compared. For the decentralized schemes, operations at the CU and at each DU are counted separately. We present the count for only a single DU to reflect the parallel processing of the DBP architecture. We remark that although this count is lower than the total computational cost across all local processing units, it reflects the computation latency of the algorithm.

In the comparison, the complexity of a U×UU\times U matrix inversion is 𝒪(U3)\mathcal{O}(U^{3}), where 𝒪()\mathcal{O}(\cdot) denotes the standard asymptotic notation [39, Ch. 3.1] to indicate the order of complexity. The complexity of Mini-NAG-MCMC is dominated by gradient computations and objective function evaluation, costing 𝒪(NgSBcU)\mathcal{O}(N_{\rm g}SB_{c}U) and 𝒪(MBcU)\mathcal{O}(\sqrt{M}B_{c}U) at each DU, respectively. The other operations in the MCMC sampling stage, such as the random walk, QAM mapping, and acceptance test, incur a modest cost of 𝒪(SU(Ng+M))\mathcal{O}\big{(}SU(N_{\rm g}+\sqrt{M})\big{)} multiplications at the CU. The comparison in Table I shows that Mini-NAG-MCMC is highly efficient due to avoiding high-complexity matrix multiplications and inversions.

Table I also reveals that the proposed Mini-NAG-MCMC effectively relieves the computation burden at the CU compared to the centralized algorithms by shifting the computationally intensive gradient computations to the DUs. As a result, the computational complexity at each processing unit is independent of the number of BS antennas BB, which is beneficial for reducing computation latency. Furthermore, the proposed Mini-NAG-MCMC has the advantage that only a portion of the DUs is required to perform computations in each mini-batch GD iteration of the algorithm. Thus, the total computational cost is substantially reduced compared with other decentralized schemes.

Note that the number of iterations also has a significant impact on the complexity of the proposed method and other iterative detection schemes in Table I. However, it is practically difficult to provide an analytical estimate of the number of iterations required to reach a certain performance level. Hence, we resort to empirically evaluating the computational complexity of the detectors listed in Table I and meanwhile taking into account the achieved performance. Detailed results are shown in Sec. V.

IV-B Interconnection Bandwidth

The interconnection bandwidth is measured by the number of bits transferred from and to the CU within a symbol period. Denote by ω\omega the bit-width of each real value (real/imaginary parts of a complex scalar number) in the implementation. Note that the representation of each QAM symbol in the derived samples only requires log2M\log_{2}M bits, which can be significantly smaller than ω\omega. For centralized schemes, the entities required to be uploaded to the CU include the full channel matrix 𝐇\mathbf{H} and received signal 𝐲\mathbf{y}, resulting in the bandwidth

Ccentralized=2(BU+B)ω.C_{\text{centralized}}=2(BU+B)\omega. (31)

For the star topology-based Mini-NAG-MCMC, the CU receives UCUC and 2NgSUm+(S+1)C2N_{\rm g}SUm+(S+1)C real values during preprocessing and sampling iterations, respectively. The CU also broadcasts 2NgSUm2N_{\rm g}SUm continuous-valued real numbers and (S+1)UC(S+1)UC QAM symbols during the sampling. For the daisy-chain topology-based Mini-NAG-MCMC, the amount of data to be transferred is identical to that in the star topology. However, the CU communicates with only one directly connected DU, thereby significantly reducing the interconnection cost. Hence, the interconnection bandwidths of Mini-NAG-MCMC with the star and daisy-chain topologies are given by

CMinistar=\displaystyle C_{\text{Mini}}^{\text{star}}= 4NgSUωm+(S+1)Ulog2MC\displaystyle 4N_{\rm g}SU\omega\cdot m+(S+1)U\log_{2}M\cdot C
+(S+1+U)ωC,\displaystyle+(S+1+U)\omega\cdot C, (32)
CMinichain=\displaystyle C_{\text{Mini}}^{\text{chain}}= 4NgSUω+(S+1)Ulog2M+(S+1+U)ω.\displaystyle 4N_{\rm g}SU\omega+(S+1)U\log_{2}M+(S+1+U)\omega. (33)

The results of both topologies do not depend on the BS antenna count BB, a notable advantage over centralized schemes.
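The bandwidth expressions (31)-(33) are straightforward to evaluate numerically. The sketch below uses the settings of Fig. 9 (B = 256, U = 8, C = 8, m = 2, S = 4, ω = 16, 16-QAM) together with N_g = 4 as an assumed iteration count, and also verifies that the decentralized costs do not depend on B.

```python
import math

def bw_centralized(B, U, w):
    # Eq. (31): full H (B*U complex) and y (B complex) uploaded, 2 reals each.
    return 2 * (B * U + B) * w

def bw_mini_star(U, C, m, S, Ng, w, M):
    # Eq. (32): star topology. Note the expression is independent of B.
    return (4 * Ng * S * U * w * m
            + (S + 1) * U * math.log2(M) * C
            + (S + 1 + U) * w * C)

def bw_mini_chain(U, C, m, S, Ng, w, M):
    # Eq. (33): daisy-chain topology, counting only the single link to the CU.
    return (4 * Ng * S * U * w
            + (S + 1) * U * math.log2(M)
            + (S + 1 + U) * w)

U, C, m, S, Ng, w, M = 8, 8, 2, 4, 4, 16, 16
cent = bw_centralized(256, U, w)
star = bw_mini_star(U, C, m, S, Ng, w, M)
chain = bw_mini_chain(U, C, m, S, Ng, w, M)
```

Under these settings the daisy-chain cost is the lowest, the star cost sits in between, and the centralized cost, which grows linearly in B, is the largest.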

We note that if the interconnection cost on all links in the daisy-chain topology is considered, the total interconnection bandwidth would be comparable to that of the star topology. However, the daisy-chain architecture can uniformly allocate the bandwidth demand among all links, resulting in a significant reduction in the individual requirement on each link [7]. Nevertheless, the price to pay is an extended processing latency resulting from the sequential processing architecture.

V Simulation Results

In this section, we present numerical results to evaluate the proposed Mini-NAG-MCMC in terms of error rate performance, computational complexity, and interconnection cost.

V-A Simulation Setup

Both IID Rayleigh fading and realistic 3GPP channel models are considered in the simulation. The Rayleigh fading MIMO channel has elements that follow the Gaussian distribution 𝒞𝒩(0,1/B)\mathcal{CN}(0,1/B). For realistic channels, the 3D channel model outlined in the 3GPP 36.873 technical report [41] is utilized. This model has been widely adopted for its accurate representation of real-world channel environments. We consider both uncoded and coded MIMO systems. For uncoded systems, we evaluate the bit error rate (BER) performance until 5×1075\times 10^{7} bits are transmitted or the number of bit errors exceeds 1000. An independent channel realization is drawn for each symbol transmission. For coded systems, we employ a rate-3/43/4 low-density parity-check (LDPC) code and the OFDM scheme, wherein each user’s coded bits are converted to QAM symbols and mapped onto 216 subcarriers for data transmission. The detector calculates soft outputs using the max-log approximation [20]. Belief propagation decoding is then utilized with a maximum of 10 iterations. The block error rate (BLER) is evaluated until 10510^{5} coded blocks are transmitted with independent channel realizations.

We compare our method with the centralized LMMSE and NAG-MCMC [27] schemes and representative decentralized detector baselines as listed in Table I. The optimal ML [3] and near-optimal KK-best [40] detectors are used as performance benchmarks. For parameter settings of NAG-MCMC and the proposed Mini-NAG-MCMC, the number of NAG iterations is set to Ng=4N_{\rm g}=4. The choice of momentum factor ρk\rho_{k} follows the recommended setting in [28]:

μ0=1,μk=1+1+4μk122,ρk=μk11μk,k[Ng].\mu_{0}=1,\;\mu_{k}=\frac{1+\sqrt{1+4\mu_{k-1}^{2}}}{2},\;\rho_{k}=\frac{\mu_{k-1}-1}{\mu_{k}},\;k\in[N_{\rm g}]. (34)
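The schedule in (34) is simple to generate; the helper below is a minimal sketch returning ρ_1, …, ρ_{N_g} from the stated recursion.

```python
import math

def momentum_schedule(n_g):
    """Momentum factors rho_k, k = 1..n_g, from the recursion in eq. (34)."""
    mu_prev, rhos = 1.0, []
    for _ in range(n_g):
        mu = (1.0 + math.sqrt(1.0 + 4.0 * mu_prev ** 2)) / 2.0
        rhos.append((mu_prev - 1.0) / mu)
        mu_prev = mu
    return rhos

rhos = momentum_schedule(4)
```

With N_g = 4 this yields ρ_1 = 0 followed by monotonically increasing factors below 1, so the momentum ramps up over the NAG iterations.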

The random walk step size is empirically set to γ=0.05\gamma=0.05. To ensure computational efficiency, we set 𝐌c=𝐈\mathbf{M}_{\rm c}=\mathbf{I} in (27). The initial sample 𝐱0\mathbf{x}_{0} for both NAG-MCMC and Mini-NAG-MCMC is randomly selected from 𝒜U×1\mathcal{A}^{U\times 1}.

V-B Convergence Analysis

Refer to caption
(a) Under different mini-batch sizes mm (C=32C=32)
Refer to caption
(b) Under different number of clusters CC (m=C/2m=C/2)
Figure 3: Convergence performance for a MIMO system configuration of B=128B=128 and U=8U=8 or 3232 with 16-QAM modulation and SNR=5dB{\text{SNR}=\text{5}\;\text{dB}} under IID Rayleigh fading channels.

Fig. 3 presents the convergence performance of Mini-NAG-MCMC in terms of the BER versus the number of sampling iterations. (A rigorous theoretical convergence analysis is challenging for gradient-based MCMC methods in discrete search spaces [22], [26]; we therefore evaluate the convergence property empirically via numerical simulations to provide insights.) The MIMO system is configured with B=128B=128 and U=8U=8 or 3232, employs 16-QAM modulation, and operates under IID Rayleigh fading channels.

Fig. 3(a) shows the convergence behavior under different choices of mini-batch sizes. We also provide the convergence performance of the centralized NAG-MCMC algorithm as a reference. The results indicate that Mini-NAG-MCMC converges faster as the mini-batch size mm increases in both U=8U=8 and U=32U=32 cases. This result is in accordance with expectation because the variance of the mini-batch gradient approximation in (14) is a decreasing function of mm [42]. Meanwhile, the convergence speed is virtually unaffected by the number of users, revealing the robustness of Mini-NAG-MCMC to varying user loads.

Remarkably, the Mini-NAG-MCMC with m=16m=16 achieves virtually the same convergence performance as the centralized NAG-MCMC that uses full gradients. This phenomenon can be attributed to the accuracy of the mini-batch approximation when mm is sufficiently large (e.g., m=C/2m=C/2 in this simulation) and to the synergistic effect of stochastic gradients and MCMC in escaping local minima. Consequently, the proposed method achieves performance equivalent to the centralized counterpart that computes full gradients, while offering the benefit of reduced computational cost since only half of the DUs are utilized for computations per iteration.

Fig. 3(b) compares the convergence performance under different numbers of clusters, with the mini-batch size fixed at m=C/2m=C/2. The figure reveals that the convergence speed shows minimal variation with respect to the number of clusters, which is a notable advantage of the proposed Mini-NAG-MCMC. Hence, the proposed method can accommodate different levels of decentralization, allowing for trade-offs between the computational complexity at each DU and the number of interconnection interfaces (as well as bandwidth) by flexibly adjusting the system parameter CC according to different requirements.

V-C Uncoded Detection Performance

In this subsection, we investigate the proposed method under Rayleigh fading MIMO channels in uncoded systems. Fig. 4 provides the BER performance and computational complexity for a massive MIMO system with B=128B=128 BS antennas, U=32U=32 users, C=32C=32 clusters, and 16-QAM modulation. The mini-batch size for Mini-NAG-MCMC is m=16m=16. We compare the proposed method with centralized LMMSE [38], KK-best [40], and NAG-MCMC [27] detectors and two representative decentralized baselines, namely the DeADMM [6] and DeEP [14] detectors. (For the implementation of the DeADMM scheme, we use the code released at https://github.com/VIP-Group/DBP.) The list size for the KK-best detector is set to K=4K=4, serving as a near-optimal baseline. The number of iterations for DeADMM and DeEP is set to T=50T=50 and T=5T=5, respectively, as required for their convergence in the considered system setup.

Refer to caption
(a) BER performance
Refer to caption
(b) Computational complexity
Figure 4: BER performance and computational complexity for a MIMO system configuration of B=128B=128, U=32U=32, and C=32C=32 with 16-QAM modulation under Rayleigh fading channels. The bar chart counts the number of real-valued multiplications at the CU and a single DU for decentralized schemes.
Refer to caption
(a) BER performance
Refer to caption
(b) Computational complexity
Figure 5: BER performance and computational complexity for a MIMO system configuration of B=128B=128, U=48U=48, and C=32C=32 with 16-QAM modulation under Rayleigh fading channels.

Fig. 4(a) shows that the LMMSE and DeADMM detectors have a significant performance gap to the ML detector. In contrast, Mini-NAG-MCMC achieves a noticeable performance gain over LMMSE and DeADMM when the number of sampling iterations is S=6S=6 and further approaches the ML performance with a slight increase in SS. Furthermore, Mini-NAG-MCMC obtains comparable BER performance to NAG-MCMC when both methods are equipped with the same SS.

Fig. 4(b) illustrates the complexity bar chart of different detectors in terms of the number of real-valued multiplications required per detection. The number of multiplications at the CU for Mini-NAG-MCMC is much lower than that of the centralized LMMSE, KK-best, and NAG-MCMC detectors. We remark that although the total computational cost of Mini-NAG-MCMC is comparable to that of the centralized LMMSE and NAG-MCMC, the proposed decentralized algorithm offers two advantages: first, the complexity at each DU is notably low, effectively leading to a reduced computation latency; second, the distribution of complexity among many DUs allows the use of inexpensive computing hardware within each processing unit, in contrast to the centralized algorithms that place the entire complexity burden on the CU. Furthermore, the computational cost of Mini-NAG-MCMC is similar to that of DeADMM and substantially lower than that of DeEP.

Considering its near-ML performance and low complexity at each unit, as shown in Fig. 4, our proposed Mini-NAG-MCMC demonstrates high promise in achieving an attractive trade-off between performance and complexity.

Fig. 5 shows the results when the number of users is increased to U=48U=48, with the other system settings remaining unchanged compared to Fig. 4. Fig. 5(a) indicates that the performance of LMMSE, DeADMM, and DeEP further degrades relative to the near-optimal KK-best detector as the user number increases. However, Mini-NAG-MCMC demonstrates significant performance improvement over LMMSE when S=6S=6 and achieves comparable performance to the KK-best when S=12S=12, consistent with the results presented in Fig. 4(a). This finding reflects that the proposed Mini-NAG-MCMC and the associated learning rate approximation are robust to the increased user load. Moreover, upon comparing Fig. 5(b) to Fig. 4(b), it is observed that the complexity increase of Mini-NAG-MCMC is modest compared with the LMMSE, KK-best, and DeEP detectors. Specifically, the complexity increase is 50% for Mini-NAG-MCMC, whereas the corresponding values for LMMSE, KK-best, and DeEP are 165%, 125%, and 230%, respectively, demonstrating the outstanding scalability of Mini-NAG-MCMC as revealed in Sec. IV-A.

Refer to caption
(a) 64-QAM
Refer to caption
(b) 256-QAM
Figure 6: BER performance for a MIMO system configuration of B=128B=128, U=48U=48, and C=32C=32 with 64-QAM/256-QAM modulation under Rayleigh fading channels.

Fig. 6 presents the BER performance under high-order 64-QAM/256-QAM modulation schemes. The other system configurations are the same as in Fig. 5. In this experimental setup, the KK-best detector is configured with K=8K=8 to provide a near-ML baseline. The DeADMM and DeEP detectors execute for T=50T=50 and T=10T=10 iterations, respectively. The proposed Mini-NAG-MCMC is set up with S=24S=24 iterations and a mini-batch size of m=16m=16. The figure illustrates that the proposed method achieves a performance gain of more than 1.5 dB when compared with the LMMSE detector and the decentralized baselines, and approaches the performance of the near-optimal KK-best. These results demonstrate that the proposed decentralized detector scales well to high-order modulations.

Refer to caption
(a) 16-QAM
Refer to caption
(b) 64-QAM
Figure 7: BLER performance for a MIMO system configuration of B=128B=128, U=8U=8, and C=32C=32 with 16-QAM/64-QAM modulation and a rate-3/43/4 LDPC code under 3GPP MIMO channels.

V-D Coded Detection Performance

We further investigate the proposed method under the more realistic 3GPP 3D MIMO channel model [41] with coded transmissions. We examine the urban macrocell non-line-of-sight scenario. For the simulation of this scenario, the QuaDRiGa simulator [43] is employed. In the considered scenario, the BS is equipped with B=128B=128 antennas, which are uniformly partitioned into C=32C=32 clusters, to serve U=8U=8 users. The BS adopts a uniform planar antenna array with half-wavelength antenna spacing. The users are evenly dropped within a 120120^{\circ} cell sector centered around the BS, spanning a radius ranging from 10 m to 500 m. The carrier frequency is set as 2.53 GHz, and the channel spans over 20 MHz bandwidth. To cope with the complicated channel environments, the KK-best detector is set up with K=8K=8. The number of iterations for DeADMM and DeEP is selected as T=50T=50 and T=10T=10, respectively. The number of iterations and mini-batch size for Mini-NAG-MCMC are S=64S=64 and m=16m=16, respectively.

As shown in Fig. 7, Mini-NAG-MCMC demonstrates stable performance, exhibiting significant robustness to the realistic channel model. Specifically, Mini-NAG-MCMC outperforms LMMSE and other decentralized detectors, achieving an approximate 1 dB gain under both 16-QAM and 64-QAM modulation. Furthermore, the gap between Mini-NAG-MCMC and the near-optimal performance achieved under centralized schemes by the KK-best detector is a mere 0.4 dB. Considering the distribution of complexity across DUs, achieving near-optimal performance in this context is indeed promising.

Refer to caption
Figure 8: Performance gap versus interconnection bandwidth of Mini-NAG-MCMC (m=2m=2 and S[2,4,6,8]S\in[2,4,6,8]) for a MIMO system configuration of B=256B=256, U=8U=8, and C=8C=8 with 16-QAM modulation under Rayleigh fading channels.
Refer to caption
Figure 9: Percentage of interconnection bandwidth of the proposed Mini-NAG-MCMC (m=2m=2 and S=4S=4) relative to the centralized algorithm as a function of the number of BS antennas BB. The MIMO system configuration is U=8U=8 and C=8C=8 with 16-QAM modulation under Rayleigh fading channels.

V-E Interconnection Bandwidth Comparison

We evaluate the interconnection cost of the proposed Mini-NAG-MCMC in this subsection. The bit-width is selected as ω=16\omega=16 in the evaluation. The BER performance versus interconnection bandwidth of Mini-NAG-MCMC with star and daisy-chain topologies for a MIMO system with B=256B=256, U=8U=8, and C=8C=8 is presented in Fig. 8. The mini-batch size for Mini-NAG-MCMC is set to m=2m=2. In the figure, the xx-axis represents the percentage of interconnection bandwidth of Mini-NAG-MCMC relative to the centralized algorithm, calculated according to (31)-(33) by setting the number of sampling iterations SS to 2, 4, 6, and 8. The yy-axis represents the performance gap between Mini-NAG-MCMC and the centralized ML algorithm in terms of the SNR required to achieve a target BER=103\text{BER}=10^{-3}. From the figure, we have the following observations:

  • Mini-NAG-MCMC achieves a flexible trade-off between performance and interconnection bandwidth.

  • The daisy-chain topology provides lower interconnection bandwidth than the star topology and is preferable in interconnection bandwidth-constrained scenarios, despite a larger processing delay.

Furthermore, Fig. 9 presents the interconnection bandwidth of Mini-NAG-MCMC compared with the centralized algorithm as a function of the number of BS antennas BB when U=8U=8 and C=8C=8. The mini-batch size and the number of sampling iterations for Mini-NAG-MCMC are m=2m=2 and S=4S=4, respectively. The figure shows that the proposed method with both star and daisy-chain topologies exhibits significantly reduced interconnection cost compared to the centralized processing as BB increases. This result confirms the scalability of the proposed method to massive MIMO systems with a large number of BS antennas.

V-F Trade-off between Performance and Complexity

Refer to caption
Figure 10: Performance-complexity trade-off for a MIMO system configuration of B=128B=128, U=32U=32, and C=4C=4 with 16-QAM modulation under Rayleigh fading channels. For decentralized detectors, the xx-axis represents the cumulative count of real-valued multiplications performed at the CU along with a single DU. The size of the circles represents the interconnection bandwidth. For the compared decentralized baselines, the interconnection bandwidth is calculated by counting the entities that need to be transferred in a similar way as the analysis in Sec. IV-B.

Fig. 10 presents the performance-complexity trade-off of various detectors for a massive MIMO system with B=128B=128, U=32U=32, C=4C=4, 16-QAM modulation, and Rayleigh fading channels. All the detectors are implemented using the star DBP topology. The figure reveals that the proposed Mini-NAG-MCMC (m=2m=2) achieves the most promising results among the considered detectors in the trade-off between BER performance, computational complexity, and interconnection bandwidth. Despite the higher interconnection bandwidth of Mini-NAG-MCMC compared to DeLAMA and DeEP, the gap can be narrowed by deploying the proposed detector using the daisy-chain topology. This result highlights the potential of the proposed method as a highly implementable and desirable solution for massive MIMO detectors under DBP architectures.

VI Conclusion

In this paper, the gradient-based MCMC method has been utilized for developing a novel decentralized detector for massive MIMO systems operating within the DBP architecture. Our approach involves decentralizing the gradient-based MCMC technique, distributing the computationally demanding gradient computations among multiple DUs for parallel processing. These DUs are coordinated by a CU that engages in MCMC sampling to derive a list of important samples for MIMO detection. This design enables a balanced allocation and efficient use of computational resources. We have also incorporated the concept of mini-batch GD in the design of the decentralized detector, leading to reduced computation and interconnection demands across DUs while preserving high performance. Through extensive evaluation, we have demonstrated the clear superiority of the proposed Mini-NAG-MCMC over existing decentralized detectors, notably in terms of BER. Furthermore, a complexity analysis solidifies the proposed method’s edge in effectively balancing performance, computational overhead, and interconnection bandwidth.

Appendix: Convergence of the Proposed Algorithm

We illustrate the proposed algorithm’s convergence in a simplified setting where the algorithm uses $\mathbf{M}_{\rm c}=\mathbf{I}$, as employed in the experiments. Specifically, at each sampling iteration, a candidate state $\mathbf{x}^{\prime}$ is generated as follows:

\mathbf{x}^{\prime}=Q\left(\mathbf{x}-\tau\nabla f_{\mathcal{I}}(\mathbf{x})+\gamma\mathbf{w}\right), \quad (35)
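The candidate draw in (35) can be sketched in a few lines. The snippet below is a minimal illustration for a hypothetical QPSK setting, assuming $Q(\cdot)$ is elementwise nearest-point quantization onto the alphabet $\mathcal{A}$ and the mini-batch gradient uses only the rows of $(\mathbf{y},\mathbf{H})$ indexed by $\mathcal{I}$; the names and parameter values are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # QPSK alphabet

def minibatch_grad(x, y, H, idx):
    """Gradient of f_I(x) = 0.5 * ||y_I - H_I x||^2 over the rows in idx."""
    H_I, y_I = H[idx], y[idx]
    return H_I.conj().T @ (H_I @ x - y_I)

def quantize(v):
    """Q(.): elementwise projection onto the discrete alphabet A."""
    return A[np.argmin(np.abs(v[:, None] - A[None, :]), axis=1)]

def candidate(x, y, H, idx, tau, gamma):
    """One draw of (35): gradient step, Gaussian perturbation, quantization."""
    w = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2)
    return quantize(x - tau * minibatch_grad(x, y, H, idx) + gamma * w)

# Toy usage: B = 16 antennas, U = 4 users, a mini-batch of 8 rows.
H = (rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))) / np.sqrt(2)
x = A[rng.integers(0, 4, size=4)]
y = H @ x + 0.1 * (rng.standard_normal(16) + 1j * rng.standard_normal(16))
idx = rng.choice(16, size=8, replace=False)
x_new = candidate(x, y, H, idx, tau=0.1, gamma=0.3)
print(x_new)
```

By construction, every entry of the candidate lies in $\mathcal{A}$, which is what makes the discrete proposal probability (38) well defined.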

where $\mathbf{x}$ denotes the previous state. By setting the global objective function as $f(\mathbf{x})\triangleq\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}$, we target the tempered posterior distribution given by

\pi(\mathbf{x}) \propto p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})^{1/T_{\rm p}} \propto p(\mathbf{x})\exp\left(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2}\right), \quad (36)

where $T_{\rm p}=1/\sigma^{2}$ is the temperature parameter [20, 44]. Assuming that the prior distribution $p(\mathbf{x})$ is equiprobable, the standard MH criterion [31] calculates the acceptance probability of $\mathbf{x}^{\prime}$ as

A(\mathbf{x}^{\prime}|\mathbf{x}) = \min\left\{1,\frac{\pi(\mathbf{x}^{\prime})q(\mathbf{x}|\mathbf{x}^{\prime})}{\pi(\mathbf{x})q(\mathbf{x}^{\prime}|\mathbf{x})}\right\} = \min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{\prime}\|^{2})\,q(\mathbf{x}|\mathbf{x}^{\prime})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2})\,q(\mathbf{x}^{\prime}|\mathbf{x})}\right\}. \quad (37)

However, the proposal probability $q(\cdot|\cdot)$ in (37), which corresponds to (35), is computationally expensive. Specifically, considering the discrete space $\mathcal{A}^{U\times 1}$, the proposal probability $q(\mathbf{x}^{\prime}|\mathbf{x})$ can be expressed as [45]

q(\mathbf{x}^{\prime}|\mathbf{x})=\frac{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right)}{Z_{\mathcal{A}}(\mathbf{x})}\prod_{u=1}^{U}\mathbb{I}_{x_{u}^{\prime}\in\mathcal{A}}, \quad (38)

where $x_{u}^{\prime}$ is the $u$-th element of $\mathbf{x}^{\prime}$, $\mathbb{I}_{x_{u}^{\prime}\in\mathcal{A}}$ denotes an indicator function that returns one if $x_{u}^{\prime}\in\mathcal{A}$ and zero otherwise, and $Z_{\mathcal{A}}(\mathbf{x})$ denotes a normalization constant given by

Z_{\mathcal{A}}(\mathbf{x})=\sum_{\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}}\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right), \quad (39)

which is generally intractable because it requires a traversal of the entire space of size $M^{U}$.
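To make this intractability concrete: even a moderate system blows up the sum in (39), while brute force remains possible only at toy scale. The snippet below is purely illustrative; the mean vector `mu` and parameter values are hypothetical stand-ins for $\mathbf{x}-\tau\nabla f_{\mathcal{I}}(\mathbf{x})$ and $\gamma$.

```python
import numpy as np
from itertools import product

# For 16-QAM (M = 16) with U = 32 users, (39) sums over M**U terms:
M, U = 16, 32
print(M ** U)  # 2**128 terms, roughly 3.4e38 -- traversal is infeasible

# Brute-force evaluation of Z_A is only possible at toy scale,
# e.g. BPSK (A = {-1, +1}) with U = 3:
A = np.array([-1.0, 1.0])
mu = np.array([0.2, -0.4, 0.9])   # hypothetical x - tau * grad
gamma = 1.0
Z = sum(np.exp(-np.linalg.norm(np.array(xp) - mu) ** 2 / gamma ** 2)
        for xp in product(A, repeat=3))
print(Z)
```

This is precisely why the proposition below justifies dropping the proposal-probability ratio rather than computing it.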

To address this challenge and simplify computation, we omit the proposal probability ratio $\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}$ in (37), noting that $\nabla f_{\mathcal{I}}(\mathbf{x})\to\mathbf{0}$ in the later iterations of the algorithm as the chain reaches the high-probability region of the state space. Under mild assumptions from [46], this strategy is supported by the following proposition.

Proposition 1

Assume the following:

  1. $\nabla f_{\mathcal{I}}(\mathbf{x})=\mathbf{0}$.

  2. The learning rate $\tau=o(1)$ for large $U$, where $o(\cdot)$ denotes the standard asymptotic notation for higher-order infinitesimals [39, Ch. 3.1].

  3. There exists a $\lambda>0$ such that $\sup_{(\mathbf{x},\mathcal{I})}\|H_{\mathcal{I}}(\mathbf{x})\boldsymbol{z}\|\leq\lambda\|\boldsymbol{z}\|$ for all $\boldsymbol{z}\in\mathbb{C}^{U\times 1}$, where $\sup_{(\mathbf{x},\mathcal{I})}$ denotes the supremum over all possible selections of $(\mathbf{x},\mathcal{I})$, and $H_{\mathcal{I}}(\mathbf{x})$ denotes the Hessian matrix on mini-batch $\mathcal{I}$.

Then, we have

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}=1+o(1)\approx 1. \quad (40)
Proof:

Defining $\mathbf{x}^{\prime}-\mathbf{x}\triangleq\boldsymbol{z}$ and using the assumption $\nabla f_{\mathcal{I}}(\mathbf{x})=\mathbf{0}$, we have

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})} = \frac{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}-\mathbf{x}^{\prime}+\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\mathbf{x}^{\prime}-\mathbf{x}+\tau\nabla f_{\mathcal{I}}(\mathbf{x})\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})} = \frac{\exp\left(-\frac{1}{\gamma^{2}}\|\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})-\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})}. \quad (41)

Note that

\|\tau\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})-\boldsymbol{z}\|^{2} = \tau^{2}\|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2}+\|\boldsymbol{z}\|^{2}-\tau\left(\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})^{\rm H}\boldsymbol{z}+\boldsymbol{z}^{\rm H}\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\right). \quad (42)

Using the Taylor series expansion $\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})=\nabla f_{\mathcal{I}}(\mathbf{x})+H_{\mathcal{I}}(\mathbf{x}_{0})(\mathbf{x}^{\prime}-\mathbf{x})=H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}$ for some $\mathbf{x}_{0}$ between $\mathbf{x}$ and $\mathbf{x}^{\prime}$, and by the assumption $\sup_{(\mathbf{x},\mathcal{I})}\|H_{\mathcal{I}}(\mathbf{x})\boldsymbol{z}\|\leq\lambda\|\boldsymbol{z}\|$, we have

\|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|^{2} = \|H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}\|^{2}\leq\lambda^{2}\|\boldsymbol{z}\|^{2}, \quad (43)

|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})^{\rm H}\boldsymbol{z}| = |\boldsymbol{z}^{\rm H}\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})| \leq \|\nabla f_{\mathcal{I}}(\mathbf{x}^{\prime})\|\|\boldsymbol{z}\| = \|H_{\mathcal{I}}(\mathbf{x}_{0})\boldsymbol{z}\|\|\boldsymbol{z}\| \leq \lambda\|\boldsymbol{z}\|^{2}, \quad (44)

where the first inequality in (44) follows from the Cauchy-Schwarz inequality. Then, by the assumption $\tau=o(1)$ for large $U$, which is approximately satisfied by the learning rate selection in (26), it follows that (42) is of the same order as $(1+\mathcal{O}(\tau))\|\boldsymbol{z}\|^{2}=(1+o(1))\|\boldsymbol{z}\|^{2}$. Therefore, the proposal probability ratio in (41) becomes

\frac{q(\mathbf{x}|\mathbf{x}^{\prime})}{q(\mathbf{x}^{\prime}|\mathbf{x})}=\frac{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x})}{\exp\left(-\frac{1}{\gamma^{2}}\|\boldsymbol{z}\|^{2}\right)Z_{\mathcal{A}}(\mathbf{x}^{\prime})}\left(1+o(1)\right)\approx 1, \quad (45)

completing the proof. ∎

After omitting the proposal probability ratio, we derive the modified acceptance probability as follows

A(\mathbf{x}^{\prime}|\mathbf{x})=\min\left\{1,\frac{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}^{\prime}\|^{2})}{\exp(-\|\mathbf{y}-\mathbf{H}\mathbf{x}\|^{2})}\right\}, \quad (46)
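The simplified acceptance rule is cheap to evaluate: it needs only the two residual norms, which the CU already computes. A minimal sketch (toy real-valued numbers, names hypothetical), written to avoid overflow by branching before exponentiating:

```python
import numpy as np

def accept_prob(x_cand, x_prev, y, H):
    """Simplified MH acceptance (46): the proposal ratio is omitted, so only
    the residual norms ||y - H x||^2 of the two states are compared."""
    d_cand = np.linalg.norm(y - H @ x_cand) ** 2
    d_prev = np.linalg.norm(y - H @ x_prev) ** 2
    if d_cand <= d_prev:                     # candidate fits at least as well
        return 1.0
    return float(np.exp(d_prev - d_cand))    # worse candidate: prob in (0, 1)

# Toy example: a candidate that lowers the residual is always accepted.
rng = np.random.default_rng(2)
H = rng.standard_normal((8, 2))
x_true = np.array([1.0, -1.0])
y = H @ x_true + 0.1 * rng.standard_normal(8)
print(accept_prob(x_true, -x_true, y, H))    # 1.0: strictly better candidate
```

Accepting worse candidates with nonzero probability is what lets the chain escape local minima while still concentrating on the high-probability region.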

which we used in (28) for our proposed algorithm. By accepting $\mathbf{x}^{\prime}$ as the new state with this probability, the Markov chain approximately satisfies detailed balance, a sufficient condition for aligning the stationary distribution of the Markov chain with the target posterior distribution $\pi$ [18, 47].

Moreover, for any two states $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}$, the transition probability of the Markov chain is given by

T(\mathbf{x}^{\prime}|\mathbf{x})=\left\{\begin{array}{ll} q(\mathbf{x}^{\prime}|\mathbf{x})A(\mathbf{x}^{\prime}|\mathbf{x}), & \text{if } \mathbf{x}^{\prime}\neq\mathbf{x},\\ q(\mathbf{x}|\mathbf{x})+\sum_{\mathbf{u}\neq\mathbf{x}}q(\mathbf{u}|\mathbf{x})\big(1-A(\mathbf{u}|\mathbf{x})\big), & \text{otherwise}, \end{array}\right. \quad (49)

where the second branch holds because the chain remains in state $\mathbf{x}$ either when the candidate is $\mathbf{x}$ itself or when the candidate is some $\mathbf{u}\neq\mathbf{x}$ that gets rejected. With this transition probability, it is easy to show that

T\big(\mathbf{x}_{t}=\mathbf{x}^{\prime}\,|\,\mathbf{x}_{t-1}=\mathbf{x}\big)>0,\;\forall\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{A}^{U\times 1}, \quad (50)

implying the chain’s irreducibility by definition [48], and

\gcd\left\{m\in\mathbb{Z}^{+}:T\big(\mathbf{x}_{t+m}=\mathbf{x}\,|\,\mathbf{x}_{t}=\mathbf{x}\big)>0\right\}=1, \quad (51)

where $\mathbb{Z}^{+}$ denotes the set of positive integers and $\gcd$ represents the greatest common divisor, verifying the chain’s aperiodicity by definition [48]. Using these properties, it can be shown that $\pi$ is the unique stationary distribution of the proposed algorithm’s underlying Markov chain [49, Corollary 1.17]. The chain thus approximately converges to the target posterior distribution $\pi$ by the convergence theorem of MCMC [49, Th. 4.9], verifying the convergence of the proposed algorithm.
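The transition-probability construction (49) can be sanity-checked numerically on a toy system. The sketch below (not part of the proof) builds the full transition matrix for BPSK with $U=2$ users using the proposal (38) and the *exact* MH ratio (37), for which detailed balance holds exactly and $\pi$ is verifiably stationary; the proposed detector replaces (37) with the proposal-ratio-free rule (46), which makes this balance only approximate, as argued in Proposition 1.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
A = np.array([-1.0, 1.0])                       # BPSK, U = 2 users
states = [np.array(s) for s in product(A, repeat=2)]
H = rng.standard_normal((4, 2))                 # toy 4 x 2 real channel
y = H @ states[3] + 0.1 * rng.standard_normal(4)
tau, gamma = 0.05, 1.0

grad = lambda x: H.T @ (H @ x - y)              # full-batch gradient of f
pi = np.array([np.exp(-np.linalg.norm(y - H @ x) ** 2) for x in states])
pi /= pi.sum()                                  # target distribution (36)

def q_row(x):
    """Proposal (38), normalized over the whole discrete state space."""
    mu = x - tau * grad(x)
    w = np.array([np.exp(-np.linalg.norm(xp - mu) ** 2 / gamma ** 2)
                  for xp in states])
    return w / w.sum()

n = len(states)
T = np.zeros((n, n))
for i, x in enumerate(states):
    q = q_row(x)
    for j, xp in enumerate(states):
        if i != j:                              # first branch of (49)
            acc = min(1.0, pi[j] * q_row(xp)[i] / (pi[i] * q[j]))
            T[i, j] = q[j] * acc
    T[i, i] = 1.0 - T[i].sum()                  # second branch of (49)

print(np.allclose(pi @ T, pi))                  # True: pi is stationary
```

Every entry of $T$ is positive, matching (50), and the positive self-transitions make the aperiodicity condition (51) immediate for this toy chain.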

References

  • [1] X. Zhou, L. Liang, J. Zhang, C.-K. Wen, and S. Jin, “Decentralized massive MIMO detection using mini-batch gradient-based MCMC,” in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Lucca, Italy, Sep. 2024, pp. 1–5.
  • [2] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
  • [3] S. Yang and L. Hanzo, “Fifty years of MIMO detection: The road to large-scale MIMOs,” IEEE Commun. Surv. Tuts., vol. 17, no. 4, pp. 1941–1988, 2015.
  • [4] E. Björnson, L. Sanguinetti, H. Wymeersch, J. Hoydis, and T. L. Marzetta, “Massive MIMO is a reality—What is next?: Five promising research directions for antenna arrays,” Digit. Signal Process., vol. 94, pp. 3–20, Nov. 2019.
  • [5] A. Puglielli et al., “Design of energy- and cost-efficient massive MIMO arrays,” Proc. IEEE, vol. 104, no. 3, pp. 586–606, Mar. 2016.
  • [6] K. Li, R. R. Sharan, Y. Chen, T. Goldstein, J. R. Cavallaro, and C. Studer, “Decentralized baseband processing for massive MU-MIMO systems,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 7, no. 4, pp. 491–507, Dec. 2017.
  • [7] J. Rodríguez Sánchez, F. Rusek, O. Edfors, M. Sarajlić, and L. Liu, “Decentralized massive MIMO processing exploring daisy-chain architecture and recursive algorithms,” IEEE Trans. Signal Process., vol. 68, pp. 687–700, Jan. 2020.
  • [8] CPRI, “Common public radio interface,” 2019. [Online]. Available: http://www.cpri.info
  • [9] A. Kulkarni, M. A. Ouameur, and D. Massicotte, “Hardware topologies for decentralized large-scale MIMO detection using Newton method,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 9, pp. 3732–3745, Sep. 2021.
  • [10] K. Li, Y. Chen, R. Sharan, T. Goldstein, J. R. Cavallaro, and C. Studer, “Decentralized data detection for massive MU-MIMO on a Xeon Phi cluster,” in Proc. Asilomar Conf. Signals, Syst. Comput., Nov. 2016, pp. 468–472.
  • [11] K. Li, O. Castañeda, C. Jeon, J. R. Cavallaro, and C. Studer, “Decentralized coordinate-descent data detection and precoding for massive MU-MIMO,” in Proc. IEEE Int. Symp. Circuits Syst., May 2019, pp. 1–5.
  • [12] C. Jeon, K. Li, J. R. Cavallaro, and C. Studer, “Decentralized equalization with feedforward architectures for massive MU-MIMO,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4418–4432, Sep. 2019.
  • [13] E. Bertilsson, O. Gustafsson, and E. G. Larsson, “A scalable architecture for massive MIMO base stations using distributed processing,” in Proc. Asilomar Conf. Signals, Syst., Comput., Nov. 2016, pp. 864–868.
  • [14] H. Wang, A. Kosasih, C.-K. Wen, S. Jin, and W. Hardjawana, “Expectation propagation detector for extra-large scale massive MIMO,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2036–2051, Mar. 2020.
  • [15] Z. Zhang, H. Li, Y. Dong, X. Wang, and X. Dai, “Decentralized signal detection via expectation propagation algorithm for uplink massive MIMO systems,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 11 233–11 240, Oct. 2020.
  • [16] Y. Dong, H. Li, C. Gong, X. Wang, and X. Dai, “An enhanced fully decentralized detector for the uplink M-MIMO system,” IEEE Trans. Veh. Technol., vol. 71, no. 12, pp. 13 030–13 042, Dec. 2022.
  • [17] Z. Zhang, Y. Dong, K. Long, X. Wang, and X. Dai, “Decentralized baseband processing with Gaussian message passing detection for uplink massive MU-MIMO systems,” IEEE Trans. Veh. Technol., vol. 71, no. 2, pp. 2152–2157, Feb. 2022.
  • [18] C. M. Bishop, Pattern recognition and machine learning.   New York, NY, USA: Springer, 2006.
  • [19] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov chain Monte Carlo.    Boca Raton, FL, USA: Chapman & Hall, 2011.
  • [20] B. Farhang-Boroujeny, H. Zhu, and Z. Shi, “Markov chain Monte Carlo algorithms for CDMA and MIMO communication systems,” IEEE Trans. Signal Process., vol. 54, no. 5, pp. 1896–1909, May 2006.
  • [21] J. C. Hedstrom, C. H. Yuen, R.-R. Chen, and B. Farhang-Boroujeny, “Achieving near MAP performance with an excited Markov chain Monte Carlo MIMO detector,” IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 7718–7732, Dec. 2017.
  • [22] S. A. Laraway and B. Farhang-Boroujeny, “Implementation of a Markov chain Monte Carlo based multiuser/MIMO detector,” IEEE Trans. Circuits and Syst. I: Reg. Papers, vol. 56, no. 1, pp. 246–255, Jan. 2009.
  • [23] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, “Sampling can be faster than optimization,” Proc. Nat. Acad. Sci., vol. 116, no. 42, pp. 20 881–20 885, Sep. 2019.
  • [24] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 681–688.
  • [25] N. M. Gowda, S. Krishnamurthy, and A. Belogolovy, “Metropolis-Hastings random walk along the gradient descent direction for MIMO detection,” in Proc. IEEE Int. Conf. Commun. (ICC), Montreal, QC, Canada, Jun. 2021, pp. 1–7.
  • [26] N. Zilberstein, C. Dick, R. Doost-Mohammady, A. Sabharwal, and S. Segarra, “Annealed Langevin dynamics for massive MIMO detection,” IEEE Trans. Wireless Commun., vol. 22, no. 6, pp. 3762–3776, Jun. 2023.
  • [27] X. Zhou, L. Liang, J. Zhang, C.-K. Wen, and S. Jin, “Gradient-based Markov chain Monte Carlo for MIMO detection,” IEEE Trans. Wireless Commun., vol. 23, no. 7, pp. 7566–7581, Jul. 2024.
  • [28] Y. Nesterov, Introductory lectures on convex optimization.   Springer Science+Business Media, 2004.
  • [29] H. Li, Y. Dong, C. Gong, X. Wang, and X. Dai, “Decentralized groupwise expectation propagation detector for uplink massive MU-MIMO systems,” IEEE Internet Things J., vol. 10, no. 6, pp. 5393–5405, Mar. 2023.
  • [30] X. Zhao, M. Li, B. Wang, E. Song, T.-H. Chang, and Q. Shi, “Decentralized equalization for massive MIMO systems with colored noise samples,” May 2023. [Online] Available: http://arxiv.org/abs/2305.12805.
  • [31] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, no. 1, pp. 97–109, Apr. 1970.
  • [32] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Rev., vol. 60, no. 2, pp. 223–311, Jan. 2018.
  • [33] B. O’Donoghue and E. Candès, “Adaptive restart for accelerated gradient schemes,” Found. Comput. Math., vol. 15, no. 3, pp. 715–732, Jun. 2015.
  • [34] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, “Aspects of favorable propagation in massive MIMO,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Sep. 2014, pp. 76–80.
  • [35] A. Barbu and S.-C. Zhu, “Hamiltonian and Langevin Monte Carlo,” in Monte Carlo methods, A. Barbu and S.-C. Zhu, Eds.   Singapore: Springer, 2020, pp. 281–325.
  • [36] J. Martin, C. L. Wilcox, C. Burstedde, and O. Ghattas, “A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion,” SIAM J. Sci. Comput., vol. 34, no. 3, pp. 1460–1487, Jan. 2012.
  • [37] K. P. Murphy, Probabilistic machine learning: Advanced topics.   Cambridge, MA, USA: MIT Press, 2023.
  • [38] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, “V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,” in Proc. Int. Symp. Signals Syst. Electron. (ISSSE), Oct. 1998, pp. 295–300.
  • [39] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed.   Cambridge, Massachusetts: MIT Press, 2001.
  • [40] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
  • [41] 3GPP TR 36.873, “Study on 3D channel model for LTE (Release 12),” Tech. Rep., Jun. 2017. [Online]. Available: https://www.3gpp.org/ftp/Specs/archive/36_series/36.873/.
  • [42] X. Qian and D. Klabjan, “The impact of the mini-batch size on the variance of gradients in stochastic gradient descent,” Apr. 2020. [Online] Available: http://arxiv.org/abs/2004.13146.
  • [43] S. Jaeckel, L. Raschkowski, K. Börner, and L. Thiele, “QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials,” IEEE Trans. Antennas Propag., vol. 62, no. 6, pp. 3242–3256, Jun. 2014.
  • [44] B. Hassibi, M. Hansen, A. G. Dimakis, H. A. J. Alshamary, and W. Xu, “Optimized Markov chain Monte Carlo for signal detection in MIMO systems: An analysis of the stationary distribution and mixing time,” IEEE Trans. Signal Process., vol. 62, no. 17, pp. 4436–4450, Sep. 2014.
  • [45] R. Zhang, X. Liu, and Q. Liu, “A Langevin-like sampler for discrete distributions,” in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2022, pp. 26 375–26 396.
  • [46] T.-Y. Wu, Y. X. R. Wang, and W. H. Wong, “Mini-batch Metropolis-Hastings with reversible SGLD proposal,” J. Amer. Stat. Assoc., vol. 117, no. 537, pp. 386–394, Jan. 2022.
  • [47] A. Doucet and X. Wang, “Monte Carlo methods for signal processing: A review in the statistical signal processing context,” IEEE Signal Process. Mag., vol. 22, no. 6, pp. 152–170, Nov. 2005.
  • [48] J. R. Norris, Markov Chains.   Cambridge, U.K.: Cambridge Univ. Press, 1997.
  • [49] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times.   Providence, RI, USA: Amer. Math. Soc., 2008.