
Semi-Data-Aided Channel Estimation for
MIMO Systems via Reinforcement Learning

Tae-Kyoung Kim, Yo-Seb Jeon, Jun Li, Nima Tavangaran, and H. Vincent Poor

This paper was presented in part at the 2020 IEEE International Conference on Communications (ICC) [1].

Tae-Kyoung Kim is with the Department of Electronic Engineering, Gachon University, Seongnam 13120, Republic of Korea (e-mail: tk415kim@gmail.com). Yo-Seb Jeon is with the Department of Electrical Engineering, POSTECH, Pohang 37673, Republic of Korea (e-mail: yoseb.jeon@postech.ac.kr). Jun Li is with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China. He is also with the Department of Software Engineering, Institute of Cybernetics, National Research Tomsk Polytechnic University, Tomsk 634050, Russia (e-mail: jun.li@njust.edu.cn). Nima Tavangaran and H. Vincent Poor are with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: nimat@princeton.edu; poor@princeton.edu).
Abstract

Data-aided channel estimation is a promising solution for improving channel estimation accuracy by exploiting data symbols as pilot signals to update an initial channel estimate. In this paper, we propose a semi-data-aided channel estimator for multiple-input multiple-output communication systems. Our strategy is to leverage reinforcement learning (RL) to select reliable detected symbols among those in the first part of the transmitted data block. This strategy facilitates an update of the channel estimate before the end of the data block transmission and therefore achieves a significant reduction in communication latency compared to conventional data-aided channel estimation approaches. Towards this end, we first define a Markov decision process (MDP) which sequentially decides whether to use each detected symbol as an additional pilot signal. We then develop an RL algorithm to efficiently find the best policy of the MDP based on a Monte Carlo tree search approach. In this algorithm, we exploit the a-posteriori probability for approximating both the optimal future actions and the corresponding state transitions of the MDP and derive a closed-form expression for the best policy. Simulation results demonstrate that the proposed channel estimator effectively mitigates both the channel estimation error and the detection performance loss caused by insufficient pilot signals.

Index Terms:
Multiple-input multiple-output (MIMO), channel estimation, data-aided channel estimation, reinforcement learning, Monte Carlo tree search.

I Introduction

Multiple-input multiple-output (MIMO) communication is one of the core technologies in modern wireless standards. The use of multiple antennas significantly improves both the capacity and the reliability of wireless systems by providing spatial multiplexing and diversity gains [2, 3, 4]. A key requirement to enjoy these benefits is accurate channel state information (CSI) at both transmitter and receiver. For example, the capacity of MIMO communication systems increases linearly with the number of either transmit or receive antennas under the premise that perfect CSI is available at both the transmitter and receiver [2, 3].

To obtain accurate CSI at the receiver (CSIR), various channel estimation techniques have been developed for MIMO communication systems [5, 6, 7, 8, 9, 15, 14, 10, 11, 12, 13, 16, 17]. One of the most popular and widely adopted techniques is pilot-aided channel estimation [5, 6, 7, 8]. The fundamental idea of this technique is to send pilot signals, known a priori at the receiver, and then to estimate the CSI from the received signals observed during pilot transmission. A representative example of this technique is the least-squares (LS) channel estimator, which minimizes the sum of squared errors in the estimated CSIR [7, 8]. Another example is the linear minimum-mean-squared-error (LMMSE) channel estimator, a linear estimator that minimizes the mean-squared-error (MSE) of the estimated CSIR based on the first-order and second-order channel statistics [7, 8]. The accuracy of the CSIR obtained from pilot-aided channel estimation improves with the number of pilot signals available in a communication system. In addition, the larger the number of spatially multiplexed data streams utilized in a MIMO system, the larger the number of pilot signals required for accurate CSIR. Despite this requirement, in practical MIMO communication systems, only a small portion of the radio resources is allocated for pilot transmission, while most of the radio resources are allocated for transmitting data (non-pilot) signals.

Data-aided channel estimation is a promising solution to overcome the limitation of pilot-aided channel estimation caused by an insufficient number of pilot signals [9, 10, 11, 12, 13, 14, 15, 16, 17]. The basic strategy of data-aided channel estimation is to exploit data symbols as additional pilot signals to update an initial channel estimate obtained from pilot-aided channel estimation. This strategy allows the receiver to enjoy the effect of an increased number of pilot signals and therefore has the potential to provide more accurate CSIR than pilot-aided channel estimation without sacrificing radio resources for data transmission. A non-iterative data-aided channel estimation method was first investigated in [9]. In this method, data symbols are reconstructed by properly encoding and modulating the outputs of the channel decoder, so that the reconstructed data symbols can be utilized as pilot signals for channel estimation. The performance of this method, however, degrades in the presence of decoding errors, which lead to a mismatch between the reconstructed and transmitted data symbols. To resolve this problem, iterative data-aided channel estimation has been studied in [15, 13, 16, 14, 17], which iteratively performs channel estimation and data detection to mitigate both channel estimation and decoding errors. In [15], an iterative turbo channel estimation technique was developed in which soft-decision symbols are utilized as pilot signals at each iteration. A similar iterative approach was developed in [16] by selectively utilizing soft-decision symbols as pilot signals according to an MSE-based criterion. The common limitation of these iterative data-aided channel estimators is that they increase not only the computational complexity of receive processing but also communication latency.

Recently, deep-learning-based channel estimation has also drawn increasing attention as a means of circumventing the limitation of pilot-aided channel estimation [18, 19, 22, 21, 20, 23, 24, 25, 26]. The basic idea of this technique is to learn a channel from training samples, each of which describes the input-output relation of a communication system. The most prominent feature of deep-learning-based channel estimation is that it can be readily incorporated into complicated communication systems, e.g., massive MIMO, millimeter-wave, and doubly-selective channels [22, 21, 20]. The use of deep learning, however, requires a huge training set to optimize the neural networks and therefore increases both computational complexity and communication latency. To resolve this drawback, a model-driven deep learning approach was studied in [23, 24, 25, 26]. This approach effectively reduces the size of the training set by learning only the parameters of a model for estimating the channel. Specifically, a joint optimization of data detection and channel estimation was introduced in [25] based on a Bayesian model. A similar channel estimation method for millimeter-wave MIMO systems was introduced in [26]. Although these model-driven channel estimators effectively mitigate the limitation of deep-learning-based channel estimation, the use of deep learning still brings non-negligible computational complexity and communication latency that may not be affordable in practical systems.

This paper presents a new type of data-aided channel estimation for MIMO communication systems, referred to as semi-data-aided channel estimation, which reduces the communication latency caused by iterative data-aided channel estimation. The basic strategy of the presented channel estimator is to leverage reinforcement learning (RL) to select reliable detected symbol vectors only among the symbols in the first part of the data block. The most prominent feature of the presented channel estimator is that it does not utilize the channel decoder outputs and therefore facilitates an early update of the channel estimate even before the end of the data block transmission. Simulation results demonstrate that the presented channel estimator effectively mitigates both the channel estimation error and the detection performance loss caused by insufficient pilot signals. The major contributions of this paper are summarized as follows:

  • We present a Markov decision process (MDP) to sequentially determine the best selection of detected symbol vectors for minimizing the MSE of the semi-data-aided channel estimation. To this end, we adopt a binary action that indicates whether to exploit each detected symbol vector as an additional pilot signal, while defining a reward function as the MSE reduction of the channel estimate. With this MDP, we successfully formulate a symbol vector selection problem for the semi-data-aided channel estimation as a sequential decision-making problem that can be efficiently solved via RL.

  • We propose a novel RL algorithm to efficiently find the best policy of the presented MDP. The underlying challenge is that the state transition of the presented MDP is unknown at the receiver due to the lack of knowledge of the transmitted symbol vectors. In the proposed algorithm, we tackle this challenge by leveraging a Monte Carlo tree search (MCTS) approach [27, 28, 29], which looks ahead at the rewards of near-future actions while approximating the rewards of distant-future actions via Monte Carlo simulations. We modify the original MCTS approach by exploiting the a-posteriori probability (APP), computed from data detection, for approximating both the optimal future actions and the corresponding state transitions of the MDP. The most prominent advantage of the proposed RL algorithm is that the best policy for each state has a closed-form expression that can be readily computed at the receiver.

  • We present two additional strategies for enhancing the advantages of the semi-data-aided channel estimation operating with the proposed RL algorithm. In the first strategy, we develop a low-complexity policy that approximates the optimal policy of the presented MDP based on Monte Carlo sampling. Utilizing this new policy, we further reduce the computational complexity required in the proposed RL algorithm. In the second strategy, we utilize an updated channel estimate for re-detecting the symbol vectors that are not selected by the proposed RL algorithm. Utilizing this strategy, we further improve data detection performance when employing the semi-data-aided channel estimation, without a significant increase in the computational complexity.

  • In simulations, we evaluate the normalized MSE (NMSE) and block-error-rate (BLER) of the proposed channel estimator for a coded MIMO communication system. Our simulation results demonstrate that the proposed channel estimator significantly reduces the NMSE of channel estimation while improving the BLER of the system, compared to conventional pilot-aided channel estimation. It is also shown that the proposed RL algorithm effectively selects detected symbol vectors that improve the performance of the semi-data-aided channel estimation. We also investigate the robustness of the proposed channel estimator in time-varying channels and demonstrate that it reduces performance degradation in time-varying environments by tracking temporal channel variations during data transmission.

An RL algorithm for optimizing the symbol vector selection of data-aided channel estimation was first introduced in our prior work [1]. In this algorithm, the optimal policy of the MDP is derived under a simplistic assumption that underestimates the effect of future actions and rewards. In this paper, we generalize the RL algorithm in [1] by employing the MCTS approach which provides a more accurate evaluation of the effect of the future actions and rewards. In addition to this major change, we newly introduce the semi-data-aided channel estimation strategy to further reduce the delay required for updating the channel estimate and also introduce the data re-detection strategy to improve detection performance after the symbol vector selection.

The remainder of this paper is organized as follows. Section II introduces the system model and preliminaries considered in this paper. In Section III, we formulate an optimization problem that adaptively selects the detected symbols for the semi-data-aided channel estimator. An efficient RL algorithm to solve the optimization problem is proposed in Section IV. Simulation results are presented in Section V to verify the effectiveness of the proposed channel estimator. The conclusion is presented in Section VI.

Notation

Matrices $\mathbf{0}_{m}$ and $\mathbf{I}_{m}$ represent the $m\times m$ all-zero matrix and the $m\times m$ identity matrix, respectively. Superscripts $(\cdot)^{T}$ and $(\cdot)^{H}$ denote the transpose and the conjugate transpose, respectively. Operators $\mathbb{E}(\cdot)$, $\mathbb{P}(\cdot)$, $|\cdot|$, and $\|\cdot\|_{\rm F}$ denote the expectation of a random variable, the probability of an event, the cardinality of a set, and the Frobenius norm, respectively. $(\cdot)^{-1}$ denotes the matrix inverse. The set $\mathbb{C}$ represents the set of complex numbers.

II System Model and Preliminaries

In this section, we introduce the MIMO communication system considered in this work. The LMMSE channel estimator and the maximum-a-posteriori-probability (MAP) data detector are presented for the considered system. We then describe the challenge that the LMMSE channel estimator faces in achieving optimal performance.

Figure 1: A MIMO communication system in which a transmitter equipped with $N_{\rm tx}$ antennas communicates with a receiver equipped with $N_{\rm rx}$ antennas. A transmission frame consists of a pilot block with length $T_{\rm p}$ followed by a data block with length $T_{\rm d}$. The data block consists of two parts: the lengths of the first part and the second part are $T_{\rm u}$ and $T_{\rm d}-T_{\rm u}$, respectively.

II-A System model

We consider a MIMO communication system in which a transmitter equipped with $N_{\rm tx}$ antennas communicates with a receiver equipped with $N_{\rm rx}$ antennas, as illustrated in Fig. 1. We model the wireless channel of the considered system as a frequency-flat Rayleigh fading channel denoted by $\mathbf{H}\in\mathbb{C}^{N_{\rm rx}\times N_{\rm tx}}$, where the entries of $\mathbf{H}$ are independent and identically distributed (i.i.d.) random variables with distribution $\mathcal{CN}(0,1)$. We assume a block fading channel in which the entries of $\mathbf{H}$ remain constant during a transmission frame.

A transmission frame consists of a pilot block with length $T_{\rm p}$ followed by a data block with length $T_{\rm d}$, as illustrated in Fig. 1. The sets of time slot indices associated with the pilot block and the data block are denoted by $\mathcal{T}_{\rm p}=\{-T_{\rm p}+1,\ldots,0\}$ and $\mathcal{T}_{\rm d}=\{1,\ldots,T_{\rm d}\}$, respectively. Let $\mathbf{p}[n]\in\mathbb{C}^{N_{\rm tx}}$ be the pilot signal sent at time slot $n$ such that $\mathbb{E}\big[\|\mathbf{p}[n]\|^{2}\big]=N_{\rm tx}$. Then the received signal at time slot $n\in\mathcal{T}_{\rm p}$ is given by

$\mathbf{y}[n]=\mathbf{H}\mathbf{p}[n]+\mathbf{z}[n],$  (1)

where $\mathbf{z}[n]\sim\mathcal{CN}(\mathbf{0}_{N_{\rm rx}},\sigma^{2}\mathbf{I}_{N_{\rm rx}})$ is a circularly symmetric complex Gaussian noise vector at time slot $n$. For the data transmission, the transmitter generates data symbol vectors after symbol mapping of the information bits. Let $\mathbf{x}[n]\in\mathcal{X}^{N_{\rm tx}}$ be the data symbol vector sent at time slot $n\in\mathcal{T}_{\rm d}$, where $\mathcal{X}$ is a constellation set such that $\mathbb{E}\big[\|\mathbf{x}[n]\|^{2}\big]=N_{\rm tx}$. Then the received signal at time slot $n\in\mathcal{T}_{\rm d}$ during the data transmission is given by

$\mathbf{y}[n]=\big[y_{1}[n],\cdots,y_{N_{\rm rx}}[n]\big]^{T}=\mathbf{H}\mathbf{x}[n]+\mathbf{z}[n].$  (2)

II-B LMMSE Channel Estimator

The LMMSE channel estimator is a linear estimator that minimizes the MSE of a channel estimate. This method has been widely adopted in wireless communication systems as it provides a good trade-off between estimation accuracy and computational complexity [7, 8]. Let $\mathbf{Y}_{\rm p}$ be a matrix that concatenates the received signals observed during the pilot transmission. From (1), $\mathbf{Y}_{\rm p}$ is expressed as

$\mathbf{Y}_{\rm p}=\big[\mathbf{y}[-T_{\rm p}+1],\cdots,\mathbf{y}[0]\big]=\mathbf{H}\mathbf{P}+\mathbf{Z}_{\rm p},$  (3)

where $\mathbf{P}=\big[\mathbf{p}[-T_{\rm p}+1],\cdots,\mathbf{p}[0]\big]$ and $\mathbf{Z}_{\rm p}=\big[\mathbf{z}[-T_{\rm p}+1],\cdots,\mathbf{z}[0]\big]$. From (3), the LMMSE channel estimator is given by

$\mathbf{W}_{\rm LMMSE}=\operatorname*{argmin}_{\mathbf{W}\in\mathbb{C}^{T_{\rm p}\times N_{\rm tx}}}\mathbb{E}\big[\|\mathbf{Y}_{\rm p}\mathbf{W}-\mathbf{H}\|_{\rm F}^{2}\big]=\mathbf{P}^{H}\big(\mathbf{P}\mathbf{P}^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1},$  (4)

where the expectation is taken with respect to the channel and noise distributions. Consequently, the LMMSE channel estimate is computed as

$\hat{\mathbf{H}}_{\rm p}=\mathbf{Y}_{\rm p}\mathbf{P}^{H}\big(\mathbf{P}\mathbf{P}^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1}.$  (5)

If the entries of $\mathbf{H}$ are i.i.d. with $\mathcal{CN}(0,1)$, the MSE of the LMMSE channel estimate is computed as

$\mathbb{E}\big[\|\hat{\mathbf{H}}_{\rm p}-\mathbf{H}\|_{\rm F}^{2}\big]=N_{\rm rx}{\rm Tr}\big[\mathbb{E}\big[(\hat{\mathbf{h}}_{{\rm p},r}^{H}-\mathbf{h}_{r}^{H})(\hat{\mathbf{h}}_{{\rm p},r}-\mathbf{h}_{r})\big]\big]=N_{\rm rx}\sigma^{2}{\rm Tr}\big[\big(\mathbf{P}\mathbf{P}^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1}\big],$  (6)

where $\hat{\mathbf{h}}_{{\rm p},r}$ and $\mathbf{h}_{r}$ are the $r$-th rows of $\hat{\mathbf{H}}_{\rm p}$ and $\mathbf{H}$, respectively. As can be seen from (6), the MSE of the LMMSE channel estimate decreases with the number of pilot signals $T_{\rm p}$.
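To make these expressions concrete, the following is a minimal NumPy sketch of the pilot-aided LMMSE estimate in (5) together with the theoretical MSE in (6). The unit-modulus pilot matrix and all parameter values are hypothetical choices for illustration, not specified by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_tx, N_rx, T_p, sigma2 = 2, 4, 4, 0.1

# i.i.d. CN(0,1) channel and a (hypothetical) unit-modulus pilot matrix P (N_tx x T_p),
# so that E[||p[n]||^2] = N_tx as required by the system model.
H = (rng.standard_normal((N_rx, N_tx)) + 1j * rng.standard_normal((N_rx, N_tx))) / np.sqrt(2)
P = np.exp(2j * np.pi * rng.random((N_tx, T_p)))

# Received pilot block Y_p = H P + Z_p, eq. (3).
Z_p = np.sqrt(sigma2 / 2) * (rng.standard_normal((N_rx, T_p)) + 1j * rng.standard_normal((N_rx, T_p)))
Y_p = H @ P + Z_p

# LMMSE estimate H_hat_p = Y_p P^H (P P^H + sigma^2 I)^{-1}, eq. (5).
G = np.linalg.inv(P @ P.conj().T + sigma2 * np.eye(N_tx))
H_hat_p = Y_p @ P.conj().T @ G

# Theoretical MSE N_rx * sigma^2 * Tr[(P P^H + sigma^2 I)^{-1}], eq. (6),
# compared against the squared error of this single realization.
mse_theory = (N_rx * sigma2 * np.trace(G)).real
print(mse_theory, np.linalg.norm(H_hat_p - H) ** 2)
```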

II-C Maximum-A-Posteriori-Probability (MAP) Data Detector

In this work, we assume that the receiver employs the MAP data detection method, which finds the symbol vector with the maximum APP for a received signal. This method is optimal in terms of minimizing the detection error probability and therefore has the potential to maximize the performance of the channel estimator presented in Sec. IV. Nevertheless, as will be discussed later, the applicability of the presented channel estimator is not limited to the MAP data detection method.

Let $\mathbf{x}_{k}$ be a vector in $\mathcal{X}^{N_{\rm tx}}$ with $k\in\mathcal{K}=\{1,\ldots,K\}$, where $K=|\mathcal{X}|^{N_{\rm tx}}$. The APP of the event $\{\mathbf{x}[n]=\mathbf{x}_{k}\}$ for the given received signal $\mathbf{y}[n]$ is expressed as

$\theta_{k}[n]\triangleq\mathbb{P}\big[\mathbf{x}[n]=\mathbf{x}_{k}\,\big|\,\mathbf{y}[n]\big]=\frac{\mathbb{P}\big[\mathbf{y}[n]\,\big|\,\mathbf{x}[n]=\mathbf{x}_{k}\big]\,\mathbb{P}\big[\mathbf{x}[n]=\mathbf{x}_{k}\big]}{\sum_{j\in\mathcal{K}}\mathbb{P}\big[\mathbf{y}[n]\,\big|\,\mathbf{x}[n]=\mathbf{x}_{j}\big]\,\mathbb{P}\big[\mathbf{x}[n]=\mathbf{x}_{j}\big]}\overset{(a)}{=}\frac{\mathbb{P}\big[\mathbf{y}[n]\,\big|\,\mathbf{x}[n]=\mathbf{x}_{k}\big]}{\sum_{j\in\mathcal{K}}\mathbb{P}\big[\mathbf{y}[n]\,\big|\,\mathbf{x}[n]=\mathbf{x}_{j}\big]},$  (7)

where equality (a) holds when every symbol vector is transmitted with equal probability (i.e., $\mathbb{P}[\mathbf{x}[n]=\mathbf{x}_{k}]=\frac{1}{K}$, $\forall k\in\mathcal{K}$). Since $\mathbf{z}[n]\sim\mathcal{CN}(\mathbf{0}_{N_{\rm rx}},\sigma^{2}\mathbf{I}_{N_{\rm rx}})$, the probability $\mathbb{P}[\mathbf{y}[n]\,|\,\mathbf{x}[n]=\mathbf{x}_{k}]$ in (7) is given by

$\mathbb{P}\big[\mathbf{y}[n]\,\big|\,\mathbf{x}[n]=\mathbf{x}_{k}\big]=\frac{1}{(\pi\sigma^{2})^{N_{\rm rx}}}\exp\Big(-\frac{\|\mathbf{y}[n]-\mathbf{H}\mathbf{x}_{k}\|^{2}}{\sigma^{2}}\Big),$  (8)

for $k\in\mathcal{K}$. This probability is also known as the likelihood function. By applying (8) to (7), the APP is computed as

$\theta_{k}[n]=\frac{\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\mathbf{H}\mathbf{x}_{k}\|^{2}\big)}{\sum_{j\in\mathcal{K}}\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\mathbf{H}\mathbf{x}_{j}\|^{2}\big)}.$  (9)

Then the MAP detection rule is given by

$\hat{\mathbf{x}}[n]=\mathbf{x}_{\hat{k}_{n}},\quad\text{where}\quad\hat{k}_{n}=\operatorname*{argmax}_{k\in\mathcal{K}}\,\theta_{k}[n].$  (10)

In practical communication systems, the receiver cannot compute the exact APP in (9) because it requires perfect knowledge of $\mathbf{H}$. As an alternative, an approximate APP is utilized for data detection, computed from the MIMO channel estimate $\hat{\mathbf{H}}_{\rm p}$ in (5) as follows:

$\hat{\theta}_{k}[n]=\frac{\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\hat{\mathbf{H}}_{\rm p}\mathbf{x}_{k}\|^{2}\big)}{\sum_{j\in\mathcal{K}}\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\hat{\mathbf{H}}_{\rm p}\mathbf{x}_{j}\|^{2}\big)}.$  (11)

Unfortunately, when employing pilot-aided channel estimation with an insufficient number of pilot signals, channel estimation error (i.e., $\hat{\mathbf{H}}_{\rm p}-\mathbf{H}$) is inevitable at the receiver, as shown in (6). Because this error leads to a mismatch between the true APP in (9) and the approximate APP in (11), the use of the approximate APP results in detection performance degradation. Moreover, the degree of this degradation increases as the number of pilot signals $T_{\rm p}$ decreases. To resolve this problem, in the following sections, we present a novel channel estimation approach that utilizes detected symbol vectors to reduce the channel estimation error caused by insufficient pilot signals.
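As a concrete illustration of (10) and (11), the sketch below enumerates all $K=|\mathcal{X}|^{N_{\rm tx}}$ candidate symbol vectors for a small 4-QAM setup and applies the MAP rule with the approximate APP. The helper names and the exhaustive enumeration are our own illustrative choices, assuming $N_{\rm tx}$ is small enough for brute-force search.

```python
import itertools
import numpy as np

def approximate_app(y, H_hat, candidates, sigma2):
    """Approximate APP in (11): softmax of -||y - H_hat x_k||^2 / sigma2 over candidates."""
    metrics = np.array([-np.linalg.norm(y - H_hat @ x) ** 2 / sigma2 for x in candidates])
    metrics -= metrics.max()          # subtract the max exponent for numerical stability
    app = np.exp(metrics)
    return app / app.sum()

def map_detect(y, H_hat, candidates, sigma2):
    """MAP rule (10): return the index k_hat with the maximum APP, plus the APP vector."""
    app = approximate_app(y, H_hat, candidates, sigma2)
    return int(np.argmax(app)), app

# 4-QAM alphabet with unit average symbol energy; K = 4^2 candidates for N_tx = 2.
qam4 = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
candidates = [np.array(c) for c in itertools.product(qam4, repeat=2)]
```

With the true channel $\mathbf{H}$ in place of $\hat{\mathbf{H}}_{\rm p}$, the same routine evaluates the exact APP in (9).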

III Optimization Problem for Semi-Data-Aided Channel Estimation

Data-aided channel estimation is a well-known approach to reduce channel estimation error when the number of pilot signals is insufficient. The fundamental idea of data-aided channel estimation is to exploit detected symbol vectors as additional pilot signals for updating a channel estimate. On the basis of the same idea, in this section, we present a new type of data-aided channel estimation, referred to as semi-data-aided channel estimation, which enables a fast update of the channel estimate through the selective use of detected symbol vectors. In what follows, we first elaborate on the basic idea of the semi-data-aided channel estimation and an optimization problem to maximize its performance. We then reformulate the optimization problem as an MDP in order to adopt RL to solve this problem.

III-A Semi-Data-Aided Channel Estimation

Our key observation is that not every detected symbol vector is a good candidate for a pilot signal because some detected symbol vectors differ from the transmitted symbol vectors due to data detection errors. Another important observation is that once the receiver obtains a sufficient number of additional pilot signals, increasing the number of pilot signals further gives no significant improvement in channel estimation accuracy. Motivated by these observations, in the semi-data-aided channel estimation, we exploit only the detected symbol vectors that are beneficial for improving the channel estimation accuracy. Meanwhile, we select these symbol vectors only among the first $T_{\rm u}$ detected symbol vectors, while utilizing the updated channel estimate for detecting the remaining $T_{\rm d}-T_{\rm u}$ symbol vectors, as illustrated in Fig. 1. We refer to this strategy as semi-data-aided channel estimation because it utilizes only a portion of the detected symbol vectors, unlike conventional data-aided channel estimation. The most prominent advantage of our strategy is that the channel estimate is updated after the transmission of $T_{\rm u}$ symbol vectors; thereby, our strategy significantly reduces the delay required for updating the channel estimate compared to conventional data-aided channel estimation methods that update the channel estimate after the end of the data block transmission (i.e., after $T_{\rm d}$ time slots). Moreover, the semi-data-aided channel estimation does not utilize the outputs of a channel decoder, implying that repetitions of the channel decoding process are unnecessary. Because of this feature, the computational complexity of the semi-data-aided channel estimation is lower than those of conventional data-aided channel estimation methods that require repeating the channel decoding process (e.g., [10, 11, 12, 15, 16, 17]).

III-B Optimization Problem for Symbol Vector Selection

A key to the success of the semi-data-aided channel estimation is to optimize the selection of detected symbol vectors so that the accuracy of the updated channel estimate is maximized. A direct optimization of the symbol vector selection, however, is very challenging in practical systems due to the lack of knowledge of the transmitted symbol vectors and the high computational complexity involved. To shed some light on this challenge, we formulate an optimization problem for the symbol vector selection to minimize the error of the updated channel estimate. Let $\mathbf{a}\in\{0,1\}^{T_{\rm u}}$ be a vector whose $n$-th entry indicates whether to utilize the detected symbol vector at time slot $n$, $\hat{\mathbf{x}}[n]$, in the semi-data-aided channel estimation. If the receiver utilizes only the detected symbol vectors indicated by $\mathbf{a}$ as additional pilot signals, the LMMSE channel estimate is updated as

$\hat{\mathbf{H}}(\mathbf{a})=\mathbf{Y}(\mathbf{a})\mathbf{W}_{\rm LMMSE}(\mathbf{a})=\mathbf{Y}(\mathbf{a})\hat{\mathbf{X}}^{H}(\mathbf{a})\big(\hat{\mathbf{X}}(\mathbf{a})\hat{\mathbf{X}}^{H}(\mathbf{a})+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1},$  (12)

where $\mathbf{Y}(\mathbf{a})=\big[\mathbf{Y}_{\rm p},\mathbf{y}[l_{1}(\mathbf{a})],\cdots,\mathbf{y}[l_{\|\mathbf{a}\|_{0}}(\mathbf{a})]\big]$, $\hat{\mathbf{X}}(\mathbf{a})=\big[\mathbf{P},\hat{\mathbf{x}}[l_{1}(\mathbf{a})],\cdots,\hat{\mathbf{x}}[l_{\|\mathbf{a}\|_{0}}(\mathbf{a})]\big]$, and $l_{i}(\mathbf{a})$ is the index of the $i$-th nonzero entry of $\mathbf{a}$. Note that $\|\mathbf{a}\|_{0}$ is the number of nonzero entries of $\mathbf{a}$. Based on the above notation, a symbol vector selection problem for minimizing the MSE of the updated channel estimate is formulated as

$\mathbf{a}^{\star}=\operatorname*{argmin}_{\mathbf{a}\in\{0,1\}^{T_{\rm u}}}\mathbb{E}\big[\|\hat{\mathbf{H}}(\mathbf{a})-\mathbf{H}\|_{\rm F}^{2}\big]=\operatorname*{argmin}_{\mathbf{a}\in\{0,1\}^{T_{\rm u}}}\mathbb{E}\big[\|\mathbf{Y}(\mathbf{a})\mathbf{W}_{\rm LMMSE}(\mathbf{a})-\mathbf{H}\|_{\rm F}^{2}\big],$  (13)

where the expectation is taken with respect to the channel and noise distributions. The first key observation is that the distribution of $\mathbf{Y}(\mathbf{a})$ depends on the transmitted symbol vectors associated with $\mathbf{a}$; thereby, solving the optimization problem in (13) requires perfect knowledge of the first $T_{\rm u}$ transmitted symbol vectors at the receiver. Another important observation is that the number of possible choices for the symbol vector selection is $2^{T_{\rm u}}$, which increases exponentially with the number of symbol vector candidates. These observations reveal that directly solving the problem in (13) at the receiver is very challenging in practical systems.
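For reference, a short sketch of the update rule (12) under a given selection vector $\mathbf{a}$ is given below; the function and argument names are illustrative, with the columns of the inputs assumed to be indexed by time slots.

```python
import numpy as np

def data_aided_lmmse(Y_p, P, Y_data, X_detected, a, sigma2):
    """Updated LMMSE estimate H_hat(a) in (12).

    Y_p: N_rx x T_p received pilot block; P: N_tx x T_p pilot matrix;
    Y_data: N_rx x T_u received data block; X_detected: N_tx x T_u detected symbols;
    a: length-T_u binary selection vector."""
    sel = np.flatnonzero(a)           # selected slots l_1(a), ..., l_{||a||_0}(a)
    Y = np.concatenate([Y_p, Y_data[:, sel]], axis=1)
    X = np.concatenate([P, X_detected[:, sel]], axis=1)
    N_tx = P.shape[0]
    return Y @ X.conj().T @ np.linalg.inv(X @ X.conj().T + sigma2 * np.eye(N_tx))
```

An exhaustive search for (13) would call this routine for all $2^{T_{\rm u}}$ selection vectors, which is exactly the complexity the MDP reformulation below avoids.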

III-C MDP for Symbol Vector Selection

To circumvent the aforementioned challenge, we reformulate the optimization problem in (13) as an MDP which sequentially decides whether to use each detected symbol vector, with a reward defined as the reduction in channel estimation error. In Sec. IV, we will demonstrate how this MDP allows the receiver to approximately but efficiently solve the original problem in (13) using an RL approach. Details of our MDP formulation are elaborated below.

III-C1 State

The state set of the MDP associated with time slot $n$ is defined as

$\mathcal{S}_{n}=\Big\{\big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},\mathbf{a}_{n}\big)\,\Big|\,\mathbf{X}_{n}=\big[\mathbf{P},\mathbf{x}_{j_{1}},\cdots,\mathbf{x}_{j_{\|\mathbf{a}_{n}\|_{0}}}\big],\ \hat{\mathbf{X}}_{n}=\big[\mathbf{P},\hat{\mathbf{x}}[l_{1}(\mathbf{a}_{n})],\cdots,\hat{\mathbf{x}}[l_{\|\mathbf{a}_{n}\|_{0}}(\mathbf{a}_{n})]\big],\ \mathbf{a}_{n}\in\{0,1\}^{n-1}\Big\},$  (14)

where $j_{i}\in\mathcal{K}$ is the candidate index of the transmitted symbol vector associated with the $i$-th nonzero entry of $\mathbf{a}_{n}$. In (14), $\mathbf{a}_{n}$ is the sequence of actions taken until time slot $n-1$. If $a_{i}=1$, the detected symbol vector at time slot $i$ is exploited as an additional pilot signal for the data-aided channel estimation. Using this definition, the LMMSE channel estimate obtained at the state $\mathrm{S}_{n}=\big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},\mathbf{a}_{n}\big)\in\mathcal{S}_{n}$ is given by

$\hat{\mathbf{H}}(\mathrm{S}_{n})=\mathbf{Y}(\mathrm{S}_{n})\hat{\mathbf{X}}_{n}^{H}\big(\hat{\mathbf{X}}_{n}\hat{\mathbf{X}}_{n}^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1},$  (15)

where $\mathbf{Y}(\mathrm{S}_{n})=\big[\mathbf{Y}_{\rm p},\mathbf{y}[l_{1}(\mathbf{a}_{n})],\cdots,\mathbf{y}[l_{\|\mathbf{a}_{n}\|_{0}}(\mathbf{a}_{n})]\big]$.

III-C2 Reward Function

The reward function of the MDP is defined as the MSE reduction of the channel estimate when transitioning from the current state to the next state. Based on this definition, the reward function associated with the state transition from $\mathrm{S}_{n}\in\mathcal{S}_{n}$ to $\mathrm{S}_{n+1}\in\mathcal{S}_{n+1}$ is given by

$\mathsf{R}(\mathrm{S}_{n},\mathrm{S}_{n+1})=\frac{1}{N_{\rm rx}}\Big\{\mathbb{E}\big[\|\hat{\mathbf{H}}(\mathrm{S}_{n})-\mathbf{H}\|_{\rm F}^{2}\big]-\mathbb{E}\big[\|\hat{\mathbf{H}}(\mathrm{S}_{n+1})-\mathbf{H}\|_{\rm F}^{2}\big]\Big\}.$  (16)

III-C3 Action

The action set of the MDP is defined as $\mathcal{A}=\{1,0\}$, which indicates whether to exploit the current detected symbol vector as an additional pilot signal. For example, the action $a=1$ implies that the detected symbol vector will be exploited as a pilot signal.

III-C4 State Transition

From the definitions of the state and action, the current state is updated using the detected symbol vector when $a=1$; otherwise, the current state remains unchanged. Thus, the state $\mathsf{U}(\mathrm{S}_{n}|a)\in\mathcal{S}_{n+1}$ that can be reached from the current state $\mathrm{S}_{n}=(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},\mathbf{a}_{n})\in\mathcal{S}_{n}$ is given by

$\mathsf{U}(\mathrm{S}_{n}|a)=\begin{cases}\big([\mathbf{X}_{n},\mathbf{x}_{k_{n}}],[\hat{\mathbf{X}}_{n},\hat{\mathbf{x}}[n]],[\mathbf{a}_{n},1]\big),&a=1,\\ \big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},[\mathbf{a}_{n},0]\big),&a=0.\end{cases}$  (17)

III-C5 Optimal Policy

The optimal policy of the MDP for a state $\mathrm{S}_{n}\in\mathcal{S}_{n}$ is defined as

$\pi^{\star}(\mathrm{S}_{n})=\operatorname*{argmax}_{a\in\mathcal{A}}\,\mathsf{Q}(\mathrm{S}_{n},a),$  (18)

where $\mathsf{Q}(\mathrm{S}_{n},a)$ is the Q-value function that represents the optimal sum of the rewards obtained after taking the action $a\in\mathcal{A}$ at the state $\mathrm{S}_{n}$. By the definition in (17), $\mathsf{Q}(\mathrm{S}_{n},a)$ can be expressed as

$\mathsf{Q}(\mathrm{S}_{n},a)=\mathsf{R}\big(\mathrm{S}_{n},\mathsf{U}(\mathrm{S}_{n}|a)\big)+\gamma\,\mathsf{V}^{\star}\big(\mathsf{U}(\mathrm{S}_{n}|a)\big),$  (19)

where $0\leq\gamma\leq 1$ is a discounting factor, and $\mathsf{V}^{\star}(\mathrm{S}_{m})$ is the optimal value function, i.e., the optimal sum of the rewards that can be obtained from the state $\mathrm{S}_{m}\in\mathcal{S}_{m}$ with $m\in\{n+1,\ldots,T_{\rm u}\}$. The optimal value function for a state $\mathrm{S}_{m}\in\mathcal{S}_{m}$ can be recursively computed as

$\mathsf{V}^{\star}(\mathrm{S}_{m})=\sum_{a\in\mathcal{A}}\pi^{\star}(\mathrm{S}_{m},a)\Big(\mathsf{R}\big(\mathrm{S}_{m},\mathsf{U}(\mathrm{S}_{m}|a)\big)+\gamma\,\mathsf{V}^{\star}\big(\mathsf{U}(\mathrm{S}_{m}|a)\big)\Big),$  (20)

where $\pi^{\star}(\mathrm{S}_{m},a)$ is the probability of choosing action $a$ at the state $\mathrm{S}_{m}$ according to the optimal policy. In Fig. 2, we depict the state-action diagram of the MDP defined above. In this figure, the state $\mathrm{S}_{n}$ transitions to the next state $\mathsf{U}(\mathrm{S}_{n}|a)$ when taking an action $a$. In particular, when $a=1$, the state $\mathrm{S}_{n}$ transitions to the state $\mathsf{U}(\mathrm{S}_{n}|1)$ by exploiting the transmitted symbol index $k_{n}$. Based on the state transition and the optimal policy in (18), the states transition to the next states until the end of the data subblock.

Figure 2: State-action diagram of the original MCTS for $a\in\mathcal{A}$ and $\mathrm{S}_{n}\in\mathcal{S}_{n}$.

Characterizing the optimal policy of the above MDP faces two major challenges in practical communication systems. First, the state transition is unknown at the receiver due to the lack of information about the transmitted symbol vectors. Second, the number of states in this MDP increases exponentially with $T_{\rm u}$ (see Fig. 2). To circumvent these challenges, in the following section, we design a computationally efficient algorithm to solve the MDP without perfect knowledge of the state transition and the reward function.

IV Proposed Channel Estimator via Reinforcement Learning

RL is a type of machine learning that can find the optimal policy of an MDP with unknown or partial information about the environment's dynamics [27]. In this section, we propose an efficient RL algorithm to approximately but efficiently determine the optimal policy of the MDP in Sec. III-C. We then present the semi-data-aided channel estimator that utilizes the proposed RL algorithm for optimizing the symbol vector selection. We also introduce an additional strategy to improve detection performance after the symbol vector selection in the semi-data-aided channel estimator.

IV-A Proposed RL Algorithm

The key idea of the proposed RL algorithm is to exploit the APP computed from data detection to approximately determine the optimal policy based on MCTS [27, 28, 29]. In the proposed algorithm, we modify the original MCTS to make this approach applicable at the receiver in practical systems. In what follows, we elaborate on the details of the proposed RL algorithm applied to determine the optimal policy for the state $\mathrm{S}_{n}\in\mathcal{S}_{n}$ with $n\in\{1,\ldots,T_{\rm u}\}$.

IV-A1 Tree Policy and Rollout Policy

The basic idea of MCTS is to determine the best action at the current state by looking ahead at the rewards of near-future actions according to a tree policy, while approximating the rewards of distant-future actions according to a rollout policy [27]. Typically, the tree policy is designed to mimic the optimal policy, while the design of the rollout policy focuses more on computational simplicity and tractability. To design an effective tree policy for the proposed algorithm, our intuition is that the more reliable a detected symbol vector is, the higher the probability of selecting it as an additional pilot signal should be. Inspired by this intuition, we exploit the APP computed from data detection as a measure of the reliability of the detected symbol vector. We then set the tree policy of the proposed algorithm as

$\pi^{\rm t}(\mathrm{S}_{m},a)=\begin{cases}\hat{\theta}_{\hat{k}_{m}}[m],&a=1,\\ 1-\hat{\theta}_{\hat{k}_{m}}[m],&a=0,\end{cases}$  (21)

for every state $\mathrm{S}_{m}\in\mathcal{S}_{m}$ with $m\in\{n+1,\ldots,n+N\}$, where $N$ is the number of near-future actions taken according to the tree policy. We denote the sequence of actions randomly chosen by the tree policy in (21) by $\mathbf{a}^{\rm t}=[a_{1}^{\rm t},\cdots,a_{N}^{\rm t}]\in\mathcal{A}^{N}$. To determine an effective rollout policy, we introduce a pre-determined threshold $\eta_{\rm roll}$ with $0\leq\eta_{\rm roll}\leq 1$. We then choose the action $a=1$ if the APP is higher than $\eta_{\rm roll}$ and $a=0$ otherwise, i.e.,

$\pi^{\rm r}(\mathrm{S}_{m})=\begin{cases}1,&\hat{\theta}_{\hat{k}_{m}}[m]\geq\eta_{\rm roll},\\ 0,&\hat{\theta}_{\hat{k}_{m}}[m]<\eta_{\rm roll},\end{cases}$  (22)

for every state $\mathrm{S}_{m}\in\mathcal{S}_{m}$ associated with time slot $m\in\{n+N+1,\ldots,T_{\rm u}\}$. Our rollout policy is useful for reducing the computational complexity of the value function estimation after $N$ state transitions. Meanwhile, this policy also mimics the behavior of the tree policy in (21) when the detected symbol vector is reliable enough (i.e., $\hat{\theta}_{\hat{k}_{m}}[m]\approx 1$). We denote the sequence of actions determined by the rollout policy in (22) by $\mathbf{a}^{\rm r}=[a_{1}^{\rm r},\cdots,a_{T_{\rm u}-n-N}^{\rm r}]\in\mathcal{A}^{T_{\rm u}-n-N}$.
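The two policies reduce to simple operations on the per-slot APPs of the detected symbol vectors: Bernoulli sampling for (21) and thresholding for (22). A minimal sketch follows, where `app_max[m]` stands for $\hat{\theta}_{\hat{k}_{m}}[m]$ (the variable name is ours).

```python
import numpy as np

def tree_policy_sample(app_max, rng):
    """Tree policy (21): draw a_m ~ Bernoulli(theta_hat_{k_hat_m}[m]) per near-future slot."""
    return (rng.random(len(app_max)) < np.asarray(app_max)).astype(int)

def rollout_policy(app_max, eta_roll=0.5):
    """Rollout policy (22): deterministically set a_m = 1 iff the APP exceeds eta_roll."""
    return (np.asarray(app_max) >= eta_roll).astype(int)

rng = np.random.default_rng(0)
a_t = tree_policy_sample([0.9, 0.4, 0.7], rng)   # N = 3 near-future actions
a_r = rollout_policy([0.95, 0.3])                # remaining distant-future actions
```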

Figure 3: State-action diagram of the approximate MCTS for $a\in\mathcal{A}$ and $\mathrm{S}_{n}\in\mathcal{S}_{n}$.

IV-A2 Approximation for Monte Carlo Simulations

In the original MCTS, the optimal value function is estimated through Monte Carlo simulations according to the tree policy and the rollout policy [27]. Unfortunately, a receiver in practical communication systems cannot adopt such a simulation-based approach because executing the Monte Carlo simulations requires perfect information of the transmitted symbol vectors at the receiver. To circumvent this limitation, we introduce a virtual state that mimics the effect of Monte Carlo simulations without actual execution. The virtual state is defined as the state reached when the true symbol vector exactly behaves like its expectation:

$\mathbb{E}\big[\mathbf{x}[m]\,\big|\,\mathbf{y}[m],\mathbf{H}\big]=\sum_{j=1}^{K}\theta_{j}[m]\,\mathbf{x}_{j},$  (23)

for $m\in\{n,\ldots,T_{\rm u}\}$. We refer to the expectation in (23) as the expected symbol vector at time slot $m$. Since the receiver cannot compute the exact APP due to the lack of perfect channel knowledge, we use an approximate APP for tracking both the tree policy and the rollout policy. When tracking the tree policy, we use an accurate APP estimate based on a new channel estimate obtained by taking the series of actions dictated by the tree policy. Let $\mathbf{a}^{\rm t}\in\mathcal{A}^{N}$ be the sequence of the actions chosen by the tree policy in (21). When taking the actions in $\mathbf{a}^{\rm t}$ from the state $\mathrm{S}_{n}=\big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},\mathbf{a}_{n}\big)\in\mathcal{S}_{n}$, the LMMSE channel estimate is given by

$\hat{\mathbf{H}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})=\bar{\mathbf{Y}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})\bar{\mathbf{X}}^{H}(\mathrm{S}_{n};\mathbf{a}^{\rm t})\big(\bar{\mathbf{X}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})\bar{\mathbf{X}}^{H}(\mathrm{S}_{n};\mathbf{a}^{\rm t})+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1},$  (24)

where

$\bar{\mathbf{X}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})=\big[\hat{\mathbf{X}}_{n},\tilde{\mathbf{x}}[n+l_{1}(\mathbf{a}^{\rm t})],\cdots,\tilde{\mathbf{x}}[n+l_{\|\mathbf{a}^{\rm t}\|_{0}}(\mathbf{a}^{\rm t})]\big],$
$\bar{\mathbf{Y}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})=\big[\mathbf{Y}(\mathrm{S}_{n}),\mathbf{y}[n+l_{1}(\mathbf{a}^{\rm t})],\cdots,\mathbf{y}[n+l_{\|\mathbf{a}^{\rm t}\|_{0}}(\mathbf{a}^{\rm t})]\big].$

Based on the channel estimate in (24), the APP estimate used for tracking the tree policy is determined as

$\hat{\theta}_{k}^{\rm t}[m]=\frac{\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[m]-\hat{\mathbf{H}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})\mathbf{x}_{k}\|^{2}\big)}{\sum_{j\in\mathcal{K}}\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[m]-\hat{\mathbf{H}}(\mathrm{S}_{n};\mathbf{a}^{\rm t})\mathbf{x}_{j}\|^{2}\big)}.$  (25)

Unlike for the tree policy, we use the initial APP estimate in (11) when tracking the rollout policy, in order to reduce the required computational complexity. Utilizing the above strategy, we approximate the expected symbol vector as

$\tilde{\mathbf{x}}[m]=\begin{cases}\sum_{j=1}^{K}\hat{\theta}_{j}^{\rm t}[m]\,\mathbf{x}_{j},&m\in\{n,n+1,\ldots,n+N\},\\ \sum_{j=1}^{K}\hat{\theta}_{j}[m]\,\mathbf{x}_{j},&m\in\{n+N+1,\ldots,T_{\rm u}\}.\end{cases}$  (26)

Under the assumption that $\mathbf{x}[m]=\tilde{\mathbf{x}}[m]$ for $m\in\{n,\ldots,T_{\rm u}\}$, the virtual state reached when taking the sequence of actions $\mathbf{a}=[a_{1},\ldots,a_{m}]\in\mathcal{A}^{m}$ from the state $\mathrm{S}_{n}\in\mathcal{S}_{n}$ is given by

$\tilde{\mathsf{U}}(\mathrm{S}_{n}|\mathbf{a})=\Big(\big[\mathbf{X}_{n},\tilde{\mathbf{X}}_{n}(\mathbf{a})\big],\big[\hat{\mathbf{X}}_{n},\hat{\mathbf{X}}_{n}(\mathbf{a})\big],[\mathbf{a}_{n},\mathbf{a}]\Big),$  (27)

where

$\tilde{\mathbf{X}}_{n}(\mathbf{a})=\big[\tilde{\mathbf{x}}[n+l_{1}(\mathbf{a})],\cdots,\tilde{\mathbf{x}}[n+l_{\|\mathbf{a}\|_{0}}(\mathbf{a})]\big],$
$\hat{\mathbf{X}}_{n}(\mathbf{a})=\big[\hat{\mathbf{x}}[n+l_{1}(\mathbf{a})],\cdots,\hat{\mathbf{x}}[n+l_{\|\mathbf{a}\|_{0}}(\mathbf{a})]\big].$

Using our policies in (21) and (22) and the virtual state in (27), we depict the state-action diagram of our MCTS approach in Fig. 3. The tree policy, which mimics the optimal policy, is applied for the time indices $\{n+1,\ldots,n+N\}$. Because the tree policy considers the effect of both actions, the number of states to compute is proportional to $2^{N}$. In contrast, the rollout policy considers only the reliable state associated with the higher APP and therefore requires much lower computational complexity.
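Constructing the virtual state mainly requires the expected symbol vectors in (23) and (26), i.e., APP-weighted averages of the candidate symbol vectors. A small sketch is shown below; `apps`, holding $\hat{\theta}_{j}[m]$ (or $\hat{\theta}_{j}^{\rm t}[m]$) per slot, is an assumed input.

```python
import numpy as np

def expected_symbols(apps, candidates):
    """Expected symbol vectors, eqs. (23)/(26).

    apps: K x T matrix whose (j, m) entry is the APP of candidate j at slot m;
    candidates: list of K symbol vectors of length N_tx.
    Returns an N_tx x T matrix whose m-th column is sum_j theta_j[m] * x_j."""
    X_cand = np.stack(candidates, axis=1)   # N_tx x K candidate matrix
    return X_cand @ np.asarray(apps)        # N_tx x T expected symbols
```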

IV-A3 Optimal Policy

Based on the MCTS approach explained above, we characterize the optimal policy of the MDP in Sec. III-C in closed form. This result is given in the following theorem:

Theorem 1

When employing the MCTS approach in Sec. IV-A, the optimal policy of the MDP in Sec. III-C for a state $\mathrm{S}_{n}=\big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},\mathbf{a}_{n}\big)\in\mathcal{S}_{n}$ is

$\pi^{\star}(\mathrm{S}_{n})=\mathbb{I}\Big[\sum_{\mathbf{a}^{\rm t}\in\mathcal{A}^{N}}\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})\,\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])\geq 0\Big],$  (28)

where

$\omega_{n}^{\rm t}(\mathbf{a})=\prod_{l=1}^{|\mathbf{a}|}\big(\hat{\theta}_{\hat{k}_{n+l}}[n+l]\big)^{a_{l}}\big(1-\hat{\theta}_{\hat{k}_{n+l}}[n+l]\big)^{1-a_{l}},$  (29)

$\Delta_{n}(\mathbf{a})=\|\mathbf{t}_{n}(\mathbf{a})\|^{2}\Big\{\sigma^{2}+\sigma^{4}\big(\|\mathbf{t}_{n}(\mathbf{a})\|^{2}-2\beta_{n}(\mathbf{a})\big)+\|\mathbf{v}_{n}(\mathbf{a})\|^{2}-\|\mathbf{e}_{n}(\mathbf{a})-\mathbf{u}_{n}(\mathbf{a})+\mathbf{v}_{n}(\mathbf{a})\|^{2}\Big\},$  (30)

and the related parameters are defined as

$\mathbf{Q}_{n}(\mathbf{a})=\big(\hat{\mathbf{X}}_{n}\hat{\mathbf{X}}_{n}^{H}+\hat{\mathbf{X}}_{n}(\mathbf{a})\hat{\mathbf{X}}_{n}^{H}(\mathbf{a})+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1},$
$\mathbf{D}_{n}(\mathbf{a})=\hat{\mathbf{X}}_{n}(\hat{\mathbf{X}}_{n}-\mathbf{X}_{n})^{H}+\hat{\mathbf{X}}_{n}(\mathbf{a})\big(\hat{\mathbf{X}}_{n}(\mathbf{a})-\tilde{\mathbf{X}}_{n}(\mathbf{a})\big)^{H}+\sigma^{2}\mathbf{I}_{N_{\rm rx}},$
$\mathbf{t}_{n}(\mathbf{a})=\frac{1}{\sqrt{1+\alpha_{n}(\mathbf{a})}}\mathbf{Q}_{n}(\mathbf{a})\hat{\mathbf{x}}[n],$
$\mathbf{e}_{n}(\mathbf{a})=\frac{1}{\sqrt{1+\alpha_{n}(\mathbf{a})}}\big(\hat{\mathbf{x}}[n]-\tilde{\mathbf{x}}[n]\big),$
$\mathbf{u}_{n}(\mathbf{a})=\mathbf{D}_{n}^{H}(\mathbf{a})\mathbf{t}_{n}(\mathbf{a}),$
$\mathbf{v}_{n}(\mathbf{a})=\frac{1}{\|\mathbf{t}_{n}(\mathbf{a})\|^{2}}\mathbf{D}_{n}^{H}(\mathbf{a})\mathbf{Q}_{n}(\mathbf{a})\mathbf{t}_{n}(\mathbf{a}),$
$\alpha_{n}(\mathbf{a})=\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}(\mathbf{a})\hat{\mathbf{x}}[n],$
$\beta_{n}(\mathbf{a})=\frac{1}{\|\mathbf{t}_{n}(\mathbf{a})\|^{2}}\mathbf{t}_{n}^{H}(\mathbf{a})\mathbf{Q}_{n}(\mathbf{a})\mathbf{t}_{n}(\mathbf{a}).$
Proof:

See Appendix A. ∎

The optimal policy in (28) determines the best action at the current state by considering the average reward over all possible sequences of $N$ future actions that can be chosen by the tree policy. In this context, the weight $\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})$ in (29) represents the probability of taking a certain sequence of actions, $\mathbf{a}^{\rm t}$, according to the tree policy in (21). As can be seen from Theorem 1, the most prominent feature of the proposed RL algorithm is that the optimal policy has a closed-form expression that can be computed in a deterministic way at the receiver in practical systems.

IV-A4 Low-Complexity Policy

A major limitation of the optimal policy in (28) is that the complexity of computing the policy increases exponentially with the number of near-future actions, $N$. Therefore, computing this policy is not affordable if $N$ is large, implying that we cannot arbitrarily increase $N$ to improve the performance of the proposed RL algorithm. To circumvent this limitation, we develop a low-complexity policy that approximates the optimal policy in (28) based on Monte Carlo sampling. Recall that the weighted sum in (28) is nothing but the expectation of $\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])$ because $\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})$ is the probability of obtaining an action sequence $\mathbf{a}^{\rm t}$ from the tree policy in (21). Motivated by this observation, we randomly draw $N_{\rm sample}$ samples of $\mathbf{a}^{\rm t}$ according to the tree policy. We then compute the empirical mean of $\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])$ by averaging the values of $\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])$ computed for the $N_{\rm sample}$ samples. Let $\hat{\Delta}_{n}$ be the empirical mean determined by this Monte Carlo sampling approach. Then the optimal policy is approximated as $\pi^{\star}(\mathrm{S}_{n})=\mathbb{I}[\hat{\Delta}_{n}\geq 0]$. We refer to this policy as the low-complexity policy for the state $\mathrm{S}_{n}$. The complexity required for determining the low-complexity policy increases linearly with the number of samples, $N_{\rm sample}$, and is independent of $N$. Therefore, the overall complexity of the proposed RL algorithm can be significantly reduced by harnessing the low-complexity policy with $N_{\rm sample}\ll 2^{N}$. It is also possible to reduce the mismatch between the optimal policy and the low-complexity policy by increasing $N_{\rm sample}$ at the cost of additional complexity.
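A minimal sketch of this Monte Carlo sampling step is given below. It assumes a helper `delta_n` that evaluates $\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])$ via the closed form (30); the running-average update mirrors Steps 7-12 of Algorithm 1.

```python
import numpy as np

def low_complexity_policy(app_max_tree, a_r, delta_n, n_sample=10, rng=None):
    """Approximate pi*(S_n) = I[Delta_hat_n >= 0] by averaging over tree-policy samples.

    app_max_tree: APPs theta_hat_{k_hat_m}[m] of the N near-future slots;
    a_r: rollout-policy actions from (22); delta_n: callable evaluating (30)."""
    rng = np.random.default_rng() if rng is None else rng
    app_max_tree = np.asarray(app_max_tree)
    delta_hat = 0.0
    for s in range(1, n_sample + 1):
        a_t = (rng.random(len(app_max_tree)) < app_max_tree).astype(int)  # sample from (21)
        # Running mean: Delta_hat <- (s-1)/s * Delta_hat + 1/s * Delta_n([a_t, a_r]).
        delta_hat += (delta_n(np.concatenate([a_t, a_r])) - delta_hat) / s
    return int(delta_hat >= 0)
```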

Remark (Applicability to other data detection methods): A key requirement of the proposed RL algorithm is the APPs, which can be directly obtained from the MAP data detection method. Despite this requirement, the proposed RL algorithm is universally applicable to any other soft-output data detection method that computes the log-likelihood ratios (LLRs) of the transmitted data bits. In this case, the proposed RL algorithm can utilize the APPs computed from the LLRs, which can be readily done at the receiver with a slight increase in computational complexity.

IV-B Proposed Channel Estimator

The proposed channel estimator adopts the RL algorithm in Sec. IV-A for optimizing the selection of detected symbol vectors utilized as additional pilot signals, and is summarized in Algorithm 1. In particular, depending on the choice of the policy determination strategy, the receiver computes either the optimal policy in Step 5 or the low-complexity policy in Steps 7-12. In Step 14, we consider the most probable state transition for the unknown transmitted symbol vector: the detected symbol index $\hat{k}_{n}$ is assumed to equal the transmitted symbol index if the action $1$ is chosen according to the optimal policy. The corresponding state $\hat{\mathsf{U}}(\mathrm{S}_{n}|a)\in\mathcal{S}_{n+1}$ is given by

$\hat{\mathsf{U}}(\mathrm{S}_{n}|a)=\begin{cases}\big([\mathbf{X}_{n},\mathbf{x}_{\hat{k}_{n}}],[\hat{\mathbf{X}}_{n},\hat{\mathbf{x}}[n]],[\mathbf{a}_{n},1]\big),&a=1,\\ \big(\mathbf{X}_{n},\hat{\mathbf{X}}_{n},[\mathbf{a}_{n},0]\big),&a=0.\end{cases}$  (31)

Finally, at time slot $T_{\rm u}$, we obtain the updated channel estimate $\hat{\mathbf{H}}^{\rm u}=\hat{\mathbf{H}}(\mathrm{S}_{T_{\rm u}+1})$.

Algorithm 1: The proposed semi-data-aided channel estimator
1: Set $\mathrm{S}_{1}=(\mathbf{P},\mathbf{P},\phi)$.
2: for $n=1$ to $T_{\rm u}$ do
3:   Determine $\mathbf{a}^{\rm r}$ according to (22).
4:   if Optimal policy then
5:     Compute $a^{\star}=\pi^{\star}(\mathrm{S}_{n})$ from (28).
6:   else if Low-complexity policy then
7:     Initialize $\hat{\Delta}_{n}=0$.
8:     for $s=1$ to $N_{\rm sample}$ do
9:       Randomly draw $\mathbf{a}^{\rm t}$ according to (21).
10:      Update $\hat{\Delta}_{n}\leftarrow\frac{s-1}{s}\hat{\Delta}_{n}+\frac{1}{s}\Delta_{n}([\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])$ from (30).
11:    end for
12:    Set $a^{\star}=\pi^{\star}(\mathrm{S}_{n})=\mathbb{I}[\hat{\Delta}_{n}\geq 0]$.
13:  end if
14:  Set $\mathrm{S}_{n+1}=\hat{\mathsf{U}}(\mathrm{S}_{n}|a^{\star})$ from (31).
15: end for
16: Set $\hat{\mathbf{H}}^{\rm u}=\hat{\mathbf{H}}(\mathrm{S}_{T_{\rm u}+1})$ from (15).
Figure 4: A block diagram of receive processing with the proposed semi-data-aided channel estimator.

IV-C Re-Detection of Unselected Symbol Vectors

An important byproduct of the proposed semi-data-aided channel estimator is the set of detected symbol vectors that are not selected as pilot signals for channel estimation. Since these vectors have turned out to be unreliable, we can treat them as potentially incorrectly detected symbol vectors. Motivated by this observation, to further improve detection performance, we utilize the final channel estimate determined by Algorithm 1 for re-detecting the received signals associated with the symbol vectors not selected by the proposed RL algorithm. Suppose that the final state and the channel estimate of Algorithm 1 are given by $\mathrm{S}^{\star}=\big(\mathbf{X}^{\star},\hat{\mathbf{X}}^{\star},\mathbf{a}^{\star}\big)$ and $\hat{\mathbf{H}}^{\star}$, respectively. Then the set of time slot indices associated with the unselected symbol vectors is expressed as

$\mathcal{T}_{0}(\mathbf{a}^{\star})=\{l\,|\,a_{l}^{\star}=0\},$  (32)

where $a_{l}^{\star}$ is the $l$-th entry of $\mathbf{a}^{\star}$. The optimal MAP data detection is performed again based on $\hat{\mathbf{H}}^{\star}$ for the received signals associated with the time slots in $\mathcal{T}_{0}(\mathbf{a}^{\star})$. This strategy yields

$\hat{k}_{n}=\operatorname*{argmax}_{k\in\mathcal{K}}\,\hat{\theta}_{k}^{\star}[n],\quad\forall n\in\mathcal{T}_{0}(\mathbf{a}^{\star}),$  (33)

where

$\hat{\theta}_{k}^{\star}[n]=\frac{\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\hat{\mathbf{H}}^{\star}\mathbf{x}_{k}\|^{2}\big)}{\sum_{j\in\mathcal{K}}\exp\big(-\frac{1}{\sigma^{2}}\|\mathbf{y}[n]-\hat{\mathbf{H}}^{\star}\mathbf{x}_{j}\|^{2}\big)}.$  (34)

In Fig. 4, we illustrate the overall receive processing with the proposed semi-data-aided channel estimator and the re-detection strategy, where $\mathcal{T}_{1}(\mathbf{a}^{\star})=\{l\,|\,a_{l}^{\star}=1\}$ and $\hat{\bm{\theta}}[n]=\big[\hat{\theta}_{1}[n],\cdots,\hat{\theta}_{K}[n]\big]^{T}$. Although the re-detection process requires additional complexity, it is executed only once and only for a portion of the received signals. Therefore, the complexity of our strategy is still lower than that of iterative data-aided channel estimation (e.g., [11, 12, 13, 14, 15, 16]), which requires multiple executions of channel estimation and data detection for all received signals.
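A short sketch of the re-detection step (32)-(34) is given below, reusing the hypothetical `map_detect` helper sketched in Sec. II-C; only the unselected slots are revisited with the final channel estimate.

```python
import numpy as np

def redetect_unselected(Y_data, H_star, a_star, candidates, sigma2, map_detect):
    """Re-run MAP detection with H_star on slots with a_l = 0, eqs. (32)-(34)."""
    T0 = np.flatnonzero(np.asarray(a_star) == 0)   # unselected slot indices, eq. (32)
    redetected = {}
    for n in T0:
        k_hat, _ = map_detect(Y_data[:, n], H_star, candidates, sigma2)
        redetected[int(n)] = k_hat                 # re-detected symbol index, eq. (33)
    return redetected
```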

V Simulation Results

In this section, using simulations, we evaluate the NMSE and BLER of the proposed channel estimator in a coded MIMO system with the MAP data detection method. In these simulations, we consider 4-QAM for symbol mapping and assume that $N_{\rm tx}=2$, $N_{\rm rx}=4$, $T_{\rm p}=4$, $T_{\rm u}=200$, and $T_{\rm d}=2048$. For channel coding, we adopt a rate-$1/2$ turbo code based on parallel concatenated codes with feedforward and feedback polynomials $(15,13)$ in octal notation. For performance comparison, we consider the following methods:

  • PCSI: This is an ideal case in which perfect channel state information is available at the receiver (i.e., $\hat{\mathbf{H}}=\mathbf{H}$).

  • Pilot-CE: This is a conventional pilot-aided channel estimator described in Sec. II-B.

  • Semi-CE (Opt): This is a semi-data-aided channel estimator in which correctly detected symbol vectors are utilized as additional pilot signals, assuming perfect knowledge of the transmitted symbol vectors. This can be interpreted as the true optimal policy of the MDP in Sec. III-C.

  • Semi-CE (Pro, Opt): This is a semi-data-aided channel estimator in which detected symbol vectors are selected according to the optimal policy determined by the proposed RL algorithm. The re-detection strategy discussed in Sec. IV-C is also adopted.

  • Semi-CE (Pro, Low): This is a semi-data-aided channel estimator in which detected symbol vectors are selected according to the low-complexity policy determined by the proposed RL algorithm. The re-detection strategy discussed in Sec. IV-C is also adopted.

  • Semi-CE (All): This is a semi-data-aided channel estimator in which all the expected symbol vectors in (23) are utilized as additional pilot signals.

  • Iter-CE: This is an iterative data-aided channel estimator in which the best $T_{\rm u}$ virtual pilots are utilized as additional pilot signals in every iteration. The number of iterations is set to $4$. This method is a slight modification of the method proposed in [16].

We set the parameters of the proposed RL algorithm as $(N, N_{\rm sample}, \eta_{\rm roll}) = (8, 10, 0.5)$ unless specified otherwise. We define a per-bit signal-to-noise ratio (SNR) as $E_{b}/N_{0}=\frac{1}{\log_{2}|\mathcal{X}|\,\sigma^{2}}$, and also define NMSE as $\frac{\|\hat{\mathbf{H}}-\mathbf{H}\|_{\rm F}^{2}}{\|\mathbf{H}\|_{\rm F}^{2}}$.
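For reference, the two performance metrics can be computed as in the short NumPy sketch below; this merely restates the definitions above and is not the simulation code used in the paper.

```python
import numpy as np

def nmse(H_hat, H):
    """NMSE = ||H_hat - H||_F^2 / ||H||_F^2."""
    return np.linalg.norm(H_hat - H, 'fro') ** 2 / np.linalg.norm(H, 'fro') ** 2

def per_bit_snr_db(sigma2, constellation_size=4):
    """E_b/N_0 = 1 / (log2|X| * sigma^2), in dB; here |X| = 4 for 4-QAM."""
    return 10 * np.log10(1.0 / (np.log2(constellation_size) * sigma2))
```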

Figure 5: NMSE and BLER of various channel estimators for different per-bit SNRs: (a) NMSE performance; (b) BLER performance.

Fig. 5 compares the NMSE and BLER of various channel estimators for different per-bit SNRs. It shows that the proposed channel estimator achieves better NMSE and BLER performance than the conventional pilot-aided channel estimator by exploiting detected symbol vectors as additional pilot signals. It is also shown that the proposed channel estimator outperforms Semi-CE (All), which exploits all the detected symbol vectors without a proper selection. Meanwhile, the SNR gap between the proposed channel estimator and Semi-CE (Opt) is only $0.5$ dB. These results demonstrate that the proposed RL algorithm effectively selects a set of detected symbol vectors that can improve the performance of semi-data-aided channel estimation. Another interesting observation is that Semi-CE (Pro, Low) performs similarly to Semi-CE (Pro, Opt); this result implies that the low-complexity policy, whose complexity is proportional to $N_{\rm sample}=10$, well approximates the optimal policy, whose complexity is proportional to $2^{N}=2^{8}=256$, by leveraging a Monte Carlo sampling method. Although Iter-CE achieves the best performance among the considered channel estimators, it significantly increases both the delay and the computational complexity of the overall receive processing because it requires repeated executions of both data detection and channel decoding.

Figure 6: Performance comparison of various channel estimators for different pilot lengths: (a) NMSE performance; (b) BLER performance.

Fig. 6 compares the NMSE and BLER of various channel estimators for different pilot lengths. It shows that the proposed channel estimator provides a significant performance gain over the conventional pilot-aided channel estimator regardless of the pilot length. It is also shown that a larger NMSE reduction is achieved in the case of $E_b/N_0=-2$ dB than in the case of $E_b/N_0=0$ dB. The reason behind this result is that the number of reliable detected symbol vectors increases as the detection performance improves, which allows the use of a more accurate MCTS approach in the proposed RL algorithm. Another interesting observation in Fig. 6(b) is that the proposed channel estimator with $T_{\rm p}=4$ even outperforms Pilot-CE with $T_{\rm p}=8$, which implies that the proposed estimator requires fewer pilot signals to achieve the same BLER performance.

Figure 7: Performance comparison of various channel estimators for different $T_{\rm u}$: (a) NMSE performance; (b) BLER performance.
Figure 8: NMSE performance of the proposed channel estimator according to the number of near-future actions $N$.

Fig. 7 compares the NMSE and BLER of the proposed channel estimator for different $T_{\rm u}$. Fig. 7(a) shows that the NMSE performance of the proposed channel estimator improves with $T_{\rm u}$. This gain is attained by increasing the number of detected symbol vectors that can be utilized as additional pilot signals. Thanks to this gain, Fig. 7(b) shows that the BLER performance with the proposed channel estimator also improves with $T_{\rm u}$. Another important observation is that the improvement in both the NMSE and BLER performances diminishes as $T_{\rm u}$ increases. This result implies that once a sufficient number of additional pilot signals is attained, there is no significant gain from further increasing the number of pilot signals, while the computational complexity of data-aided channel estimation is proportional to $T_{\rm u}$. Considering this fact, semi-data-aided channel estimation is an effective strategy for adjusting the performance-complexity trade-off of data-aided channel estimation by controlling $T_{\rm u}$.

Fig. 8 compares the NMSE of the proposed channel estimator for different numbers of near-future actions, $N$, in the MCTS approach. It shows that the NMSE performance of the proposed channel estimator improves with $N$ because increasing this number allows the proposed RL algorithm to estimate near-future rewards more accurately. This performance gain, however, comes at the cost of the computational complexity required to determine the best policy for the MDP. Considering this trade-off, in our simulations we set $N=8$, which provides sufficient accuracy in estimating the near-future rewards while preventing a significant increase in computational complexity.

Figure 9: Performance comparison of various channel estimators in time-varying channels.

Fig. 9 compares the BLER of various channel estimators in time-varying channels. To model these channels, we adopt the first-order Gauss-Markov process in [31, 32], in which the channel matrix at time slot $n$ is given by

$\mathbf{H}^{(n)}=\sqrt{1-\epsilon^{2}}\,\mathbf{H}^{(n-1)}+\epsilon\,\mathbf{E}^{(n)},$ (35)

for $n\in\{-T_{\rm p}+1,\ldots,T_{\rm d}\}$, where $\epsilon\in[0,1]$ is a temporal correlation coefficient and $\mathbf{E}^{(n)}\in\mathbb{C}^{N_{\rm rx}\times N_{\rm tx}}$ is an i.i.d. Gaussian random matrix with zero mean and unit variance. In this simulation, the temporal correlation coefficient is set to $\epsilon=1.5\times 10^{-2}$ or $\epsilon=10^{-2}$. Note that PCSI in Fig. 9 assumes perfect CSIR only at the beginning of data transmission (i.e., $n=1$); it does not assume perfect channel tracking during data transmission. Fig. 9 shows that the BLER performance loss due to channel estimation error is more severe when the wireless channel varies over time because accurate channel estimation is more challenging in time-varying channels. In particular, when $\epsilon=1.5\times 10^{-2}$, PCSI at $n=1$ still shows severe degradation in BLER performance if the receiver does not properly track temporal channel variations. In contrast, the proposed channel estimator is robust against temporal channel variations because it has the potential to track the channel variations during the first $T_{\rm u}$ time slots by exploiting detected symbol vectors as additional pilot signals.
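As an illustration of the channel model in (35), the following NumPy sketch generates one realization of the time-varying channel; the per-entry normalization of $\mathbf{E}^{(n)}$ (unit variance per complex entry) is our assumption about the convention used above.

```python
import numpy as np

def gauss_markov_channels(n_rx, n_tx, n_slots, eps, seed=0):
    """Generate channel matrices H^(1), ..., H^(n_slots) following (35)."""
    rng = np.random.default_rng(seed)

    def cgauss(shape):
        # Complex Gaussian entries with zero mean and unit variance
        return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

    H = cgauss((n_rx, n_tx))  # initial channel realization
    channels = [H]
    for _ in range(n_slots - 1):
        # First-order recursion: H^(n) = sqrt(1 - eps^2) H^(n-1) + eps E^(n)
        H = np.sqrt(1 - eps ** 2) * H + eps * cgauss((n_rx, n_tx))
        channels.append(H)
    return channels

# Example with the simulation parameters above: eps = 1.5e-2
chans = gauss_markov_channels(n_rx=4, n_tx=2, n_slots=2048, eps=1.5e-2)
```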

VI Conclusions

In this paper, we have proposed a semi-data-aided LMMSE channel estimator for MIMO systems. The key idea of the proposed estimator is to selectively exploit detected symbol vectors as additional pilot signals, while optimizing this selection via RL. To this end, we have defined the MDP for symbol vector selection and then developed a novel RL algorithm based on the MCTS approach. Using simulations, we have demonstrated that the proposed channel estimator reduces the NMSE in channel estimation, while improving the BLER of the system, compared to conventional pilot-aided channel estimation. Meanwhile, the proposed channel estimator significantly reduces communication latency for updating a channel estimate compared to conventional iterative data-aided channel estimators. An important future research direction is to develop a semi-data-aided channel estimator for wideband systems by modifying both the MDP and the proposed RL algorithm. It would also be interesting to develop a semi-data-aided channel estimator for time-varying systems by properly defining the reward function of the MDP to consider the effect of temporal channel variations.

Appendix

Let $\mathbf{C}_{e}(\mathrm{S}_{n})=\mathbb{E}\big[(\hat{\mathbf{h}}_{r}(\mathrm{S}_{n})-\mathbf{h}_{r})(\hat{\mathbf{h}}_{r}(\mathrm{S}_{n})-\mathbf{h}_{r})^{H}\big]$ be the error covariance matrix between $\mathbf{h}_{r}$ and $\hat{\mathbf{h}}_{r}(\mathrm{S}_{n})$, where $\hat{\mathbf{h}}_{r}(\mathrm{S}_{n})$ is the $r$-th row of $\hat{\mathbf{H}}(\mathrm{S}_{n})$. This covariance matrix does not depend on the index of the receive antenna because the channel and noise distributions are assumed to be identical across receive antennas. Therefore, the MSE of the channel estimate at the state $\mathrm{S}_{n}\in\mathcal{S}_{n}$ is given by

$\mathbb{E}\big[\|\hat{\mathbf{H}}(\mathrm{S}_{n})-\mathbf{H}\|_{\rm F}^{2}\big]=N_{\rm rx}\,{\rm Tr}\big[\mathbf{C}_{e}(\mathrm{S}_{n})\big].$ (36)

Utilizing this fact, the reward function in (16) associated with the state transition from $\mathrm{S}_{n}\in\mathcal{S}_{n}$ to $\mathrm{S}_{n+1}\in\mathcal{S}_{n+1}$ is computed as

$\mathsf{R}(\mathrm{S}_{n},\mathrm{S}_{n+1})={\rm Tr}\big[\mathbf{C}_{e}(\mathrm{S}_{n})-\mathbf{C}_{e}(\mathrm{S}_{n+1})\big].$ (37)

Meanwhile, when $\tilde{\mathsf{U}}(\mathrm{S}_{n}|a)\in\mathcal{S}_{n+1}$, the future reward in (20) can be expressed by exploiting (21), (22), and (27) as

$\mathsf{V}^{\star}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|a)\big)
=\sum_{\mathbf{a}^{\rm t}\in\mathcal{A}^{N}}\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})\Bigg[\sum_{m=1}^{N}\mathsf{R}\Big(\tilde{\mathsf{U}}\big(\mathrm{S}_{n}|[a,\mathbf{a}_{1:m-1}^{\rm t}]\big),\,\tilde{\mathsf{U}}\big(\mathrm{S}_{n}|[a,\mathbf{a}_{1:m}^{\rm t}]\big)\Big)
+\sum_{l=1}^{T_{\rm u}-n-N}\mathsf{R}\Big(\tilde{\mathsf{U}}\big(\mathrm{S}_{n}|[a,\mathbf{a}^{\rm t},\mathbf{a}_{1:l-1}^{\rm r}]\big),\,\tilde{\mathsf{U}}\big(\mathrm{S}_{n}|[a,\mathbf{a}^{\rm t},\mathbf{a}_{1:l}^{\rm r}]\big)\Big)\Bigg],$ (38)

where $\mathbf{a}_{1:l}=[a_{1},\cdots,a_{l}]$ is a sub-vector of $\mathbf{a}$ for $l\leq m$. With a slight abuse of notation, we assume that $\mathbf{a}_{1:0}^{\rm t}$ is the empty set. By applying (37) and (38) to (19) and (20), the Q-value is obtained as

$\mathsf{Q}(\mathrm{S}_{n},a)={\rm Tr}\Big[\mathbf{C}_{e}(\mathrm{S}_{n})-\sum_{\mathbf{a}^{\rm t}\in\mathcal{A}^{N}}\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})\,\mathbf{C}_{e}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}^{\rm t},\mathbf{a}^{\rm r}])\big)\Big].$ (39)

Then the optimal policy in (18) is expressed as

$\pi^{\star}(\mathrm{S}_{n})
=\operatorname*{argmax}_{a\in\{0,1\}}\mathsf{Q}(\mathrm{S}_{n},a)
=\mathbb{I}\big[\mathsf{Q}(\mathrm{S}_{n},1)-\mathsf{Q}(\mathrm{S}_{n},0)\geq 0\big]
=\mathbb{I}\Big[\sum_{\mathbf{a}^{\rm t}\in\mathcal{A}^{N}}\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})\,\Delta_{n}(\mathbf{a})\geq 0\Big],$ (40)

where

$\Delta_{n}(\mathbf{a})={\rm Tr}\big[\mathbf{C}_{e}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[0,\mathbf{a}])\big)-\mathbf{C}_{e}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[1,\mathbf{a}])\big)\big],$ (41)
$\omega_{n}^{\rm t}(\mathbf{a}^{\rm t})=\prod_{l=1}^{|\mathbf{a}|}\pi^{\rm t}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}_{1:l-1}^{\rm t}]),a_{l}\big)
=\prod_{l=1}^{|\mathbf{a}|}\big(\hat{\theta}_{\hat{k}_{n+l}}[n+l]\big)^{a_{l}}\big(1-\hat{\theta}_{\hat{k}_{n+l}}[n+l]\big)^{1-a_{l}}.$ (42)
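Since the sum over $\mathbf{a}^{\rm t}\in\mathcal{A}^{N}$ in (40) has $2^{N}$ terms, the low-complexity policy evaluated in Sec. V replaces it with a Monte Carlo estimate: (42) is exactly the probability of drawing each entry $a_{l}$ from a Bernoulli distribution with parameter $\hat{\theta}_{\hat{k}_{n+l}}[n+l]$, so sampling future actions from those Bernoullis and averaging $\Delta_{n}(\cdot)$ approximates the weighted sum. The sketch below illustrates this idea under our own naming; `theta` and `delta_fn` are placeholders standing in for (42) and (41).

```python
import numpy as np

def mc_policy(theta, delta_fn, n_sample=10, rng=None):
    """Monte Carlo approximation of the policy in (40).

    theta    : length-N array, theta[l] = posterior reliability of the
               detected symbol at slot n+l+1 (the Bernoulli weights in (42))
    delta_fn : callable mapping a future-action vector a^t in {0,1}^N
               to Delta_n(a^t) as in (41)
    """
    rng = rng or np.random.default_rng(0)
    estimate = 0.0
    for _ in range(n_sample):
        # Draw a^t with P(a_l = 1) = theta[l]; the product weight in (42)
        # is accounted for implicitly by the sampling distribution.
        a_t = (rng.random(theta.shape) < theta).astype(int)
        estimate += delta_fn(a_t)
    estimate /= n_sample
    # Select the symbol (a = 1) iff the estimated MSE reduction is
    # non-negative, as in the indicator of (40)
    return int(estimate >= 0)
```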

The remaining task is to characterize $\Delta_{n}(\mathbf{a})$ in (41). From (17) and (27), the virtual state $\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}])\in\mathcal{S}_{n+m}$ is characterized as

$\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}])=(\mathbf{X}_{n+m},\hat{\mathbf{X}}_{n+m},\mathbf{a}_{n+m})
=\begin{cases}\big(\big[\mathbf{X}_{n},\mathbf{x}_{k_{n}},\tilde{\mathbf{X}}_{n}(\mathbf{a})\big],\big[\hat{\mathbf{X}}_{n},\hat{\mathbf{x}}[n],\hat{\mathbf{X}}_{n}(\mathbf{a})\big],[\mathbf{a}_{n},1,\mathbf{a}]\big), & a=1,\\ \big(\big[\mathbf{X}_{n},\tilde{\mathbf{X}}_{n}(\mathbf{a})\big],\big[\hat{\mathbf{X}}_{n},\hat{\mathbf{X}}_{n}(\mathbf{a})\big],[\mathbf{a}_{n},0,\mathbf{a}]\big), & a=0.\end{cases}$ (43)

Therefore, from (2) and (43), the distribution of $\bar{\mathbf{y}}_{r}^{H}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}])\big)$ is given by

$\bar{\mathbf{y}}_{r}^{H}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}])\big)\sim\mathcal{CN}\big(\mathbf{0}_{\|\mathbf{a}_{n+m}\|_{0}},\,\mathbf{X}_{n+m}^{H}\mathbf{X}_{n+m}+\sigma^{2}\mathbf{I}_{\|\mathbf{a}_{n+m}\|_{0}}\big).$ (44)

Using this fact, the error covariance matrix in (41) is expressed as

$\mathbf{C}_{e}\big(\tilde{\mathsf{U}}(\mathrm{S}_{n}|[a,\mathbf{a}])\big)=\sigma^{2}\mathbf{Q}_{n}([a,\mathbf{a}])-\sigma^{4}\mathbf{Q}_{n}^{2}([a,\mathbf{a}])+\mathbf{Q}_{n}([a,\mathbf{a}])\mathbf{D}_{n}([a,\mathbf{a}])\mathbf{D}_{n}^{H}([a,\mathbf{a}])\mathbf{Q}_{n}([a,\mathbf{a}]),$ (45)

where

$\mathbf{Q}_{n}([a,\mathbf{a}])=\big(\hat{\mathbf{X}}_{n+m}\hat{\mathbf{X}}_{n+m}^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1}
\overset{(a)}{=}\begin{cases}\big(\hat{\mathbf{X}}_{n}\hat{\mathbf{X}}_{n}^{H}+\hat{\mathbf{X}}_{n}(\mathbf{a})\hat{\mathbf{X}}_{n}^{H}(\mathbf{a})+\sigma^{2}\mathbf{I}_{N_{\rm tx}}\big)^{-1}, & a=0,\\ \big(\mathbf{Q}_{n}^{-1}([0,\mathbf{a}])+\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^{H}[n]\big)^{-1}, & a=1,\end{cases}$
$\mathbf{D}_{n}([a,\mathbf{a}])=\hat{\mathbf{X}}_{n+m}\big(\hat{\mathbf{X}}_{n+m}-\mathbf{X}_{n+m}\big)^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}
\overset{(b)}{=}\begin{cases}\hat{\mathbf{X}}_{n}\big(\hat{\mathbf{X}}_{n}-\mathbf{X}_{n}\big)^{H}+\hat{\mathbf{X}}_{n}(\mathbf{a})\big(\hat{\mathbf{X}}_{n}(\mathbf{a})-\tilde{\mathbf{X}}_{n}(\mathbf{a})\big)^{H}+\sigma^{2}\mathbf{I}_{N_{\rm tx}}, & a=0,\\ \mathbf{D}_{n}([0,\mathbf{a}])+\hat{\mathbf{x}}[n]\big(\hat{\mathbf{x}}[n]-\tilde{\mathbf{x}}[n]\big)^{H}, & a=1,\end{cases}$

and both (a) and (b) come from (43). By using the error covariance matrix in (45), $\Delta_{n}(\mathbf{a})$ in (41) is rewritten as

$\Delta_{n}(\mathbf{a})=\sigma^{2}\underbrace{{\rm Tr}\big[\mathbf{Q}_{n}([0,\mathbf{a}])-\mathbf{Q}_{n}([1,\mathbf{a}])\big]}_{=A_{n}(\mathbf{a})}-\sigma^{4}\underbrace{{\rm Tr}\big[\mathbf{Q}_{n}^{2}([0,\mathbf{a}])-\mathbf{Q}_{n}^{2}([1,\mathbf{a}])\big]}_{=B_{n}(\mathbf{a})}
+\underbrace{{\rm Tr}\big[\mathbf{Q}_{n}([0,\mathbf{a}])\mathbf{D}_{n}([0,\mathbf{a}])\mathbf{D}_{n}^{H}([0,\mathbf{a}])\mathbf{Q}_{n}([0,\mathbf{a}])-\mathbf{Q}_{n}([1,\mathbf{a}])\mathbf{D}_{n}([1,\mathbf{a}])\mathbf{D}_{n}^{H}([1,\mathbf{a}])\mathbf{Q}_{n}([1,\mathbf{a}])\big]}_{=C_{n}(\mathbf{a})}.$ (46)

By the matrix inversion lemma, the matrix $\mathbf{Q}_{n}([1,\mathbf{a}])$ is rewritten as

$\mathbf{Q}_{n}([1,\mathbf{a}])=\mathbf{Q}_{n}([0,\mathbf{a}])-\dfrac{\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}([0,\mathbf{a}])}{1+\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]}.$ (47)

From (47), the first term on the right-hand side (RHS) of (46) is expressed as

$A_{n}(\mathbf{a})={\rm Tr}\left[\dfrac{\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}([0,\mathbf{a}])}{1+\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]}\right]=\|\mathbf{t}_{n}(\mathbf{a})\|^{2},$ (48)

where $\mathbf{t}_{n}(\mathbf{a})=\frac{1}{\sqrt{1+\alpha_{n}(\mathbf{a})}}\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]$ with $\alpha_{n}(\mathbf{a})=\hat{\mathbf{x}}^{H}[n]\mathbf{Q}_{n}([0,\mathbf{a}])\hat{\mathbf{x}}[n]$. The second term on the RHS of (46) is expressed as

$B_{n}(\mathbf{a})={\rm Tr}\big[\big(\mathbf{Q}_{n}([0,\mathbf{a}])-\mathbf{Q}_{n}([1,\mathbf{a}])\big)^{H}\big(\mathbf{Q}_{n}([0,\mathbf{a}])+\mathbf{Q}_{n}([1,\mathbf{a}])\big)\big]
=2\beta_{n}(\mathbf{a})\|\mathbf{t}_{n}(\mathbf{a})\|^{2}-\|\mathbf{t}_{n}(\mathbf{a})\|^{4},$ (49)

where $\beta_{n}(\mathbf{a})=\frac{1}{\|\mathbf{t}_{n}(\mathbf{a})\|^{2}}\mathbf{t}_{n}^{H}(\mathbf{a})\mathbf{Q}_{n}([0,\mathbf{a}])\mathbf{t}_{n}(\mathbf{a})$. The last term on the RHS of (46) is computed as

$C_{n}(\mathbf{a})={\rm Tr}\Big[\mathbf{Q}_{n}([0,\mathbf{a}])\mathbf{D}_{n}([0,\mathbf{a}])\mathbf{D}_{n}^{H}([0,\mathbf{a}])\mathbf{Q}_{n}([0,\mathbf{a}])
-\big(\mathbf{Q}_{n}([0,\mathbf{a}])-\mathbf{t}_{n}(\mathbf{a})\mathbf{t}_{n}^{H}(\mathbf{a})\big)\big(\mathbf{D}_{n}([0,\mathbf{a}])+\hat{\mathbf{x}}[n](\hat{\mathbf{x}}[n]-\tilde{\mathbf{x}}[n])^{H}\big)
\times\big(\mathbf{D}_{n}([0,\mathbf{a}])+\hat{\mathbf{x}}[n](\hat{\mathbf{x}}[n]-\tilde{\mathbf{x}}[n])^{H}\big)^{H}\big(\mathbf{Q}_{n}([0,\mathbf{a}])-\mathbf{t}_{n}(\mathbf{a})\mathbf{t}_{n}^{H}(\mathbf{a})\big)\Big]
=\|\mathbf{t}_{n}(\mathbf{a})\|^{2}\big(\|\mathbf{v}_{n}(\mathbf{a})\|^{2}-\|\mathbf{e}_{n}(\mathbf{a})-\mathbf{u}_{n}(\mathbf{a})+\mathbf{v}_{n}(\mathbf{a})\|^{2}\big),$ (50)

where $\mathbf{e}_{n}(\mathbf{a})=\frac{1}{\sqrt{1+\alpha_{n}(\mathbf{a})}}(\hat{\mathbf{x}}[n]-\tilde{\mathbf{x}}[n])$, $\mathbf{v}_{n}(\mathbf{a})=\frac{1}{\|\mathbf{t}_{n}(\mathbf{a})\|^{2}}\mathbf{D}_{n}^{H}(\mathbf{a})\mathbf{Q}_{n}(\mathbf{a})\mathbf{t}_{n}(\mathbf{a})$, and $\mathbf{u}_{n}(\mathbf{a})=\mathbf{D}_{n}^{H}(\mathbf{a})\mathbf{t}_{n}(\mathbf{a})$. Applying (48)-(50) to (41) yields

$\Delta_{n}(\mathbf{a})=\|\mathbf{t}_{n}(\mathbf{a})\|^{2}\big(\sigma^{2}+\sigma^{4}\big(\|\mathbf{t}_{n}(\mathbf{a})\|^{2}-2\beta_{n}(\mathbf{a})\big)+\|\mathbf{v}_{n}(\mathbf{a})\|^{2}-\|\mathbf{e}_{n}(\mathbf{a})-\mathbf{u}_{n}(\mathbf{a})+\mathbf{v}_{n}(\mathbf{a})\|^{2}\big).$ (51)

Finally, we obtain the result in (28) from (40) with (51) and (42), where $\mathbf{Q}_{n}(\mathbf{a})=\mathbf{Q}_{n}([0,\mathbf{a}])$ and $\mathbf{D}_{n}(\mathbf{a})=\mathbf{D}_{n}([0,\mathbf{a}])$.
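The rank-one update in (47) is the Sherman-Morrison identity, which is what keeps the per-symbol Q-value updates cheap (no full matrix re-inversion when one detected symbol vector is added). As a quick numerical sanity check, the hedged NumPy sketch below verifies (47) on random data; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tx, sigma2 = 2, 0.1

# Random "pilot" matrix and the resulting Q_n([0, a]) per its definition above
X_hat = (rng.standard_normal((n_tx, 8)) + 1j * rng.standard_normal((n_tx, 8))) / np.sqrt(2)
Q0 = np.linalg.inv(X_hat @ X_hat.conj().T + sigma2 * np.eye(n_tx))

# Adding one detected symbol vector x_hat is a rank-one update of Q0^{-1}
x_hat = (rng.standard_normal((n_tx, 1)) + 1j * rng.standard_normal((n_tx, 1))) / np.sqrt(2)
Q1_direct = np.linalg.inv(np.linalg.inv(Q0) + x_hat @ x_hat.conj().T)

# Sherman-Morrison form of (47): no extra matrix inversion is needed
alpha = (x_hat.conj().T @ Q0 @ x_hat).real.item()
Q1_sm = Q0 - (Q0 @ x_hat @ x_hat.conj().T @ Q0) / (1.0 + alpha)

assert np.allclose(Q1_direct, Q1_sm)  # the two expressions agree
```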

References

  • [1] Y.-S. Jeon, J. Li, N. Tavangaran, and H. V. Poor, “Data-Aided Channel Estimator for MIMO Systems via Reinforcement Learning,” in Proc. IEEE Int. Conf. Commun. (ICC), Dublin, Ireland, Jun. 2020.
  • [2] G. J. Foschini, “Layered Space-Time Architecture for Wireless Communication in a Fading Environment When Using Multi-Element Antennas,” Bell Labs Tech. J., vol. 1, no. 2, pp. 41–59, Autumn 1996.
  • [3] I. E. Telatar, “Capacity of Multi-Antenna Gaussian Channels,” Europ. Trans. Telecommun., vol. 10, pp. 585–595, Nov./Dec. 1999.
  • [4] L. Zheng and D. N. C. Tse, “Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels,” IEEE Trans. Inf. Theory, vol. 49, no. 5, pp. 1073–1096, May 2003.
  • [5] O. Simeone, Y. Bar-Ness and U. Spagnolini, “Pilot-Based Channel Estimation for OFDM Systems by Tracking the Delay-Subspace,” IEEE Trans. Wireless Commun., vol. 3, pp. 315–325, Jan. 2004.
  • [6] H. M. Kim, D. Kim, T. K. Kim, and G. H. Im, “Frequency Domain Channel Estimation for MIMO SC-FDMA Systems With CDM Pilots,” J. Commun. Networks, vol. 16, no. 4, pp. 447–457, Aug. 2014.
  • [7] M. Biguesh and A. B. Gershman, “Training-Based MIMO Channel Estimation: A Study of Estimator Tradeoffs and Optimal Training Signals,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 884–893, Mar. 2006.
  • [8] M. K. Ozdemir and H. Arslan, “Channel Estimation for Wireless OFDM Systems,” IEEE Commun. Surveys Tuts., vol. 9, no. 2, pp. 18–48, Jul. 2007.
  • [9] A. Dowler, A. Nix, and J. McGeehan, “Data-Derived Iterative Channel Estimation with Channel Tracking for a Mobile Fourth Generation Wide Area OFDM System,” in Proc. IEEE Global Telecommun. Conf. (GLOBECOM), Dec. 2003.
  • [10] M. Liu, M. Crussière, and J.-F. Hélard, “A Novel Data-Aided Channel Estimation With Reduced Complexity for TDS-OFDM System,” IEEE Trans. Broadcast., vol. 58, no. 2, pp. 247–260, Jun. 2012.
  • [11] D. Kim, H. M. Kim, and G. H. Im, “Iterative Channel Estimation with Frequency Replacement for SC-FDMA Systems,” IEEE Trans. Commun., vol. 60, no. 7, pp. 1877–1888, Jul. 2012.
  • [12] D. Verenzuela, E. Björnson, X. Wang, M. Arnold, and S. t. Brink, “Massive-MIMO Iterative Channel Estimation and Decoding (MICED) in the Uplink,” IEEE Trans. Commun., vol. 68, no. 2, pp. 854–870, Feb. 2020.
  • [13] J. Ma and L. Ping, “Data-Aided Channel Estimation in Large Antenna Systems,” IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3111–3124, Jun. 2014.
  • [14] C. Huang, L. Liu, C. Yuen, and S. Sun, “Iterative Channel Estimation Using LSE and Sparse Message Passing for mmWave MIMO Systems,” IEEE Trans. Signal Process., vol. 67, no. 1, pp. 245–259, Jan. 2018.
  • [15] M. Zhao, Z. Shi, and M. C. Reed, “Iterative Turbo Channel Estimation for OFDM System over Rapid Dispersive Fading Channel,” IEEE Trans. Wireless Commun., vol. 7, no. 8, Aug. 2008.
  • [16] S. Park, B. Shim, and J. W. Choi, “Iterative Channel Estimation Using Virtual Pilot Signals for MIMO-OFDM Systems,” IEEE Trans. Signal Process., vol. 63, no. 12, pp. 3032–3045, Jun. 2015.
  • [17] S. Park, J. W. Choi, J. Y. Seol, and B. Shim, “Expectation-Maximization-based Channel Estimation for Multiuser MIMO Systems,” IEEE Trans. Commun., vol. 65, no. 6, pp. 2397–2410, Jun. 2017.
  • [18] D. Neumann, T. Wiese, and W. Utschick, “Learning the MMSE Channel Estimator,” IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2905–2917, Jun. 2018.
  • [19] J. Zhang, X. Ma, J. Qi, and S. Jin, “Designing Tensor-Train Deep Neural Networks for Time-Varying MIMO Channel Estimation,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 3, pp. 759–773, Apr. 2021.
  • [20] H. He, C. K. Wen, S. Jin, and G. Y. Li, “Deep Learning-Based Channel Estimation for Beamspace mmWave Massive MIMO Systems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852–855, Oct. 2018.
  • [21] Y. Yang, F. Gao, X. Ma, and S. Zhang, “Deep Learning-Based Channel Estimation for Doubly Selective Fading Channels,” IEEE Access, vol. 7, pp. 36579–36589, Mar. 2019.
  • [22] C. J. Chun, J. M. Kang, and I. M. Kim, “Deep Learning-Based Channel Estimation for Massive MIMO Systems,” IEEE Wireless Commun. Lett., vol. 8, no. 4, pp. 1228–1231, Aug. 2019.
  • [23] Y. Qiang, X. Shao, and X. Chen, “A Model-Driven Deep Learning Algorithm for Joint Activity Detection and Channel Estimation,” IEEE Commun. Lett., vol. 24, no. 11, pp. 2508–2512, Nov. 2020.
  • [24] Y. Wei, M. Zhao, M. Zhao, M. Lei, and Q. Yu, “An AMP-Based Network With Deep Residual Learning for mmWave Beamspace Channel Estimation,” IEEE Wireless Commun. Lett., vol. 8, no. 4, pp. 1289–1292, Aug. 2019.
  • [25] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-Driven Deep Learning for MIMO Detection,” IEEE Trans. Signal Process., vol. 68, pp. 1702–1715, Feb. 2020.
  • [26] X. Ma, Z. Gao, F. Gao, and M. D. Renzo, “Model-Driven Deep Learning Based Channel Estimation and Feedback for Millimeter-Wave Massive Hybrid MIMO Systems,” IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2388–2406, Aug. 2021.
  • [27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, Cambridge, MA: The MIT Press, 2018.
  • [28] C. B. Browne et al., “A Survey of Monte Carlo Tree Search Methods,” IEEE Trans. Comput. Intell. AI in Games, vol. 4, no. 1, pp. 1–43, Mar. 2012.
  • [29] T. Vodopivec, S. Samothrakis, and B. Ster, “On Monte Carlo Tree Search and Reinforcement Learning,” J. Artif. Intell. Res., vol. 60, pp. 881–936, Dec. 2017.
  • [30] Y.-S. Jeon, N. Lee, and H. V. Poor, “Robust Data Detection for MIMO Systems with One-Bit ADCs: A Reinforcement Learning Approach,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 1663–1676, Mar. 2020.
  • [31] M. Dong, L. Tong, and B. M. Sadler, “Optimal Insertion of Pilot Symbols for Transmissions over Time-Varying Flat Fading Channels,” IEEE Trans. Signal Process., vol. 52, no. 5, pp. 1403–1418, May 2004.
  • [32] T. K. Kim, Y.-S. Jeon, and M. Min, “Training Length Adaptation for Reinforcement Learning-Based Detection in Time-Varying Massive MIMO Systems With One-Bit ADCs,” IEEE Trans. Veh. Technol., vol. 70, no. 7, pp. 6999–7011, Jul. 2021.