
Theoretical Analysis of Deep Neural Networks in Physical Layer Communication

Jun Liu, Haitao Zhao, , Dongtang Ma, , Kai Mei and Jibo Wei Manuscript received February 15, 2022; revised July 3, 2022; accepted August 17, 2022. This work was supported in part by National Natural Science Foundation of China (NSFC) under Grant 61931020, 61372099 and 61601480. This paper has been presented in part at the 2022 IEEE Wireless Communications and Networking Conference Workshops [1]. (Corresponding author: Jun Liu.)Jun Liu, Haitao Zhao, Dongtang Ma, Kai Mei and Jibo Wei are with the College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China (E-mail: {liujun15, haitaozhao, dongtangma, meikai11, wjbhw}@nudt.edu.cn).
Abstract

Recently, deep neural network (DNN)-based physical layer communication techniques have attracted considerable interest. Although their potential to enhance communication systems and their superb performance have been validated by simulation experiments, little attention has been paid to theoretical analysis. Specifically, most studies in the physical layer have tended to focus on applying DNN models to wireless communication problems rather than on theoretically understanding how a DNN works in a communication system. In this paper, we aim to quantitatively analyze why DNNs can achieve performance comparable to traditional techniques in the physical layer, and also to derive their cost in terms of computational complexity. To achieve this goal, we first analyze the encoding performance of a DNN-based transmitter and compare it to a traditional one. We then theoretically analyze the performance of DNN-based estimators and compare them with traditional estimators. Third, we investigate and validate how information flows in a DNN-based communication system using information-theoretic concepts. Our analysis develops a concise way to open the “black box” of DNNs in physical layer communication, which can be applied to support the design of DNN-based intelligent communication techniques and help to provide explainable performance assessment.

Index Terms:
Theoretical analysis, deep neural network (DNN), physical layer communication, information theory.

I Introduction

The mathematical theories of communication systems have developed dramatically since Claude Elwood Shannon’s monograph “A mathematical theory of communication” [2] provided the foundation of digital communication. However, the wireless channel-related gap between theory and practice still needs to be filled due to the difficulty of precisely modeling wireless channels. Recently, deep neural networks (DNNs) have drawn a lot of attention as a powerful tool to solve science and engineering problems that are virtually impossible to formulate explicitly, such as protein structure prediction [3], image recognition [4], speech recognition [5] and natural language processing [6]. These promising approaches motivate researchers to implement DNNs in existing physical layer communication.

In order to mitigate the gap, a natural thought is to let a DNN jointly optimize a transmitter and a receiver for a given channel model without being limited to component-wise optimization. Autoencoders (AEs) are considered a tool to solve this problem. An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. It has two main parts: an encoder that maps the input into a code, and a decoder that maps the code to a reconstruction of the input. This structure is equivalent to the concept of a communication system. Along this thread, pure data-driven AE-based end-to-end communication systems were first proposed to jointly optimize transmitter and receiver components [7, 8, 9, 10]. T. O’Shea and J. Hoydis then consider the linear and nonlinear steps of processing the received signal as a radio transformer network (RTN) in [7], which can be integrated into the end-to-end training process. The idea of end-to-end learning of communication systems and RTNs through DNNs is extended to orthogonal frequency division multiplexing (OFDM) in [11] and multiple-input multiple-output (MIMO) in [12].

Unfortunately, training these DNNs over practical wireless channels is far from trivial. First, these methods require reliable feedback links. As Shannon described in [2], the fundamental problem of communication is that of “reproducing at one point either exactly or approximately a message selected at another point”. AE-based methods, however, present a “chicken and egg” problem: a reliable communication system is needed in the first place to carry out the error back-propagation that trains the end-to-end communication system, which leads to a paradox [13]. To tackle this problem, DNNs are trained offline and then tested online in practical applications. However, this implementation strategy leads to a second problem: these methods assume the availability of an explicit channel model to calculate the gradients for the optimization. Still, the unavailability of perfect channel information forces these methods to adopt a simulation-based strategy to train the DNN, which usually results in a model mismatch problem. Specifically, DNN models trained offline show significant performance degradation in testing unless the training and test sets follow the same probability distribution. Another solution is to sample the wireless channel by transmitting probe signals from a neural network-based transmitter. For example, since an AE requires a differentiable channel model, the method proposed in [14] calculates the gradient with respect to the neural network-based transmitter’s parameters through sampling of the channel distribution. Similarly, [15] utilizes a stochastic perturbation technique to train an end-to-end communication system without relying on explicit channel models, but the number of training samples is still prohibitive.

Another idea is to estimate the channel as accurately as possible and recover channel state information (CSI) by implementing a DNN at the receiver side so that the effects of fading can be reduced. This strategy can usually be divided into two main categories: one uses pure data to train a DNN (known as the data-driven approach) and the other combines data and existing models to train a DNN (known as the model-driven approach). In the data-driven manner, the neural networks (NNs) are optimized merely over a large training data set labeled by true channel responses, and a mathematically tractable model is unnecessary [16]. The authors of [17] propose a data-driven end-to-end DNN-based CSI compression feedback and recovery mechanism, which is further extended with long short-term memory (LSTM) to tackle time-varying massive MIMO channels [18]. To achieve better estimation performance and reduce computation cost, a compact and flexible deep residual network architecture is proposed in [19] to conduct channel estimation for an OFDM system based on downlink pilots. Nevertheless, the performance of data-driven approaches heavily depends on an enormous amount of labeled data, which cannot be easily obtained in wireless communication. To address this issue, a plethora of model-driven research has gradually been carried out to achieve efficient receivers [20, 21, 22, 23, 24]. Instead of relying only on a large amount of labeled data, the model-driven manner also uses domain knowledge to construct a DNN [25]. For example, in model-driven channel estimation, least-squares (LS) estimates are usually fed into a DNN, which then yields enhanced channel estimates. Furthermore, in order to mitigate disturbances beyond Gaussian noise, such as channel fading and nonlinear distortion, and to further reduce the computational complexity of training, [26] proposes an online fully complex extreme learning machine (ELM)-based channel estimation and equalization scheme.

Compared with traditional physical layer communication systems, the above-mentioned DNN-based techniques show competitive performance in simulation experiments. However, the dynamics behind the DNN in physical layer communication remain unknown. In the domain of information theory, a plethora of research has been conducted to investigate the process of learning. In [27], Tishby et al. propose the information bottleneck (IB) method, which provides a framework for discussing a variety of problems in signal processing, learning, etc. Then, in [28], DNNs are analyzed via the theoretical framework of the IB principle. In [29], variants of the IB problem are discussed. In [30], Tishby’s centralized IB method is generalized to the distributed setting. Reference [31] considers the IB problem of a Rayleigh fading MIMO channel with an oblivious relay. However, the problems considered in these studies differ from those encountered in wireless communication, where the complexity-relevance tradeoff is usually not of concern. Moreover, three major problems remain. (i) Although it has been shown by simulations that AE-based end-to-end communication systems can approach the optimal symbol error rate (SER), i.e., the SER of a system using an optimal constellation, a quantitative comparative analysis has not been conducted. (ii) How a DNN, as a module in a receiver, processes information has not been quantitatively investigated. (iii) The methodology for designing data sets and the structure of a neural network, which play an important role in neural network-based channel estimation, have not been theoretically studied.

In this paper, we attempt first to give a mathematical explanation that reveals the mechanism of end-to-end DNN-based communication systems. Then, we try to unveil the role of DNNs in the task of channel estimation. We believe that we have developed a concise way to open as well as understand the “black box” of DNNs in physical layer communication. The main contributions of this paper are fourfold:

  • Instead of proposing a scheme combining a DNN with a typical communication system, we analyze the behaviors of a DNN-based communication system from the perspectives of the whole DNN (communication system), the encoder (transmitter) and the decoder (receiver). We also analyze and compare the performance of the DNN-based transmitter with the conventional method, i.e., the gradient-search algorithm, in terms of both convergence properties and computational complexity.

  • We consider the task of channel estimation as a typical inference problem. Using information theory, we analyze and compare the performance of DNN-based channel estimation with the LS and linear minimum mean-squared error (LMMSE) estimators. Furthermore, we derive the analytical relation between the hyperparameters, as well as the training sets, and the performance.

  • We conduct computer simulations and the results verify that the constellations produced by AEs are equivalent to the (locally) optimum constellations obtained by the gradient-search algorithm which minimize the asymptotic probability of error in Gaussian noise under an average power constraint.

  • Through simulation experiments, our theoretical analysis is validated, and the information flow in the DNNs in the task of channel estimation is estimated by using a matrix-based functional of Rényi’s $\alpha$-entropy to approximate Shannon’s entropy.

To the best of our knowledge, there are typically two approaches to integrating DNNs with communication systems. (i) The holistic approach treats a communication system as an end-to-end process and uses an AE to replace the whole communication system. (ii) The phase-oriented approach only investigates the application of a DNN in a certain module of the communication process [32]. Therefore, without loss of generality, we mainly investigate two cases: the AE-based communication system and the DNN independently deployed at a receiver.

We note that a shorter conference version of this paper has appeared in the IEEE Wireless Communications and Networking Conference (2022). Our initial conference paper gives preliminary simulation results. This manuscript provides a comprehensive analysis and proofs.

The remainder of this paper is organized as follows. In Section II, we give the system model, and we then formulate the problem in Sections III and IV. Next, simulation results are presented in Section V. Finally, conclusions are drawn in Section VI.

Notations: The notations adopted in the paper are as follows. We use boldface lowercase letters $\mathbf{x}$, capital letters $\mathbf{X}$ and calligraphic letters $\mathcal{X}$ to denote column vectors, matrices and sets, respectively. For a matrix $\mathbf{X}$, we use $\mathbf{X}_{ij}$ to denote its $(i,j)$-th entry. For a vector $\mathbf{x}$, we use $\|\mathbf{x}\|_2$ to denote the Euclidean norm. For an $m\times n$ matrix $\mathbf{X}$, we use $\|\mathbf{X}\|_F=\sqrt{\sum_{i}^{m}\sum_{j}^{n}|\mathbf{X}_{ij}|^2}$ to denote the Frobenius norm and $\|\mathbf{X}\|_2=\sigma_{\max}(\mathbf{X})$ to denote the operator norm, where $\sigma_{\max}(\mathbf{X})$ represents the largest singular value of $\mathbf{X}$. If a matrix $\mathbf{X}$ is positive semi-definite, we use $\lambda_{\min}(\mathbf{X})$ to denote its smallest eigenvalue. We use $\langle\cdot,\cdot\rangle$ to denote the standard Euclidean inner product between two vectors or matrices. We let $[n]=\{1,2,\ldots,n\}$. We use $\mathcal{N}_d(\mathbf{0},\mathbf{I}_d)$ to denote the $d$-dimensional standard Gaussian distribution. We also use $O(\cdot)$ to denote standard Big-O notation, hiding only constants. In addition, $\odot$ denotes the Hadamard product, $\mathbb{E}[\cdot]$ denotes the expectation operation, and ${\rm tr}[\cdot]$ denotes the trace of a matrix.

Figure 1: Schematic diagram of a general communication system and its corresponding AE representation.

II System Model

In this section, we first describe the considered system model and then provide a detailed explanation of the problem formulation from two different perspectives in the following sections.

II-A Traditional Communication System

As shown in the upper part of Fig. 1, consider the process of message transmission from the perspective of a typical communication system. We assume that an information source generates a sequence of $\log_2 M$-bit message symbols $s\in\{1,2,\cdots,M\}$ to be communicated to the destination. Then the modulation module inside the transmitter maps each symbol $s$ to a signal $\mathbf{x}\in\mathbb{R}^d$, where $d$ denotes the dimension of the signal space. The signal alphabet is denoted by $\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_M$. During channel transmission, the $d$-dimensional signal $\mathbf{x}$ is corrupted to $\mathbf{y}\in\mathbb{R}^d$ with conditional probability density function (PDF) $p(\mathbf{y}|\mathbf{x})=\prod_{i=1}^{d}p(y_i|x_i)$. In this paper, we use $d/2$ bandpass channels, each with separately modulated in-phase and quadrature components, to transmit the $d$-dimensional signal [33]. Finally, the received signal is mapped by the demodulation module inside the receiver to $\hat{s}$, an estimate of the transmitted symbol $s$. The procedures mentioned above have been exhaustively presented by Shannon.

II-B Understanding Autoencoders on Message Transmission

From the point of view of filtering or signal inference, the idea of the AE-based communication system matches Norbert Wiener’s perspective [34]. As shown in the lower part of Fig. 1, the AE consists of an encoder and a decoder, each of which is a feedforward neural network with parameters (weights and biases) $\bm{\Theta}_f$ and $\bm{\Theta}_g$, respectively. Note that each symbol $s$ from the information source usually needs to be encoded as a one-hot vector $\mathbf{s}\in\mathbb{R}^M$ before being fed into the encoder. Under a given constraint (e.g., an average signal power constraint), the PDF of a wireless channel and a loss function that minimizes the symbol error probability, the encoder and decoder are respectively able to learn to appropriately represent $\mathbf{s}$ as $\mathbf{z}=f(\mathbf{s},\bm{\Theta}_f)$ and to map the corrupted signal $\mathbf{v}$ to an estimate of the transmitted symbol $\hat{\mathbf{s}}=g(\mathbf{v},\bm{\Theta}_g)$, where $\mathbf{z},\mathbf{v}\in\mathbb{R}^d$. Here, we use $\mathbf{z}_1,\mathbf{z}_2,\cdots,\mathbf{z}_M$ to denote the signals produced by the encoder in order to distinguish them from the signals produced by the traditional transmitter.

From the perspective of the whole AE (communication system), it aims to transmit information to a destination with low error probability. The symbol error probability, i.e., the probability that the wireless channel has shifted a signal point into another signal’s decision region, is

P_e = \frac{1}{M}\sum_{m=1}^{M}\Pr\left(\hat{\mathbf{s}}\neq\mathbf{s}_m \,\middle|\, \mathbf{s}_m~\mathrm{transmitted}\right). (1)

The AE can use the cross-entropy loss function

\mathcal{L}_{\log}\left(\hat{\mathbf{s}},\mathbf{s};\bm{\Theta}_f,\bm{\Theta}_g\right) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{M}\mathbf{s}_i[j]\log\left(\hat{\mathbf{s}}_i[j]\right) = -\frac{1}{n}\sum_{i=1}^{n}\log\left(\hat{\mathbf{s}}_i[s]\right) (2)

to represent the cost brought by the inaccuracy of prediction, where $\mathbf{s}_i[j]$ denotes the $j$-th element of the $i$-th symbol in a training set with $n$ symbols. In order to train the AE to minimize the symbol error probability, the optimal parameters can be found by optimizing the loss function

\left(\bm{\Theta}_f^*,\bm{\Theta}_g^*\right) = \mathop{\arg\min}\limits_{\left(\bm{\Theta}_f,\bm{\Theta}_g\right)}\left[\mathcal{L}_{\log}\left(\hat{\mathbf{s}},\mathbf{s};\bm{\Theta}_f,\bm{\Theta}_g\right)\right] \quad \mathrm{subject~to~} \mathbb{E}\left[\|\mathbf{z}\|_2^2\right]\leq P_{\mathrm{av}} (3)

where $P_{\rm av}$ denotes the average power. In this paper, we set $P_{\rm av}=1/M$. Now, it is important to explain how the mapping $\mathbf{z}=f(\mathbf{s},\bm{\Theta}_f)$ varies after training is done.
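To make the setup above concrete, the following is a minimal PyTorch sketch of such an AE-based link, assuming a Gaussian channel and the power normalization required by (3); the layer sizes, the `snr_db` parameter and all variable names are illustrative choices rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, d = 8, 2                      # alphabet size and signal-space dimension
snr_db = 10.0                    # illustrative SNR (assumption)

class AE(nn.Module):
    def __init__(self, M, d):
        super().__init__()
        # encoder f(s; Theta_f): one-hot symbol -> d-dimensional signal
        self.encoder = nn.Sequential(nn.Linear(M, M), nn.ReLU(), nn.Linear(M, d))
        # decoder g(v; Theta_g): corrupted signal -> symbol logits
        self.decoder = nn.Sequential(nn.Linear(d, M), nn.ReLU(), nn.Linear(M, M))

    def forward(self, s_onehot):
        x = self.encoder(s_onehot)
        # normalization enforcing E[||z||^2] = P_av = 1/M, cf. (3) and (12)
        z = x / torch.sqrt(x.pow(2).sum(dim=1, keepdim=True).mean() * M)
        sigma2 = (1.0 / M) / (10 ** (snr_db / 10))      # noise power per symbol (assumption)
        v = z + torch.randn_like(z) * torch.sqrt(torch.tensor(sigma2 / d))
        return self.decoder(v)                          # logits; softmax folded into the loss

ae = AE(M, d)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(1000):                                   # training epochs (illustrative)
    s = torch.randint(0, M, (256,))                     # random symbols
    logits = ae(F.one_hot(s, M).float())
    loss = F.cross_entropy(logits, s)                   # empirical version of (2)
    opt.zero_grad(); loss.backward(); opt.step()
```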

III Encoder: Finding a Good Representation

Let us first focus on the encoder (transmitter). In the domain of communication, an encoder needs to learn a robust representation $\mathbf{z}=f_{\bm{\Theta}_f}(\mathbf{s})$ to transmit $\mathbf{s}$ against channel disturbances, including thermal noise, channel fading, nonlinear distortion, phase jitter, etc. This is equivalent to finding a coded (or uncoded) modulation scheme with a signal set of size $M$ that maps a symbol $\mathbf{s}$ to a point $\mathbf{z}$ for a given transmitted power, maximizing the minimum distance between any two constellation points. Usually the problem of finding good signal constellations for a Gaussian channel (the problem of constellation optimization is usually considered under the condition of the Gaussian channel; although the Rayleigh fading case has been studied in [35], its prerequisite is that the side information is perfectly known) is associated with the search for lattices with high packing density, which is a well-studied problem in the mathematical literature [36]. This issue can be addressed through the two different methods described below.

III-A Traditional Method: Gradient-Search Algorithm

The eminent work of [37] proposed a gradient-search algorithm to obtain optimum constellations. Consider a zero-mean stationary additive white Gaussian noise (AWGN) channel with one-sided spectral density $2N_0$. For large signal-to-noise ratio (SNR), the asymptotic approximation of (1) can be written as

P_e \sim \exp\left(-\frac{1}{8N_0}\min_{i\neq j}\left\|\mathbf{z}_i-\mathbf{z}_j\right\|_2^2\right). (4)

To minimize $P_e$, the problem can be formulated as

\left\{\mathbf{z}_m^*\right\}_{m=1}^{M} = \mathop{\arg\min}\limits_{\left\{\mathbf{z}_m\right\}_{m=1}^{M}}\left(P_e\right) \quad \mathrm{subject~to~} \mathbb{E}\left[\|\mathbf{z}\|_2^2\right]\leq P_{\mathrm{av}} (5)

where $\{\mathbf{z}_m^*\}_{m=1}^M$ denotes the optimal signal set. This optimization problem can be solved by using a constrained gradient-search algorithm. We arrange $\{\mathbf{z}_m\}_{m=1}^M$ as an $M\times d$ matrix

\mathbf{Z} = \left[\mathbf{z}_1,\mathbf{z}_2,\cdots,\mathbf{z}_M\right]^T. (6)

Then, the $k$-th step of the constrained gradient-search algorithm can be described by

\mathbf{Z}'_{k+1} = \mathbf{Z}_k - \eta_k \nabla P_e\left(\mathbf{Z}_k\right) (7a)
\mathbf{Z}_{k+1} = \frac{\mathbf{Z}'_{k+1}}{\sum_{i}\sum_{j}\left(\mathbf{Z}'_{k+1}[i,j]\right)^2} (7b)

where $\eta_k$ denotes the step size and $\nabla P_e(\mathbf{Z}_k)\in\mathbb{R}^{M\times d}$ denotes the gradient of $P_e$ with respect to the current constellation points. It can be written as

\nabla P_e\left(\mathbf{Z}_k\right) = \left[\mathbf{g}_1,\mathbf{g}_2,\cdots,\mathbf{g}_M\right]^T (8)

where

\mathbf{g}_m \sim -\sum_{i\neq m}\exp\left(-\frac{\left\|\mathbf{z}_m-\mathbf{z}_i\right\|_2^2}{8N_0}\right)\left(\frac{1}{\left\|\mathbf{z}_m-\mathbf{z}_i\right\|_2^2}+\frac{1}{4N_0}\right)\mathbf{1}_{\mathbf{z}_m-\mathbf{z}_i}. (9)

The vector $\mathbf{1}_{\mathbf{z}_m-\mathbf{z}_i}$ denotes the $d$-dimensional unit vector in the direction of $\mathbf{z}_m-\mathbf{z}_i$.
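As an illustration, a compact NumPy sketch of this constrained gradient search under the update (7) and the gradient (9) might look as follows; the initialization, the number of steps and the power-renormalization step are illustrative assumptions.

```python
import numpy as np

def gradient_search_constellation(M=8, d=2, N0=0.1, eta=2e-4, steps=1000, P_av=None):
    """Constrained gradient search for a size-M constellation in R^d, cf. (5)-(9)."""
    P_av = 1.0 / M if P_av is None else P_av
    rng = np.random.default_rng(0)
    Z = rng.uniform(-1, 1, size=(M, d))                 # random initial points
    Z *= np.sqrt(P_av * M / np.sum(Z ** 2))             # meet the average power constraint
    for _ in range(steps):
        G = np.zeros_like(Z)                            # gradient of P_e, cf. (8)-(9)
        for m in range(M):
            diff = Z[m] - Z                             # z_m - z_i for all i
            dist2 = np.sum(diff ** 2, axis=1)
            dist2[m] = np.inf                           # exclude i = m
            w = np.exp(-dist2 / (8 * N0)) * (1.0 / dist2 + 1.0 / (4 * N0))
            unit = diff / np.sqrt(dist2)[:, None]       # unit vectors in direction z_m - z_i
            G[m] = -np.sum(w[:, None] * unit, axis=0)
        Z = Z - eta * G                                 # update (7a)
        Z *= np.sqrt(P_av * M / np.sum(Z ** 2))         # renormalize to the power constraint, cf. (7b)
    return Z

constellation = gradient_search_constellation()
```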

Comparing (3) to (5), we can understand the mechanism of the encoder in an AE-based communication system. The optimization variables of the AE method are the parameters $\bm{\Theta}_f$ and $\bm{\Theta}_g$. In other words, the AE learns the constellation design by simultaneously optimizing the parameters of the encoder and the decoder; it does not directly optimize the constellation points $\mathbf{z}$. In contrast, the gradient-search algorithm directly optimizes $\mathbf{z}$. Although the optimization variables of these two methods are different, their goals of minimizing the SER are essentially identical. Therefore, the mapping function of the encoder can be represented as

\left\{f\left(\mathbf{s}_m,\bm{\Theta}_f^*\right)\right\}_{m=1}^{M} \to \left\{\mathbf{z}_m^*\right\}_{m=1}^{M} (10)

when the PDF used for generating training samples is a multivariate zero-mean normal distribution $\mathbf{v}-\mathbf{z}\sim\mathcal{N}_d(\mathbf{0},\bm{\Sigma})$, where $\mathbf{0}$ denotes the $d$-dimensional zero vector and $\bm{\Sigma}=(2N_0/d)\mathbf{I}$ is a $d\times d$ diagonal matrix. A detailed explanation is given in the next subsection.

III-B Neural Network-based Method

Unlike the gradient-search algorithm, models based on neural networks are created directly from data by an algorithm. However, in most cases, these models are "black boxes", which means that humans, even those who design them, cannot understand how variables are being combined to make predictions [38]. At this stage, the task of wireless communication does not require the interpretability of neural networks. Nevertheless, theoretical and comparative analyses of a neural network-based communication system are indispensable, since accurate and interpretable models already exist for its traditional counterpart.

Let the one-hot vector $\mathbf{s}\in\mathbb{R}^M$ be the input $\mathbf{x}$, let $\mathbf{W}^{(1)}\in\mathbb{R}^{m\times M}$ be the first weight matrix, let $\mathbf{W}^{(h)}\in\mathbb{R}^{m\times m}$ be the weight matrix at the $h$-th layer for $2\leq h\leq H$, and let $\sigma(\cdot)$ be the activation function. We assume the intermediate weight matrices are square for the sake of simplicity. There are $H_1$ and $H_2$ hidden layers at the transmitter and receiver sides, respectively. The prediction function can be defined recursively as

\mathbf{x}^{(h)} = \sqrt{\frac{c_\sigma}{m}}\,\sigma\left(\mathbf{W}^{(h)}\mathbf{x}^{(h-1)}\right), \quad h\in[H]\setminus\{H_1+1\} (11)

where $c_\sigma=\left(\mathbb{E}_{x\sim\mathcal{N}(0,1)}\left[\sigma(x)^2\right]\right)^{-1}$ is a scaling factor that normalizes the input in the initialization phase, and the $(H_1+1)$-th layer is defined as the channel layer. Note that $[H]\setminus\{H_1+1\}$ is the set of elements that belong to $[H]$ but not to $\{H_1+1\}$. To constrain the average power of the transmitted signal $P_{\rm av}$ to $1/M$, $\mathbf{x}^{(H_1)}$ is normalized as

\mathbf{z} = f\left(\mathbf{x},\Theta_f\right) = \sqrt{\frac{1}{\mathbb{E}\left[\|\mathbf{x}^{(H_1)}\|_2^2\right]M}}\,\mathbf{x}^{(H_1)}. (12)

Then, the effect on the transmitted signal resulting from a wireless channel can be expressed as

\mathbf{v} = h\left(\mathbf{z},\Theta_h\right) (13)

where $h_{\Theta_h}$ is the functional form of the wireless channel with parameter set $\Theta_h$ (in accordance with practice, and without introducing ambiguity, $h$ is used to denote a hidden layer and the channel in the contexts of neural networks and physical communication, respectively). Let $\mathbf{x}^{(H_1+1)}=\mathbf{v}$; finally, the received signal is demodulated as

\hat{\mathbf{s}} = g\left(\mathbf{x}^{(H_1+1)},\Theta_g\right) = \sigma^{(H)}\left(\mathbf{W}^{(H)}\mathbf{x}^{(H-1)}\right) (14)

where $\sigma^{(H)}(\cdot)={\rm softmax}(\cdot)$.

To theoretically analyze the neural network, some technical conditions are imposed on the activation function. The first condition is that $\sigma(\cdot)$ is Lipschitz and smooth. The second is that $\sigma(\cdot)$ is analytic and not a polynomial function. In this section, the softplus $\sigma^{(h)}(z)=\log(1+\exp(z))$ is chosen for the hidden layers, except for the $(H_1+1)$-th and $H$-th layers.

While training the deep neural network, a randomly initialized gradient descent algorithm is used to optimize the empirical loss (2). Specifically, for every layer $h\in[H]\backslash\{H_1+1\}$, each entry is sampled from a standard Gaussian distribution, $\mathbf{W}_{ij}^{(h)}\sim\mathcal{N}(0,1)$. Then, the parameters can be updated by gradient descent, for $k=1,2,\ldots$ and $h\in[H]\backslash\{H_1+1\}$, as

\mathbf{W}^{(h)}(k) = \mathbf{W}^{(h)}(k-1) - \eta\frac{\partial\mathcal{L}_{\log}\left(\Theta(k-1)\right)}{\partial\mathbf{W}^{(h)}(k-1)} (15)

where $\eta>0$ is the step size.

The update of the parameters $\Theta_g$ at the receiver side can be realized as

\Theta_g(k) = \Theta_g(k-1) - \eta\frac{\partial\mathcal{L}_{\log}\left(\Theta(k-1)\right)}{\partial g_{\Theta_g}}\frac{\partial g_{\Theta_g}}{\partial\Theta_g} (16)

since we know the function $g_{\Theta_g}(\cdot)$ entirely. At the transmitter side, it becomes

\Theta_f(k) = \Theta_f(k-1) - \eta\frac{\partial\mathcal{L}_{\log}\left(\Theta(k-1)\right)}{\partial g_{\Theta_g}}\frac{\partial g_{\Theta_g}}{\partial h_{\Theta_h}}\frac{\partial h_{\Theta_h}}{\partial f_{\Theta_f}}\frac{\partial f_{\Theta_f}}{\partial\Theta_f} (17)

where the terms $\frac{\partial g_{\Theta_g}}{\partial h_{\Theta_h}}$ and $\frac{\partial h_{\Theta_h}}{\partial f_{\Theta_f}}$ are difficult to acquire unless the channel $h_{\Theta_h}$ is fully known. In this paper, we consider both the Gaussian channel and the Rayleigh flat fading channel.

III-B1 Gaussian Channel

Let $\mathbf{n}\in\mathbb{R}^m$ be a white Gaussian noise vector in which each entry has variance $\sigma_n^2$. The output of the channel layer can be expressed as

\mathbf{x}^{(H_1+1)} = \mathbf{W}^{(H_1+1)}c_\sigma^{(H_1)}\mathbf{x}^{(H_1)} + \mathbf{n} (18)

where $\mathbf{W}^{(H_1+1)}=\mathbf{I}_m$ and $c_\sigma^{(H_1)}=\sqrt{1/\left(\mathbb{E}\left[\|\mathbf{x}^{(H_1)}\|_2^2\right]M\right)}$. Then, (18) can be expressed as

\mathbf{x}^{(H_1+1)} = \mathbf{W'}^{(H_1+1)}\mathbf{x'}^{(H_1)} (19)

where $\mathbf{W'}^{(H_1+1)}=\left[\mathbf{I}_m,\mathbf{n}\right]\in\mathbb{R}^{m\times(m+1)}$ denotes the equivalent weight matrix of the channel layer and $\mathbf{x'}^{(H_1)}=\left[c_\sigma^{(H_1)}\mathbf{x}^{(H_1)};1\right]\in\mathbb{R}^{m+1}$. Finally, the terms $\frac{\partial g_{\Theta_g}}{\partial h_{\Theta_h}}$ and $\frac{\partial h_{\Theta_h}}{\partial f_{\Theta_f}}$ can be written as

\frac{\partial g_{\Theta_g}}{\partial h_{\Theta_h}} = \frac{\partial g_{\Theta_g}}{\partial\mathbf{W'}^{(H_1+1)}} (20a)
\frac{\partial h_{\Theta_h}}{\partial f_{\Theta_f}} = \frac{\partial\mathbf{W'}^{(H_1+1)}}{\partial f_{\Theta_f}}. (20b)

Substituting (20) into (17), we obtain

\Theta_f(k) = \Theta_f(k-1) - \eta\frac{\partial\mathcal{L}_{\log}\left(\Theta(k-1)\right)}{\partial g_{\Theta_g}}\frac{\partial g_{\Theta_g}}{\partial\mathbf{W'}^{(H_1+1)}}\frac{\partial\mathbf{W'}^{(H_1+1)}}{\partial f_{\Theta_f}}\frac{\partial f_{\Theta_f}}{\partial\Theta_f}. (21)

III-B2 Fading Channel

We consider a Rayleigh flat fading channel for simplicity. It is not difficult to generalize our analysis to other fading channels, e.g., frequency selective fading channels.

To transmit the modulated signal $\mathbf{x'}^{(H_1)}$, $m/2$ bandpass channels are needed. We assume that the channel impulse responses of the bandpass channels are mutually independent, i.e., $\mathbf{W'}^{(H_1+1)}=\left[\mathbf{H},\mathbf{n}\right]\in\mathbb{R}^{m\times(m+1)}$ where

\mathbf{H} = \begin{bmatrix} h_{1,\rm I} & 0 & \cdots & 0 \\ 0 & h_{1,\rm Q} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_{m/2,\rm Q} \end{bmatrix}, (22)

and $\left(h_{1,\rm I},h_{1,\rm Q},\ldots,h_{m/2,\rm Q}\right)\sim\mathcal{N}_m\left(0,\mathbf{I}_m\right)$. The real and imaginary parts of the channel impulse response of the $i$-th bandpass channel are denoted by $h_{i,\rm I}$ and $h_{i,\rm Q}$, respectively. In this case, the parameter update at the transmitter side has the same form as (21).
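For concreteness, a small NumPy sketch of how the channel can be viewed as an equivalent (untrainable) layer $\mathbf{W'}^{(H_1+1)}$, covering both the Gaussian case of (19) and the fading case above, is given below; the dimensions and noise level are illustrative assumptions.

```python
import numpy as np

def channel_layer(x_norm, sigma_n=0.1, fading=False, rng=np.random.default_rng()):
    """Apply the equivalent channel-layer weights W' = [I, n] or [H, n], cf. (19) and (22).

    x_norm : length-m power-normalized signal, i.e. c_sigma^(H1) * x^(H1).
    """
    m = x_norm.shape[0]
    n = rng.normal(0.0, sigma_n, size=m)                # Gaussian noise acting as a bias column
    if fading:
        H = np.diag(rng.normal(0.0, 1.0, size=m))       # diagonal random channel gains, cf. (22)
    else:
        H = np.eye(m)                                   # Gaussian channel: identity weights
    W_prime = np.hstack([H, n[:, None]])                # W'^(H1+1) in R^{m x (m+1)}
    x_aug = np.concatenate([x_norm, [1.0]])             # x'^(H1) = [x_norm; 1]
    return W_prime @ x_aug                              # x^(H1+1)

x = np.random.randn(8)
x = x / np.sqrt(np.sum(x ** 2))                         # toy normalization
y_awgn = channel_layer(x)                               # Gaussian channel output
y_fade = channel_layer(x, fading=True)                  # fading channel output
```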

III-C Properties of Convergence

The convergence properties of the traditional algorithm, i.e., the gradient-search algorithm, have been exhaustively analyzed in [37]. The properties of the AE-based algorithm, however, have not been studied yet, and therefore we mainly try to analyze the convergence properties of the AE-based algorithm in this subsection.

In [39], Simon S. Du et al. analyze two-layer fully connected ReLU activated neural networks. It has been shown that, with over-parameterization, gradient descent provably converges to the global minimum of the empirical loss at a linear convergence rate. Then, they develop a unified proof strategy for the fully-connected neural network, ResNet and convolutional ResNet [40].

We replace (2) with the square loss function

\mathcal{L}_2 = \frac{1}{2}\sum_{i=1}^{n}\left(\hat{\mathbf{s}}_i[s]-1\right)^2. (23)

Then, the individual prediction at the $k$-th iteration can be denoted as

\hat{s}_i = g_{\Theta_g}\left(f_{\Theta_f}\left(\mathbf{s}_i\right)\right)[s]. (24)

Let $\hat{\mathbf{s}}(k)=\left(\hat{s}_1(k),\ldots,\hat{s}_n(k)\right)^T\in\mathbb{R}^n$, where $n$ denotes the size of the training set. Simon S. Du et al. [39, 40] show that, for a DNN, the sequence $\left\{\mathbf{1}-\hat{\mathbf{s}}(k)\right\}_{k=0}^{\infty}$ admits the dynamics

\mathbf{1}-\hat{\mathbf{s}}(k+1) = \left(\mathbf{I}-\eta\mathbf{G}(k)\right)\left(\mathbf{1}-\hat{\mathbf{s}}(k)\right) (25)

where

\mathbf{G}_{ij}(k) = \left\langle\frac{\partial\hat{s}_i(k)}{\partial\Theta(k)},\frac{\partial\hat{s}_j(k)}{\partial\Theta(k)}\right\rangle = \sum_{h=1}^{H}\left\langle\frac{\partial\hat{s}_i(k)}{\partial\mathbf{W}^{(h)}(k)},\frac{\partial\hat{s}_j(k)}{\partial\mathbf{W}^{(h)}(k)}\right\rangle \triangleq \sum_{h=1}^{H}\mathbf{G}_{ij}^{(h)}(k). (26)

The Gram matrix $\mathbf{G}^{(h)}\in\mathbb{R}^{n\times n}$ induced by the weights of the $h$-th layer is defined as $\mathbf{G}_{ij}^{(h)}(k)=\left\langle\frac{\partial\hat{s}_i(k)}{\partial\mathbf{W}^{(h)}(k)},\frac{\partial\hat{s}_j(k)}{\partial\mathbf{W}^{(h)}(k)}\right\rangle$ for $h=1,\ldots,H$. Note that for all $h\in[H]$, each entry of $\mathbf{G}^{(h)}$ is an inner product.

In [40], it has been shown that, if the width is large enough for all layers, then for all $k=0,1,\ldots$, $\mathbf{G}^{(H)}(k)$ is close to a fixed matrix $\mathbf{K}^{(H)}\in\mathbb{R}^{n\times n}$ which depends on the input data, the neural network architecture and the activation, but does not depend on the neural network parameters $\Theta$. The Gram matrix $\mathbf{K}^{(H)}$ is recursively defined as follows. For $(i,j)\in[n]\times[n]$ and $h\in[H-1]$,

\mathbf{K}_{ij}^{(0)} = \left\langle\mathbf{x}_i,\mathbf{x}_j\right\rangle, (27a)
\mathbf{A}_{ij}^{(h)} = \begin{pmatrix}\mathbf{K}_{ii}^{(h-1)} & \mathbf{K}_{ij}^{(h-1)} \\ \mathbf{K}_{ji}^{(h-1)} & \mathbf{K}_{jj}^{(h-1)}\end{pmatrix}, (27b)
\mathbf{K}_{ij}^{(h)} = c_\sigma\,\mathbb{E}_{(u,v)^T\sim\mathcal{N}\left(\mathbf{0},\mathbf{A}_{ij}^{(h)}\right)}\left[\sigma(u)\sigma(v)\right], (27c)
\mathbf{K}_{ij}^{(H)} = c_\sigma\,\mathbf{K}_{ij}^{(H-1)}\,\mathbb{E}_{(u,v)^T\sim\mathcal{N}\left(\mathbf{0},\mathbf{A}_{ij}^{(h)}\right)}\left[\sigma'(u)\sigma'(v)\right]. (27d)
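To illustrate the recursion in (27), the following NumPy sketch estimates $\mathbf{K}^{(H)}$ by Monte Carlo, using the softplus activation assumed above; the sample counts and input data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
softplus = lambda z: np.log1p(np.exp(z))
softplus_d = lambda z: 1.0 / (1.0 + np.exp(-z))           # derivative of softplus (sigmoid)

# c_sigma = 1 / E_{x~N(0,1)}[sigma(x)^2], estimated by Monte Carlo
g = rng.standard_normal(200_000)
c_sigma = 1.0 / np.mean(softplus(g) ** 2)

def bivariate_expect(f, A, n_mc=100_000):
    """Monte Carlo estimate of E[f(u)f(v)] with (u, v) ~ N(0, A)."""
    L = np.linalg.cholesky(A + 1e-9 * np.eye(2))          # jitter for numerical safety
    uv = rng.standard_normal((n_mc, 2)) @ L.T
    return np.mean(f(uv[:, 0]) * f(uv[:, 1]))

def gram_K(X, H=3):
    """Recursively compute K^(H) of (27) for inputs X (rows are the samples x_i)."""
    n = X.shape[0]
    K = X @ X.T                                           # K^(0)_ij = <x_i, x_j>
    for _ in range(H - 1):                                # intermediate layers
        K_new = np.empty_like(K)
        for i in range(n):
            for j in range(n):
                A = np.array([[K[i, i], K[i, j]], [K[j, i], K[j, j]]])
                K_new[i, j] = c_sigma * bivariate_expect(softplus, A)
        K = K_new
    K_H = np.empty_like(K)                                # final layer with sigma'
    for i in range(n):
        for j in range(n):
            A = np.array([[K[i, i], K[i, j]], [K[j, i], K[j, j]]])
            K_H[i, j] = c_sigma * K[i, j] * bivariate_expect(softplus_d, A)
    return K_H

X = np.eye(4)                                             # e.g. four one-hot inputs
print(gram_K(X, H=3))
```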
Figure 2: (a) An inference model with a DNN decoder of $(2S-1)$ hidden layers for learning. (b) The graph representation of the decoder with $(S-1)$ hidden layers in both the sub-encoder and sub-decoder. The solid arrows denote the direction of input feedforward propagation and the dashed arrows denote the direction of information flow in the error back-propagation phase.

However, the existence of the channel layer obstructs this recursive process. Specifically, in the case of the Gaussian channel, $\mathbf{W'}^{(H_1+1)}=[\mathbf{I}_m,\mathbf{n}]$ implies that the strict positive definiteness of the matrix $\mathbf{K}^{(H)}$ may not be guaranteed. As a result, the gradient descent dynamics do not enjoy a linear convergence rate as shown in (25).

In the case of the Rayleigh flat fading channel, $\mathbf{W'}^{(H_1+1)}=[\mathbf{H},\mathbf{n}]$ makes the situation worse. Since the fading channel is not static, the diagonal elements of the weight matrix $\mathbf{H}(k)$ in $\mathbf{W'}^{(H_1+1)}$ at the $k$-th iteration are a random sample from $\mathcal{N}_m(0,\mathbf{I}_m)$. At the stage of random initialization, suppose we have some perturbation $\|\mathbf{G}^{(1)}(0)-\mathbf{K}^{(1)}\|_2\leq\varepsilon_1$ in the first layer. This perturbation propagates to the $H$-th layer in the form

\left\|\mathbf{G}^{(H)}(0)-\mathbf{K}^{(H)}\right\|_2 \triangleq \varepsilon_H \lesssim 2^{O(H)}\varepsilon_1. (28)

Therefore, we need $\varepsilon_1\leq 1/2^{O(H)}$, which makes $m$ depend exponentially on $H$ [40]. Moreover, at the training stage, the perturbation in the $(H_1+1)$-th layer induced by the fading channel can disperse to the whole network, i.e., the averaged Frobenius norm

\frac{1}{\sqrt{m}}\left\|\mathbf{W}^{(h)}(k)-\mathbf{W}^{(h)}(0)\right\|_F (29)

is not small for all $k=0,1,\ldots$.

In addition, large biases $\mathbf{n}$ are introduced by the channel noise when the SNR is low. Although the biases do not impact the weights directly, their influence can spread to the whole network through forward and backward propagation.

IV Decoder: Inference

In this section, we zoom in on the lower right corner of Fig. 1 to investigate what happens inside the decoder (receiver). As Fig. 2(a) shows, for the task of DNN-based channel estimation, the problem can be formulated as an inference model. For the sake of convenience, we can denote the target output of the decoder as $\mathbf{z}$ instead of $\mathbf{s}$ because we can assume $\mathbf{z}=f_{\bm{\Theta}_f}(\mathbf{s})$ is a bijection. If the decoder is symmetric, it can also be seen as a sub-AE which consists of a sub-encoder and a sub-decoder; its bottleneck (or middlemost) layer code is denoted as $\mathbf{u}$. Here we use $\mathbf{z}$ to denote the CSI or the transmitted symbol which we desire to predict. The decoder infers a prediction $\hat{\mathbf{z}}=g_{\bm{\Theta}_g}(\mathbf{v})$ from its corresponding measurable variable $\mathbf{v}$. This inference problem can be addressed by the following two different methods.

IV-A Traditional Estimation Method

IV-A1 LS Estimator

Without loss of generality, we can consider this issue in the context of complex channel estimation. Let the measurable variable be

\mathbf{v} = \hat{\mathbf{h}}_{\rm LS} = \mathbf{h} + \mathbf{n} (30)

where $\hat{\mathbf{h}}_{\rm LS}$ denotes the LS estimate of the corresponding true channel response $\mathbf{h}\in\mathbb{C}^d$, and $\mathbf{n}\in\mathbb{C}^d$ is a vector of i.i.d. complex zero-mean Gaussian noise with variance $\sigma_n^2$. The noise $\mathbf{n}$ is assumed to be uncorrelated with the channel $\mathbf{h}$. The corresponding MSE is

{\rm MSE}_{\rm LS} = \mathbb{E}\left[\left\|\mathbf{h}-\hat{\mathbf{h}}_{\rm LS}\right\|_2^2\right] = d\sigma_n^2. (31)

IV-A2 LMMSE Estimator

If the channel is stationary and subject to $\mathbf{h}\sim\mathcal{N}_{\mathcal{C}}(\mathbf{0},\mathbf{R_{hh}})$, its LMMSE estimate can be expressed as

\hat{\mathbf{h}}_{\rm LMMSE} = \mathbf{R_{hh}}\left(\mathbf{R_{hh}}+\sigma_n^2\left(\mathbf{X}\mathbf{X}^H\right)^{-1}\right)^{-1}\hat{\mathbf{h}}_{\rm LS} (32)

where $\mathbf{R_{hh}}=\mathbb{E}\left[\mathbf{h}\mathbf{h}^H\right]$ is the channel autocorrelation matrix and $\mathbf{X}$ is a diagonal matrix containing the known transmitted signaling points [41, 42, 43]. The MSE of the LMMSE estimate is

{\rm MSE}_{\rm LMMSE} = \mathbb{E}\left[\left\|\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right\|_2^2\right] = {\rm tr}\left[\mathbf{R_{hh}}\left(\mathbf{I}_d+\frac{1}{\sigma_n^2}\mathbf{R_{hh}}\right)^{-1}\right] \leq {\rm MSE}_{\rm LS}. (33)

To perform (32), $\mathbf{R_{hh}}$ and $\sigma_n^2$ are assumed to be known.
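As a numerical illustration of (31)-(33), the short NumPy sketch below compares the empirical MSE of the LS and LMMSE estimators on a synthetic correlated channel; the channel covariance model, dimensions and noise level are illustrative assumptions, and the pilot matrix is taken as the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_trials, sigma2 = 16, 5000, 0.1                     # dimension, trials, noise variance

# Illustrative exponential correlation model for R_hh (assumption)
rho = 0.9
R_hh = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
L = np.linalg.cholesky(R_hh)

# LMMSE filter A = R_hh (R_hh + sigma2 * I)^(-1), cf. (32) with X = I
A = R_hh @ np.linalg.inv(R_hh + sigma2 * np.eye(d))

mse_ls, mse_lmmse = 0.0, 0.0
for _ in range(n_trials):
    # Proper complex Gaussian channel h ~ N_C(0, R_hh) and noise n ~ N_C(0, sigma2 I)
    h = L @ (rng.standard_normal(d) + 1j * rng.standard_normal(d)) / np.sqrt(2)
    n = np.sqrt(sigma2 / 2) * (rng.standard_normal(d) + 1j * rng.standard_normal(d))
    h_ls = h + n                                         # LS estimate, cf. (30)
    h_lmmse = A @ h_ls                                   # LMMSE estimate, cf. (32)
    mse_ls += np.sum(np.abs(h - h_ls) ** 2)
    mse_lmmse += np.sum(np.abs(h - h_lmmse) ** 2)

print("empirical MSE_LS    :", mse_ls / n_trials, " (theory:", d * sigma2, ")")
print("empirical MSE_LMMSE :", mse_lmmse / n_trials,
      " (theory:", np.trace(R_hh @ np.linalg.inv(np.eye(d) + R_hh / sigma2)), ")")
```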

Figure 3: The block diagram of the LMMSE estimator.

We define a complex space $\mathcal{G}$. Every element of $\mathcal{G}$ is a finite-variance, zero-mean, proper complex Gaussian random variable, and every subset of $\mathcal{G}$ is jointly Gaussian. The set of observed variables and its closure, i.e., the subspace generated by $\mathcal{V}$, are denoted as $\mathcal{V}\subset\mathcal{G}$ and $\bar{\mathcal{V}}$, respectively. Let $\mathbf{e}=\mathbf{h}-\hat{\mathbf{h}}$ be the estimation error, where $\hat{\mathbf{h}}\in\bar{\mathcal{V}}$ is a linear estimate of $\mathbf{h}$. By the projection and Pythagorean theorems, we have

\|\mathbf{e}\|^2 = \left\|\mathbf{h}_{|\mathcal{V}}+\mathbf{h}_{\bot\mathcal{V}}-\hat{\mathbf{h}}\right\|^2 = \left\|\mathbf{h}_{|\mathcal{V}}-\hat{\mathbf{h}}\right\|^2 + \left\|\mathbf{h}_{\bot\mathcal{V}}\right\|^2 \geq \left\|\mathbf{h}_{\bot\mathcal{V}}\right\|^2, (34)

with equality if and only if $\hat{\mathbf{h}}=\mathbf{h}_{|\mathcal{V}}$. For this reason, $\mathbf{h}_{|\mathcal{V}}$ is called the LMMSE estimate of $\mathbf{h}$ given $\mathbf{v}$, and $\mathbf{h}_{\bot\mathcal{V}}$ is called the MMSE estimation error. Moreover, the orthogonality principle holds: $\hat{\mathbf{h}}\in\bar{\mathcal{V}}$ is the LMMSE estimate of $\mathbf{h}$ given $\mathbf{v}$ if and only if $\mathbf{h}-\hat{\mathbf{h}}$ is orthogonal to $\bar{\mathcal{V}}$. Writing $\hat{\mathbf{h}}_{\rm LMMSE}$ as a set of linear combinations of the elements of $\mathbf{v}$ in matrix form, namely $\hat{\mathbf{h}}_{\rm LMMSE}=\mathbf{A_{hv}}\mathbf{v}$, we obtain a unique solution $\mathbf{A_{hv}}=\mathbf{R_{hv}}\mathbf{R_{vv}}^{-1}$. Fig. 3 illustrates that $\mathbf{h}$ can be decomposed into a linear estimate derived from $\mathbf{v}$ and an independent error variable $\mathbf{e}=\mathbf{h}_{\bot\mathcal{V}}$. Moreover, this block diagram implies that the MMSE estimate $\mathbf{h}_{|\mathcal{V}}$ is a sufficient statistic for estimating $\mathbf{h}$ from $\mathbf{v}$, since $\mathbf{v}-\mathbf{h}_{|\mathcal{V}}-\mathbf{h}$ is evidently a Markov chain; i.e., $\mathbf{v}$ and $\mathbf{h}$ are conditionally independent given $\mathbf{h}_{|\mathcal{V}}$. This is also known as the sufficiency property of the MMSE estimate [44].

By the sufficiency property, the MMSE estimate $\mathbf{h}_{|\mathcal{V}}$ is a function of $\mathbf{v}$ that satisfies the data processing inequality of information theory with equality: $I(\mathbf{h};\mathbf{v})=I(\mathbf{h};\mathbf{h}_{|\mathcal{V}})$ (note that no confusion should arise if we abuse the notation slightly by using a lower-case letter to denote a random variable). In other words, the reduction of $\mathbf{v}$ to $\mathbf{h}_{|\mathcal{V}}$ is information-lossless. Since $\mathbf{h}=\mathbf{A_{hv}}\mathbf{v}+\mathbf{h}_{\bot\mathcal{V}}$ is a linear Gaussian channel model with Gaussian input $\mathbf{v}$, Gaussian output $\mathbf{h}$, and independent additive Gaussian noise $\mathbf{e}=\mathbf{h}_{\bot\mathcal{V}}$, we have

I(\mathbf{h};\mathbf{v}) = h(\mathbf{h}) - h(\mathbf{h}|\mathbf{v}) = h(\mathbf{h}) - h(\mathbf{e}) = \log\frac{\left|\mathbf{R_{hh}}\right|}{\left|\mathbf{R_{ee}}\right|}, (35)

where

\mathbf{R_{ee}} = \mathbb{E}\left[\left(\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right)\left(\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right)^H\right] = \mathbb{E}\left[\left(\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right)\mathbf{h}^H\right] - \mathbb{E}\left[\left(\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right)\mathbf{v}^H\right]\mathbf{A_{hv}}^H = \mathbb{E}\left[\left(\mathbf{h}-\hat{\mathbf{h}}_{\rm LMMSE}\right)\mathbf{h}^H\right] = \mathbf{R_{hh}} - \mathbf{A_{hv}}\mathbf{R_{vh}}. (36)

IV-B DNN-based Estimation

Let $\mathcal{H}_{k,l}^\sigma$ be the hypothesis class associated with an $l$-layer neural network of size $k$ with activation function $\sigma$. We assume that $|\sigma(z)|\leq 1$ for all $z\in\mathbb{R}$. More specifically, $\mathcal{H}_{k,l}^\sigma$ is a set of functions $h:\mathbb{R}^d\to\mathbb{R}^d$. Given a random sample $D_n=\{(\mathbf{v}_1,\mathbf{z}_1),\ldots,(\mathbf{v}_n,\mathbf{z}_n)\}$, we define

{\rm MMSE}_{k,l,n}(\mathbf{z}|\mathbf{v}) := \mathop{\inf}\limits_{h\in\mathcal{H}_{k,l}^\sigma}\frac{1}{n}\sum_{i=1}^{n}\left\|\mathbf{z}_i-h(\mathbf{v}_i)\right\|_2^2, (37)

i.e., ${\rm MMSE}_{k,l,n}(\mathbf{z}|\mathbf{v})$ is the minimum empirical square loss attained by an $l$-layer neural network of size $k$. Mario Diaz et al. establish a probabilistic bound for the MMSE in estimating $\mathbf{z}\in\mathbb{R}^1$ given $\mathbf{v}\in\mathbb{R}^d$ based on the 2-layer estimator ${\rm MMSE}_{k,2,n}(\mathbf{z}|\mathbf{v})$ and the Barron constant of the conditional expectation of $\mathbf{z}$ given $\mathbf{v}$ [45, Theorem 1]. We extend this probabilistic bound to explore the MSE performance of DNN-based estimators.
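For reference, a minimal PyTorch sketch of such a size-$k$, $l$-layer estimator, trained to (approximately) attain the empirical loss in (37) on a sample $D_n$, is shown below; the width, depth, optimizer settings and synthetic data are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_estimator(d, k, l):
    """l-layer fully connected estimator h: R^d -> R^d of width k (tanh keeps |sigma| <= 1)."""
    layers, in_dim = [], d
    for _ in range(l - 1):
        layers += [nn.Linear(in_dim, k), nn.Tanh()]
        in_dim = k
    layers += [nn.Linear(in_dim, d)]
    return nn.Sequential(*layers)

# Synthetic training sample D_n = {(v_i, z_i)} with v = z + noise (illustrative)
d, n = 4, 2000
z = torch.rand(n, d) * 2 - 1                     # entries of z in [-1, 1]
v = z + 0.3 * torch.randn(n, d)

est = make_estimator(d, k=64, l=3)
opt = torch.optim.Adam(est.parameters(), lr=1e-3)
for _ in range(2000):
    loss = ((z - est(v)) ** 2).sum(dim=1).mean() # empirical square loss of (37)
    opt.zero_grad(); loss.backward(); opt.step()

print("empirical MSE of the trained DNN estimator:", loss.item())
```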

Assumption 1

Let $k,l,n\in\mathbb{N}$ and let $B\subset\mathbb{R}^d$ be a bounded set containing $0$. If each entry of $\mathbf{z}$ belongs to $[-1,1]$, $\mathbf{v}$ is supported on $B$, and the conditional expectation $\eta(\mathbf{v}):=\mathbb{E}[\mathbf{z}|\mathbf{v}]$ belongs to $\Gamma_B$, the set of all functions $h:B\to\mathbb{R}^d$, then, with probability at least $1-\delta$,

{\rm MMSE}_{k,l,n}(\mathbf{z}|\mathbf{v}) - \varepsilon_{k,l,n,\delta} \leq {\rm MMSE}(\mathbf{z}|\mathbf{v}) (38)

where

{\rm MMSE}(\mathbf{z}|\mathbf{v}) := \mathop{\inf}\limits_{h~{\rm meas.}}\mathbb{E}\left[\left(\mathbf{z}-h(\mathbf{v})\right)^2\right] = \mathbb{E}\left[\left(\mathbf{z}-\eta(\mathbf{v})\right)^2\right]. (39)
Theorem 1

Under Assumption 1, for a linear estimation problem, given finite $k$, $n$ and a specific training set $D_n$, if $k\geq d$ and ${\rm MMSE}_{k,2,n}(\mathbf{z}|\mathbf{v})\neq{\rm MMSE}(\mathbf{z}|\mathbf{v})$, then

\varepsilon_{k,2,n,\delta} \leq \varepsilon_{k,3,n,\delta} \leq \ldots \leq \varepsilon_{k,l,n,\delta}. (40)
Proof 1

According to Assumption 1, for a 2-layer neural network, with probability at least $1-\delta$,

{\rm MMSE}_{k,2,n}(\mathbf{z}|\mathbf{v}) - \varepsilon_{k,2,n,\delta} < {\rm MMSE}(\mathbf{z}|\mathbf{v}). (41)

If ${\rm MMSE}_{k,2,n}(\mathbf{z}|\mathbf{v})\neq{\rm MMSE}(\mathbf{z}|\mathbf{v})$, the output of the trained 2-layer neural network $h_{k,2}^*(\mathbf{v})$ is not a sufficient statistic for estimating $\mathbf{z}$ from $\mathbf{v}$. By the data processing inequality, we have

I\left(\mathbf{z};h_{k,2}^*(\mathbf{v})\right) \geq I\left(\mathbf{z};h_{k,3}^*(\mathbf{v})\right) \geq \ldots \geq I\left(\mathbf{z};h_{k,l}^*(\mathbf{v})\right), (42)

and

\varepsilon_{k,2,n,\delta} \leq \varepsilon_{k,3,n,\delta} \leq \ldots \leq \varepsilon_{k,l,n,\delta}. (43)
(a) SNR = 0 dB
(b) SNR = 25 dB
Figure 4: The Frobenius norm of each layer of an AE-based communication system, $\|\mathbf{W}^{(h)}(k)\|_F$, versus epochs under a Gaussian channel.

IV-C Information Flow in Neural Networks

If the joint probability distribution $p(\mathbf{v},\mathbf{z})$ is known, the expected (population) risk $\mathcal{C}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L}_{\log}\right)$ can be written as

\mathbb{E}\left[\mathcal{L}_{\log}\left(\hat{\mathbf{z}},\mathbf{z};\bm{\Theta}_g\right)\right] = \sum_{\mathbf{v}\in\mathcal{V},\mathbf{z}\in\mathcal{Z}}p(\mathbf{v},\mathbf{z})\log\left(\frac{1}{Q(\mathbf{z}|\mathbf{v})}\right) = \sum_{\mathbf{v}\in\mathcal{V},\mathbf{z}\in\mathcal{Z}}p(\mathbf{v},\mathbf{z})\log\left(\frac{1}{p(\mathbf{z}|\mathbf{v})}\right) + \sum_{\mathbf{v}\in\mathcal{V},\mathbf{z}\in\mathcal{Z}}p(\mathbf{v},\mathbf{z})\log\left(\frac{p(\mathbf{z}|\mathbf{v})}{Q(\mathbf{z}|\mathbf{v})}\right) = H(\mathbf{z}|\mathbf{v}) + D_{\rm KL}\left(p(\mathbf{z}|\mathbf{v})\,||\,Q(\mathbf{z}|\mathbf{v})\right) \geq H(\mathbf{z}|\mathbf{v}) (44)

where $Q(\cdot|\mathbf{v})=g_{\bm{\Theta}_g}(\mathbf{v})\in p(\mathcal{Z})$ and $D_{\rm KL}\left(p(\mathbf{z}|\mathbf{v})\,||\,Q(\mathbf{z}|\mathbf{v})\right)$ denotes the Kullback-Leibler divergence between $p(\mathbf{z}|\mathbf{v})$ and $Q(\mathbf{z}|\mathbf{v})$ [29] (if $\mathbf{z}$ and $\mathbf{v}$ are continuous random variables, the sum becomes an integral when their PDFs exist). If and only if the decoder is given by the conditional posterior $g_{\bm{\Theta}_g}(\mathbf{v})=p(\mathbf{z}|\mathbf{v})$, the expected (population) risk reaches its minimum $\mathop{\min}\limits_{g_{\bm{\Theta}_g}}\mathcal{C}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L}_{\log}\right)=H(\mathbf{z}|\mathbf{v})$.

In physical layer communication, instead of perfectly knowing the channel-related joint probability distribution $p(\mathbf{v},\mathbf{z})$, we only have a set of $n$ i.i.d. samples $D_n:=\{(\mathbf{v}_i,\mathbf{z}_i)\}_{i=1}^n$ drawn from $p(\mathbf{v},\mathbf{z})$. In this case, the empirical risk is defined as

\hat{\mathcal{C}}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L},\mathcal{D}_n\right) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}\left[\mathbf{z}_i,g_{\bm{\Theta}_g}(\mathbf{v}_i)\right]. (45)

In practice, the $\mathcal{D}_n$ drawn from $p(\mathbf{v},\mathbf{z})$ is usually a finite set. This leads to a gap between the empirical and expected (population) risks, which can be defined as

{\rm gen}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L},\mathcal{D}_n\right) = \mathcal{C}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L}_{\log}\right) - \hat{\mathcal{C}}_{p(\mathbf{v},\mathbf{z})}\left(g_{\bm{\Theta}_g},\mathcal{L},\mathcal{D}_n\right). (46)

We can now preliminarily conclude that the DNN-based receiver is an estimator with minimum empirical risk for a given set $\mathcal{D}_n$, whereas its performance is inferior to the optimal estimator with minimum expected (population) risk under a given joint probability distribution $p(\mathbf{v},\mathbf{z})$.

Furthermore, it is necessary to quantitatively understand how information flows inside the decoder. Fig. 2(b) shows the graph representation of the decoder, where $\mathbf{t}_i$ and $\mathbf{t}'_i$ $(1\leq i\leq S)$ denote the $i$-th hidden layer representations counted from the input layer and the output layer, respectively. Usually, Shannon's entropy cannot be calculated directly since the exact joint probability distribution of two variables is difficult to acquire. Therefore, we use the method proposed in [46] to illustrate layer-wise mutual information with three kinds of information planes (IPs), where Shannon's entropy is estimated by a matrix-based functional of Rényi's $\alpha$-entropy [47]. The details are given in the Appendix.
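The matrix-based Rényi $\alpha$-entropy functional used here can be sketched as follows: a Gram matrix is built from layer activations with a Gaussian kernel, normalized by its trace, and the entropy is computed from its eigenvalues; joint entropies use the Hadamard product of the normalized Gram matrices. The kernel width and the choice $\alpha=1.01$ (to approximate Shannon's entropy) are illustrative assumptions; see [46], [47] for the exact formulation.

```python
import numpy as np

def renyi_entropy(X, alpha=1.01, sigma=1.0):
    """Matrix-based Renyi alpha-entropy (in bits) of the samples in the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))            # Gaussian kernel Gram matrix
    A = K / np.trace(K)                                 # normalize so that tr(A) = 1
    eigvals = np.clip(np.linalg.eigvalsh(A), 1e-12, None)
    return np.log2(np.sum(eigvals ** alpha)) / (1 - alpha)

def renyi_joint_entropy(X, Y, alpha=1.01, sigma=1.0):
    """Joint entropy via the Hadamard product of the two normalized Gram matrices."""
    def gram(Z):
        sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2 * sigma ** 2))
        return K / np.trace(K)
    A = gram(X) * gram(Y)
    A = A / np.trace(A)
    eigvals = np.clip(np.linalg.eigvalsh(A), 1e-12, None)
    return np.log2(np.sum(eigvals ** alpha)) / (1 - alpha)

def mutual_information(X, Y, alpha=1.01, sigma=1.0):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) with the matrix-based estimators above."""
    return (renyi_entropy(X, alpha, sigma) + renyi_entropy(Y, alpha, sigma)
            - renyi_joint_entropy(X, Y, alpha, sigma))

# Toy usage: mutual information between a layer input and its noisy copy
rng = np.random.default_rng(0)
t_in = rng.standard_normal((200, 8))
t_out = t_in + 0.5 * rng.standard_normal((200, 8))
print(mutual_information(t_in, t_out))
```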

V Simulation Results

In this section, we provide simulation results to illustrate the behaviour of DNN in physical layer communication.

V-A Constellation and Convergence of AE-based Communication System

V-A1 Gaussian Channel

Figure 5: Constellation produced by the AE for $d=2$ and $M=8$ under a Gaussian channel (SNR = 0 dB).
TABLE I: Layout of the autoencoder
Layer            Output dimensions
Input            M
Dense + ReLU     M
Dense + linear   d
Normalization    d
Channel          d
Dense + ReLU     M
Dense + softmax  M

Fig. 4(a) and Fig. 4(b) visualize the Frobenius norm of each layer of an AE-based communication system for $d=2$ and $M=8$ versus epochs under a Gaussian channel. The layout of the AE is provided in Table I. When SNR = 0 dB, the Frobenius norms of the layers at the transmitter side increase with the epoch, while those at the receiver side remain small and do not change significantly. In contrast to the low-SNR case, the Frobenius norms of all layers are close to convergence after $4\times 10^4$ epochs when SNR = 25 dB. This phenomenon can be explained by our analysis in Section III. If the SNR is low, large biases are introduced into the channel layer by the noise since $\mathbf{W'}^{(H_1+1)}=[\mathbf{I}_m,\mathbf{n}]$. This leads to the exploding gradient problem at the transmitter side. At the receiver side, the Frobenius norm of each layer tends to be small to counter the large noise, but this produces very little effect. From Fig. 5, it can be seen that all the signal points lie on a line, and two of them almost overlap. A number of simulations have been conducted, and the constellations generated by the AE with different initial parameters keep the same pattern: all constellation points lie on a straight line.

Figure 6: Comparisons of (a) optimum constellation obtained by gradient-search technique and (b) constellation produced by AEs for d = 2 and M = 8 or 16.
Figure 7: Comparisons of (a) optimum constellation obtained by gradient-search technique and (b) constellation produced by AEs for d = 3 and M = 16.
(a) Frobenius norm of each layer ‖𝐖^(h)(k)‖_F
(b) Constellation
Figure 8: The simulation results of the AE for d = 2 and M = 8 under Rayleigh flat fading channel (SNR = 25 dB).

Fig. 6(a) and Fig. 7(a) show the optimum constellations obtained by the gradient-search technique proposed in [37]. When d = 2 and 3, the algorithm was run for 1000 and 3000 steps, respectively. Several random initial constellations (initial points selected according to a uniform probability density over the unit disk) are used for each value of d and M. The step size is η = 2×10^-4. Numerous local optima are found, many of which are merely rotations or other symmetric modifications of each other. Fig. 6(b) and Fig. 7(b) show the constellations produced by AEs which have the same network structures and hyperparameters as the AEs in [7] (see also Table I). The AEs were trained for 10^6 epochs, each of which contains the M different symbols. Several simulations have been conducted, and it can be found that the AE does not guarantee finding the optimum constellation, whereas the gradient-search technique has a high probability of finding an optimum.
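The sketch below illustrates a gradient search of this kind: starting from points drawn uniformly from the unit disk (or ball), it descends a Chernoff-style pairwise bound on the symbol error probability and then re-centers and re-normalizes the points to an average-power constraint. The cost function and the projection step are our own simplifications for illustration, not the exact procedure of [37].

```python
import numpy as np

def optimize_constellation(M=8, d=2, steps=1000, eta=2e-4, sigma=0.5, seed=0):
    """Gradient search on a pairwise (Chernoff-style) error bound; a simplification of [37]."""
    rng = np.random.default_rng(seed)
    # Initial points drawn uniformly from the unit disk (d = 2) or unit ball.
    x = rng.normal(size=(M, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x *= rng.uniform(0.0, 1.0, size=(M, 1)) ** (1.0 / d)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(M):
            diff = x[i] - x                                   # pairwise differences
            dist = np.linalg.norm(diff, axis=1)
            dist[i] = np.inf                                  # exclude the point itself
            # Gradient of sum_j exp(-||x_i - x_j||^2 / (4 sigma^2)) w.r.t. x_i.
            w = np.exp(-dist ** 2 / (4.0 * sigma ** 2)) / (2.0 * sigma ** 2)
            grad[i] = -(w[:, None] * diff).sum(axis=0)
        x -= eta * grad                                       # descend the bound
        x -= x.mean(axis=0)                                   # keep the constellation zero-mean
        x /= np.sqrt(np.mean(np.sum(x ** 2, axis=1)))         # unit average power constraint
    return x

print(optimize_constellation(M=8, d=2))
```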

When d = 2, the two-dimensional constellations produced by the AEs have a similar pattern to the optimum constellations, which form an (almost) hexagonal lattice. Specifically, in the case of (d = 2, M = 8), one of the constellations found by the AE can be obtained by rotating the optimum constellation found by the gradient-search technique. In the case of (d = 2, M = 16), the constellation found by the AE is different from the optimum constellation, but it still forms a lattice of (almost) equilateral triangles. In the case of (d = 3, M = 16), one signal point of the optimum constellation found by the gradient-search technique is almost at the origin, while the other 15 signal points are almost on the surface of a sphere with radius P_av and centre 0. This pattern is similar to the surface of a truncated icosahedron, which is composed of pentagonal and hexagonal faces. However, the three-dimensional constellation produced by an AE is a local optimum formed by 16 signal points lying almost in a plane.

From the perspective of computational complexity, the cost of training an AE is significantly higher than that of the traditional algorithm. Specifically, an AE which has 4 dense layers with M, d, M and M neurons, respectively, needs to train (2M+1)(M+d)+2M parameters for 10^6 epochs, whereas the gradient-search algorithm only needs 2M trainable parameters for 10^3 steps.
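The parameter count can be verified directly: the snippet below tallies the weights and biases of the four dense layers in Table I and compares them with the M·d signal-point coordinates (2M when d = 2) adjusted by the gradient search.

```python
def ae_params(M, d):
    """Weights and biases of the four dense layers in Table I (M, d, M, M neurons)."""
    return (M * M + M) + (M * d + d) + (d * M + M) + (M * M + M)

for M, d in [(8, 2), (16, 2), (16, 3)]:
    assert ae_params(M, d) == (2 * M + 1) * (M + d) + 2 * M
    # The gradient search only adjusts the M*d signal-point coordinates (2M when d = 2).
    print(f"M={M}, d={d}: AE trains {ae_params(M, d)} parameters, "
          f"gradient search trains {M * d}")
```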

V-A2 Rayleigh Flat Fading Channel

In the case of the Rayleigh flat fading channel, a multiplicative perturbation is introduced since 𝐖′^(H_1+1) = [𝐇, 𝐧]. Fig. 8(a) illustrates that the exploding gradient problem occurs at the transmitter side under a Rayleigh flat fading channel with SNR = 25 dB, which is similar to the case of the low-SNR Gaussian channel. This means that fading prevents the AE from converging even when the noise is small. Fig. 8(b) illustrates the corresponding constellation produced by the AE. The receiver cannot distinguish all the transmitted signals correctly because of 4 overlapping points.

In summary, the structure of the AE-based communication system imposes strict requirements on the channel used to train the network. First, it demands a high SNR. Second, the AE cannot work properly over fading channels. These limitations impede the implementation of AE-based communication systems in practical wireless scenarios.

Figure 9: The MSE performance of LS estimator, LMMSE estimator, and single hidden layer estimator trained by different training sets versus SNRs.

V-B The Performance of DNN-based Estimation

We consider a common channel estimation problem for an OFDM system. Let 𝐳 ≜ [H[0], H[1], …, H[N_c−1]]^T denote the channel frequency response (CFR) vector of a channel, where N_c denotes the number of subcarriers of an OFDM symbol. For convenience, we denote the measurable variable as 𝐯 ≜ ẑ_LS, where ẑ_LS represents the LS estimate of 𝐳. Usually, it can be obtained by training symbol-based channel estimation. In practice, MMSE channel estimation is not chosen unless the covariance matrix of the fading channel is known.
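A minimal sketch of how the measurable variable 𝐯 = ẑ_LS can be generated is given below. The tap count, the pilot pattern (unit-modulus pilots on every subcarrier), and the SNR are our own assumptions for illustration and may differ from the simulation setup used later.

```python
import numpy as np

rng = np.random.default_rng(1)
Nc = 64                                   # number of subcarriers
L = 8                                     # number of channel taps (our assumption)

# Random multipath channel and its frequency response z = [H[0], ..., H[Nc-1]].
h = (rng.normal(size=L) + 1j * rng.normal(size=L)) / np.sqrt(2 * L)
z = np.fft.fft(h, Nc)

# Known unit-modulus training (pilot) symbols on every subcarrier, e.g. QPSK.
X = (1 + 1j) / np.sqrt(2) * np.ones(Nc)
snr_db = 10.0
sigma = np.sqrt(10.0 ** (-snr_db / 10.0) / 2.0)   # noise std per real dimension
Y = X * z + sigma * (rng.normal(size=Nc) + 1j * rng.normal(size=Nc))

# The measurable variable v = z_hat_LS = Y / X.
z_hat_ls = Y / X
print("LS MSE:", np.mean(np.abs(z_hat_ls - z) ** 2))
```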

V-B1 MSE Performance

Fig. 9 compares the MSEs of the LS, LMMSE, and single hidden layer estimators versus SNR. The size of the test set is 10^5, and the size of the neural network is d = 128. The MSEs of the LS and LMMSE estimators are used as benchmarks. The MSE performance of the single hidden layer neural network improves as the size of the training set increases. Compared with the LS estimator, the single hidden layer neural network has superior performance even when the training set contains only 100 samples. However, its performance is inferior to that of the LMMSE estimator regardless of the size of the training set.
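For illustration, the sketch below trains a single hidden layer network (topology 128-128-128, matching Table II) on a limited set of (ẑ_LS, 𝐳) pairs and reports its test MSE next to that of the raw LS estimate. The data model, ReLU activation, optimizer, and sample sizes are our own assumptions; the estimator used for Fig. 9 may differ (e.g., an ELM-style network as in [23], [26]).

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(2)
Nc, L, snr_db, hidden = 64, 8, 10.0, 128          # 2*Nc = 128 real inputs/outputs

def dataset(n):
    """(LS estimate, true CFR) pairs stacked as real vectors of length 2*Nc."""
    h = (rng.normal(size=(n, L)) + 1j * rng.normal(size=(n, L))) / np.sqrt(2 * L)
    z = np.fft.fft(h, Nc, axis=1)                 # true CFR, unit average power
    sigma = 10.0 ** (-snr_db / 20.0) / np.sqrt(2.0)
    v = z + sigma * (rng.normal(size=(n, Nc)) + 1j * rng.normal(size=(n, Nc)))
    to_real = lambda a: np.concatenate([a.real, a.imag], axis=1)
    return (torch.tensor(to_real(v), dtype=torch.float32),
            torch.tensor(to_real(z), dtype=torch.float32))

v_tr, z_tr = dataset(100)                         # limited training set
v_te, z_te = dataset(10_000)                      # test set

net = nn.Sequential(nn.Linear(2 * Nc, hidden), nn.ReLU(), nn.Linear(hidden, 2 * Nc))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    loss = nn.functional.mse_loss(net(v_tr), z_tr)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    print("LS MSE :", nn.functional.mse_loss(v_te, z_te).item())
    print("NN MSE :", nn.functional.mse_loss(net(v_te), z_te).item())
```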

Figure 10: The MSE performance of LS estimator, LMMSE estimator, and neural network-based estimators with different hidden layers versus SNRs.
Figure 11: The entropy S_α(𝐳|ẑ_LS) with respect to different values of SNR and N_c.

Fig. 10 compares the MSEs of neural network-based estimators with different numbers of hidden layers to those of the LS and LMMSE estimators. The sizes of these neural networks are the same, with d = 128. Given a limited set of training data, the MSE performance of the neural network-based estimator degrades as the number of hidden layers increases. This result coincides with the description in Theorem 40.

V-B2 Information Flow

According to (44), the minimum logarithmic expected (population) risk for this inference problem is H(𝐳|ẑ_LS), which can be estimated by the Rényi α-entropy S_α(𝐳|ẑ_LS) = S_α(𝐳, ẑ_LS) − S_α(ẑ_LS) with α = 1.01. Fig. 11 illustrates the entropy S_α(𝐳|ẑ_LS) with respect to different values of SNR and N_c. In a practical scenario, we use linear interpolation and the number of pilots is N_p = N_c/4. As can be seen, S_α(𝐳|ẑ_LS) monotonically decreases as the size of the training set increases. When n → ∞, S_α(𝐳|ẑ_LS) decreases slowly, because the joint probability distribution p(𝐳, ẑ_LS) can be learned almost perfectly and the empirical risk therefore approaches the expected risk. Interestingly, when n > 580, the lower the SNR or the larger the input dimension d, the smaller the n needed to obtain the same value of S_α(𝐳|ẑ_LS).

Figure 12: The three IPs and loss curve in a DNN-based OFDM channel estimator for N_c = 64 and S = 4.
Figure 13: The three IPs and loss curve in a linear SLFN-based OFDM channel estimator for N_c = 64 and S = 1.
TABLE II: Layout of the NN-based OFDM Channel Estimators
S | Input/Output dimension (2N_c) | Number of hidden layers (2S−1)
1 | 128 | 1
4 | 128 | 7

We analyze two NN-based OFDM channel estimators with different layouts (see Table II). Fig. 12(a), (b) and (c) illustrate the behavior of IP-I, IP-II and IP-III in a DNN-based OFDM channel estimator with topology "128-64-32-16-8-16-32-64-128", where the linear activation function is considered and each training sample is constructed by concatenating the real and imaginary parts of the complex channel vectors. The batch size is 100 and the learning rate is η = 0.001. Note that in our preliminary work [23], we found that a linear learning model may achieve better performance than other models, and therefore the linear activation function is chosen in this paper. We use V and V′ to denote the input and output of the decoder, respectively. The number of iterations is indicated by a color bar. From IP-I, it can be seen that the final value of the mutual information I(T; V′) in each layer tends to be equal to the final value of I(T; V), which means that the information from V has been learnt and transferred to V′ by each layer. In IP-II, I(T′; V′) < I(T; V) in each layer, which implies that none of the layers is overfitting. The tendency of I(T; V) to approach the value of I(T′; V) can be observed from IP-III. Finally, from all the IPs, it is easy to notice that the mutual information does not change significantly once the number of iterations exceeds 200. Meanwhile, according to Fig. 12(d), the MSE reaches a very low value and also does not change sharply. This means that 200 iterations are enough for the task of 64-subcarrier channel estimation using a DNN with the above-mentioned topology.

In [26], a single hidden layer feedforward neural network (SLFN)-based channel estimation and equalization scheme shows outstanding performance compared with the DNN-based scheme. However, it is still unknown whether an SLFN can completely learn the channel structural information from the training set and then transfer the information from the input layer to the output layer. Fig. 13(a), (b) and (c) illustrate the behavior of IP-I, IP-II and IP-III in an SLFN-based OFDM channel estimator with topology "128-128-128", where the other hyperparameters are the same as those of the DNN with S = 4. In this case, IP-I and IP-II are entirely identical since S = 1. From IP-I, when the number of iterations is larger than 50, it can be seen that I(T; V′) in the hidden layer tends to be equal to I(T; V). The final value of I(T; V) is approximately 3.5, which is nearly the same as the final value for S = 4. Correspondingly, the MSE does not change significantly. Furthermore, comparing Fig. 13(d) to Fig. 12(d), the MSE decreases more rapidly and smoothly. These results indicate that the SLFN with 128 hidden neurons is able to learn the same information from the training set with N_c = 64, and that its learning speed and quality are better than those of the DNN with S = 4.

VI Conclusion

In this paper, we propose a framework to understand the behavior of DNNs in physical layer communication. We find that a DNN-based transmitter essentially tries to produce a good representation of the information source. In terms of convergence, the AE has specific requirements on the wireless channel, i.e., the channel should be AWGN with high SNR. We then quantitatively analyze the MSE performance of neural network-based estimators and the information flow in neural network-based communication systems. The analysis reveals that, in the practical scenario, i.e., given limited training samples, a neural network with deeper layers may have inferior MSE performance compared with a shallow one. For the task of inference (e.g., channel estimation), we verify that the decoder can learn the information from a training set, and that a shallow neural network with a single hidden layer has advantages in learning speed and quality compared with the DNN.

We believe that this framework has potential for the design of DNN-based physical layer communication. Specifically, the theoretical analysis shows that a neural network-based communication system with an end-to-end structure places high requirements on the channel, so its practical application range is limited. Therefore, it is more suitable to deploy neural networks at the receiver. Under the condition of limited training samples, a neural network with a single hidden layer can achieve the optimal MSE estimation performance, and therefore an SLFN can be deployed in a receiver for the task of channel estimation. Furthermore, the size of the training set and the dimension of a DNN can be determined by the proposed framework.

In the future, the limitations of DNNs under fading channels should be addressed. It would be interesting to use deep reinforcement learning techniques to design waveform parameters for a transmitter instead of entirely replacing it with a DNN.

Appendix A Matrix-based Functional of Rényi’s α\alpha-Entropy

For a random variable X over a set 𝒳, its Rényi entropy of order α is defined as

{H_{\alpha}}\left(X\right)=\frac{1}{1-\alpha}\log\int_{\cal X}{f^{\alpha}}\left(x\right)dx \qquad (47)

where f(x) is the PDF of the random variable X. Let {x^(i)}_{i=1}^{n} be an i.i.d. sample of n realizations of X. The Gram matrix 𝐊 can be defined as 𝐊[i, j] = κ(x_i, x_j), where κ: 𝒳×𝒳 → ℝ is a real-valued positive definite and infinitely divisible kernel. Then, a matrix-based analogue of Rényi's α-entropy for a normalized positive definite matrix 𝐀 of size n×n with trace 1 can be given by the functional

{S_{\alpha}}\left({\bf{A}}\right)=\frac{1}{1-\alpha}\log_{2}\left[\sum\limits_{i=1}^{n}\lambda_{i}\left({\bf{A}}\right)^{\alpha}\right] \qquad (48)

where λ_i(𝐀) denotes the i-th eigenvalue of 𝐀, a normalized version of 𝐊:

{\bf{A}}\left[{i,j}\right]=\frac{1}{n}\frac{{\bf{K}}\left[{i,j}\right]}{\sqrt{{\bf{K}}\left[{i,i}\right]{\bf{K}}\left[{j,j}\right]}}. \qquad (49)

Now, the joint entropy can be defined as

{S_{\alpha}}\left({{\bf{A}},{\bf{B}}}\right)={S_{\alpha}}\left[\frac{{\bf{A}}\odot{\bf{B}}}{{\rm{tr}}\left({{\bf{A}}\odot{\bf{B}}}\right)}\right]. \qquad (50)

Finally, the matrix notion of Rényi’s mutual information can be defined as

{I_{\alpha}}\left({{\bf{A}};{\bf{B}}}\right)={S_{\alpha}}\left({\bf{A}}\right)+{S_{\alpha}}\left({\bf{B}}\right)-{S_{\alpha}}\left({{\bf{A}},{\bf{B}}}\right). \qquad (51)
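A compact NumPy sketch of the matrix-based functional in (48)–(51) is given below, using a Gaussian kernel (an infinitely divisible kernel whose width is our own choice). It also shows how the conditional entropy S_α(𝐳|ẑ_LS) = S_α(𝐳, ẑ_LS) − S_α(ẑ_LS) used in Section V-B2 is formed from these quantities, here on synthetic stand-in data.

```python
import numpy as np

def gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    x = np.atleast_2d(x)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def normalize(K):
    """Eq. (49): A[i, j] = K[i, j] / (n * sqrt(K[i, i] K[j, j]))."""
    d = np.sqrt(np.diag(K))
    return K / (len(K) * np.outer(d, d))

def renyi_entropy(A, alpha=1.01):
    """Eq. (48): matrix-based Renyi alpha-entropy of a normalized Gram matrix A."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(A, B, alpha=1.01):
    """Eq. (50): entropy of the normalized Hadamard product of A and B."""
    C = A * B
    return renyi_entropy(C / np.trace(C), alpha)

def mutual_information(A, B, alpha=1.01):
    """Eq. (51): I_alpha(A; B) = S_alpha(A) + S_alpha(B) - S_alpha(A, B)."""
    return (renyi_entropy(A, alpha) + renyi_entropy(B, alpha)
            - joint_entropy(A, B, alpha))

# Example: conditional entropy S_alpha(z | v) = S_alpha(z, v) - S_alpha(v), as in
# Section V-B2, computed on synthetic samples (the kernel width is our own choice).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 4))                  # stand-in for the true CFR samples
v = z + 0.3 * rng.normal(size=(200, 4))        # stand-in for the LS estimates
A_z, A_v = normalize(gram(z)), normalize(gram(v))
print("S(z|v) =", joint_entropy(A_z, A_v) - renyi_entropy(A_v))
print("I(z;v) =", mutual_information(A_z, A_v))
```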

References

  • [1] J. Liu, H. Zhao, D. Ma, K. Mei, and J. Wei, “Opening the black box of deep neural networks in physical layer communication,” in 2022 IEEE Wireless Communications and Networking Conference (WCNC), 2022, pp. 435–440.
  • [2] C. E. Shannon, “A mathematical theory of communication,” The Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [3] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland et al., “Improved protein structure prediction using potentials from deep learning,” Nature, vol. 577, no. 7792, pp. 706–710, 2020.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [5] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • [6] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349, no. 6245, pp. 261–266, 2015.
  • [7] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
  • [8] T. J. O’Shea, T. Roy, N. West, and B. C. Hilburn, “Physical layer communications system design over-the-air using adversarial networks,” in 2018 26th European Signal Processing Conference (EUSIPCO).   IEEE, 2018, pp. 529–532.
  • [9] B. Zhu, J. Wang, L. He, and J. Song, “Joint transceiver optimization for wireless communication PHY using neural network,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1364–1373, 2019.
  • [10] M. E. Morocho-Cayamcela, J. N. Njoku, J. Park, and W. Lim, “Learning to communicate with autoencoders: Rethinking wireless systems with deep learning,” in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2020, pp. 308–311.
  • [11] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. T. Brink, “OFDM-autoencoder for end-to-end learning of communications systems,” in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).   IEEE, 2018, pp. 1–5.
  • [12] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Physical layer deep learning of encodings for the MIMO fading channel,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017, pp. 76–80.
  • [13] J. Liu, K. Mei, X. Zhang, D. McLernon, D. Ma, J. Wei, and S. A. R. Zaidi, “Fine timing and frequency synchronization for MIMO-OFDM: An extreme learning approach,” IEEE Transactions on Cognitive Communications and Networking, pp. 1–1, 2021.
  • [14] F. A. Aoudia and J. Hoydis, “Model-free training of end-to-end communication systems,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 11, pp. 2503–2516, 2019.
  • [15] V. Raj and S. Kalyani, “Backpropagating through the air: Deep learning at physical layer without channel models,” IEEE Communications Letters, vol. 22, no. 11, pp. 2278–2281, 2018.
  • [16] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Communications, vol. 26, no. 2, pp. 93–99, 2019.
  • [17] C.-K. Wen, W.-T. Shih, and S. Jin, “Deep learning for massive MIMO CSI feedback,” IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 748–751, 2018.
  • [18] T. Wang, C.-K. Wen, S. Jin, and G. Y. Li, “Deep learning-based CSI feedback approach for time-varying massive MIMO channels,” IEEE Wireless Communications Letters, vol. 8, no. 2, pp. 416–419, 2018.
  • [19] L. Li, H. Chen, H.-H. Chang, and L. Liu, “Deep residual learning meets OFDM channel estimation,” IEEE Wireless Communications Letters, vol. 9, no. 5, pp. 615–618, 2019.
  • [20] J. Zhang, Y. Cao, G. Han, and X. Fu, “Deep neural network-based underwater OFDM receiver,” IET communications, vol. 13, no. 13, pp. 1998–2002, 2019.
  • [21] Y. Yang, F. Gao, X. Ma, and S. Zhang, “Deep learning-based channel estimation for doubly selective fading channels,” IEEE Access, vol. 7, pp. 36 579–36 589, 2019.
  • [22] E. Balevi and J. G. Andrews, “One-bit OFDM receivers via deep learning,” IEEE Transactions on Communications, vol. 67, no. 6, pp. 4326–4336, 2019.
  • [23] K. Mei, J. Liu, X. Zhang, K. Cao, N. Rajatheva, and J. Wei, “A low complexity learning-based channel estimation for OFDM systems with online training,” IEEE Transactions on Communications, vol. 69, no. 10, pp. 6722–6733, 2021.
  • [24] L. V. Nguyen, A. L. Swindlehurst, and D. H. N. Nguyen, “Linear and deep neural network-based receivers for massive MIMO systems with one-bit ADCs,” IEEE Transactions on Wireless Communications, vol. 20, no. 11, pp. 7333–7345, 2021.
  • [25] H. He, S. Jin, C.-K. Wen, F. Gao, G. Y. Li, and Z. Xu, “Model-driven deep learning for physical layer communications,” IEEE Wireless Communications, vol. 26, no. 5, pp. 77–83, 2019.
  • [26] J. Liu, K. Mei, X. Zhang, D. Ma, and J. Wei, “Online extreme learning machine-based channel estimation and equalization for OFDM systems,” IEEE Communications Letters, vol. 23, no. 7, pp. 1276–1279, 2019.
  • [27] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [28] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW), 2015, pp. 1–5.
  • [29] A. Zaidi, I. Estella-Aguerri, and S. Shamai (Shitz), “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, 2020.
  • [30] I. E. Aguerri and A. Zaidi, “Distributed variational representation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 120–138, 2021.
  • [31] H. Xu, T. Yang, G. Caire, and S. Shamai (Shitz), “Information bottleneck for a Rayleigh fading MIMO channel with an oblivious relay,” Information, vol. 12, no. 4, 2021. [Online]. Available: https://www.mdpi.com/2078-2489/12/4/155
  • [32] F. Liao, S. Wei, and S. Zou, “Deep learning methods in communication systems: A review,” in Journal of Physics: Conference Series, vol. 1617, no. 1.   IOP Publishing, 2020, p. 012024.
  • [33] B. Sklar and P. K. Ray, Digital Communications Fundamentals and Applications.   Pearson Education, 2014.
  • [34] S. Yu, M. Emigh, E. Santana, and J. C. Príncipe, “Autoencoders trained with relevant information: Blending Shannon and Wiener’s perspectives,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 6115–6119.
  • [35] J. Boutros, E. Viterbo, C. Rastello, and J.-C. Belfiore, “Good lattice constellations for both Rayleigh fading and Gaussian channels,” IEEE Transactions on Information Theory, vol. 42, no. 2, pp. 502–518, 1996.
  • [36] G. C. Jorge, A. A. de Andrade, S. I. Costa, and J. E. Strapasson, “Algebraic constructions of densest lattices,” Journal of Algebra, vol. 429, pp. 218–235, 2015.
  • [37] G. Foschini, R. Gitlin, and S. Weinstein, “Optimization of two-dimensional signal constellations in the presence of Gaussian noise,” IEEE Transactions on Communications, vol. 22, no. 1, pp. 28–38, 1974.
  • [38] C. Rudin and J. Radin, “Why are we using black box models in AI when we don’t need to? A lesson from an explainable AI competition,” Harvard Data Science Review, vol. 1, no. 2, 11 2019, https://hdsr.mitpress.mit.edu/pub/f9kuryi8. [Online]. Available: https://hdsr.mitpress.mit.edu/pub/f9kuryi8
  • [39] S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.
  • [40] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   PMLR, 09–15 Jun 2019, pp. 1675–1685. [Online]. Available: https://proceedings.mlr.press/v97/du19c.html
  • [41] J.-J. van de Beek, O. Edfors, M. Sandell, S. Wilson, and P. Borjesson, “On channel estimation in OFDM systems,” in 1995 IEEE 45th Vehicular Technology Conference. Countdown to the Wireless Twenty-First Century, vol. 2, 1995, pp. 815–819 vol.2.
  • [42] O. Edfors, M. Sandell, J.-J. van de Beek, S. Wilson, and P. Borjesson, “OFDM channel estimation by singular value decomposition,” IEEE Transactions on Communications, vol. 46, no. 7, pp. 931–939, 1998.
  • [43] K. Mei, J. Liu, X. Zhang, N. Rajatheva, and J. Wei, “Performance analysis on machine learning-based channel estimation,” IEEE Transactions on Communications, vol. 69, no. 8, pp. 5183–5193, 2021.
  • [44] G. D. Forney Jr, “Shannon meets Wiener II: On MMSE estimation in successive decoding schemes,” arXiv preprint cs/0409011, 2004.
  • [45] M. Diaz, P. Kairouz, and L. Sankar, “Lower bounds for the minimum mean-square error via neural network-based estimation,” arXiv preprint arXiv:2108.12851, 2021.
  • [46] S. Yu and J. C. Príncipe, “Understanding autoencoders with information theoretic concepts,” Neural Networks, vol. 117, pp. 104–123, 2019.
  • [47] L. G. S. Giraldo, M. Rao, and J. C. Príncipe, “Measures of entropy from data using infinitely divisible kernels,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2014.
Jun Liu received the B.S. degree in optical information science and technology from the South China University of Technology (SCUT), Guangzhou, China, in 2015, and the M.E. degree in communications and information engineering from the National University of Defense Technology (NUDT), Changsha, China, in 2017, where he is currently pursuing the Ph.D. degree with the Department of Cognitive Communications. He was a visiting Ph.D. student with the University of Leeds from 2019 to 2020. His current research interests include machine learning with a focus on shallow neural network applications, signal processing for broadband wireless communication systems, multiple antenna techniques, and wireless channel modeling.
Haitao Zhao (Senior Member, IEEE) received his B.E., M.Sc. and Ph.D. degrees all from the National University of Defense Technology (NUDT), P. R. China, in 2002, 2004 and 2009, respectively. He is currently a professor in the Department of Cognitive Communications, College of Electronic Science and Technology at NUDT. Prior to this, he visited the Institute of ECIT, Queen's University of Belfast, UK and Hong Kong Baptist University. His main research interests include wireless communications, cognitive radio networks and self-organized networks. He has served as a TPC member of IEEE ICC'14-22, Globecom'16-22, and guest editor for IEEE Communications Magazine. He is a senior member of IEEE.
Dongtang Ma (SM'13) received the B.S. degree in applied physics and the M.S. and Ph.D. degrees in information and communication engineering from the National University of Defense Technology (NUDT), Changsha, China, in 1990, 1997, and 2004, respectively. From 2004 to 2009, he was an Associate Professor with the College of Electronic Science and Engineering, NUDT. Since 2009, he has been a professor with the Department of Cognitive Communication, School of Electronic Science and Engineering, NUDT. From Aug. 2012 to Feb. 2013, he was a visiting professor at the University of Surrey, UK. His research interests include wireless communication and networks, physical layer security, and intelligent communication and networks. He has published more than 150 journal and conference papers. He is one of the Executive Directors of the Hunan Electronic Institute. He served as a TPC member of PIMRC from 2012 to 2020.
Kai Mei received the master's degree from the National University of Defense Technology, in 2017, where he is currently pursuing the Ph.D. degree. His research interests include synchronization and channel estimation in OFDM systems and MIMO-OFDM systems, and machine learning applications in wireless communications.
Jibo Wei (Member, IEEE) received the B.S. and M.S. degrees from the National University of Defense Technology (NUDT), Changsha, China, in 1989 and 1992, respectively, and the Ph.D. degree from Southeast University, Nanjing, China, in 1998, all in electronic engineering. He is currently the Director and a Professor of the Department of Communication Engineering, NUDT. His research interests include wireless network protocols and signal processing in communications, more specifically, the areas of MIMO, multicarrier transmission, cooperative communication, and cognitive networks. He is a member of the IEEE Communication Society and also a member of the IEEE VTS. He also works as one of the editors of the Journal on Communications and is a Senior Member of the China Institute of Communications and Electronics.