Deep Learning for Spectrum Prediction in Cognitive Radio Networks: State-of-the-Art, New Opportunities, and Challenges

Guangliang Pan, David K. Y. Yau,
Bo Zhou, and Qihui Wu This work was supported in part by the National Natural Science Foundation of China under Grants 62231015 and 62201255. (Corresponding authors: David K. Y. Yau, Qihui Wu)G. Pan, B. Zhou, and Q. Wu are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China (e-mail: {glpan2020, b.zhou, wuqihui}@nuaa.edu.cn).D. Yau is with the Pillar of Information Systems Technology and Design, Singapore University of Technology and Design, Singapore 487372 (e-mail: david_yau@sutd.edu.sg).

Abstract

Spectrum prediction is considered to be a promising technology that enhances spectrum efficiency by assisting dynamic spectrum access (DSA) in cognitive radio networks (CRN). Nonetheless, the highly nonlinear nature of spectrum data across time, frequency, and space domains, coupled with the intricate spectrum usage patterns, poses challenges for accurate spectrum prediction. Deep learning (DL), recognized for its capacity to extract nonlinear features, has been applied to solve these challenges. This paper first shows the advantages of applying DL by comparing with traditional prediction methods. Then, the current state-of-the-art DL-based spectrum prediction techniques are reviewed and summarized in terms of intra-band and cross-band prediction. Notably, this paper uses a real-world spectrum dataset to prove the advancements of DL-based methods. Then, this paper proposes a novel intra-band spatiotemporal spectrum prediction framework named ViTransLSTM. This framework integrates visual self-attention and long short-term memory to capture both local and global long-term spatiotemporal dependencies of spectrum usage patterns. Similarly, the effectiveness of the proposed framework is validated on the aforementioned real-world dataset. Finally, the paper presents new related challenges and potential opportunities for future research.

Index Terms:

Cognitive radio networks, spectrum prediction, deep learning, deep transfer learning, ViTransLSTM.

I Introduction

The wide application of 5G mobile communication and Internet of Things (IoT) network technology has made the number of wireless devices grow by hundreds of millions every year, which puts a huge load on the already scarce radio spectrum resources. The International Telecommunication Union (ITU) allocates spectrum resources to these wireless devices in a static manner, restricting their communication to licensed frequency bands. Consequently, licensed bands experiencing an inundation of wireless devices face congestion issues and degradation in the quality of service [1]. In contrast, other licensed bands, such as those designated for broadcasting TVs or analogue cellular telephony, possess abundant available spectrum that is regrettably underutilized.

To address these problems, cognitive radio network (CRN)-based dynamic spectrum access (DSA) is considered an effective solution. It enables wireless devices, referred to as secondary users (SUs), to opportunistically access the unused licensed spectrum as long as harmful interference to primary users (PUs) is limited. This access method can achieve better spectrum efficiency and higher system capacity, thereby alleviating the shortage of spectrum resources. The key step in DSA is to accurately obtain the spectrum state by spectrum sensing in cognitive radio (CR). Spectrum sensing techniques in various frequency bands include energy detection, matched filtering ( $<$ 1 GHz), feature detection (sub-6 GHz), compressed sensing, and machine learning (millimeter wave and terahertz) techniques. However, practical hardware constraints limit spectrum sensing, including factors such as limited time delay, energy constraints, and sensing scopes. Fortunately, spectrum prediction can obtain unknown spectrum occupancy information in advance by capturing potential correlation patterns in the measured spectrum data to help CUs quickly access idle bands [2]. Herein, spectrum prediction focuses on providing future spectrum state, while spectrum sensing emphasizes obtaining the current state. Therefore, spectrum prediction has emerged as a research hotspot.

However, spectrum prediction has some tricky challenges. This paper divides the existing work into intra-band prediction and cross-band prediction. In intra-band prediction, spectrum measurement is highly nonlinear due to diverse wireless devices, varied spectrum services, and influences from both the external environment and internal device interference. This nonlinearity spans time, frequency, and space, making learning multidimensional nonlinear features challenging. Cross-band prediction is the prediction of the future spectrum of the target band by using a few spectrum measurements in the target band (typically caused by spectrum security, device deployment, and hardware damage, etc) and abundant spectrum measurements in relevant bands. Effectively leveraging the substantial data from relevant bands and the limited data from the target band to achieve cross-band prediction poses a formidable challenge.

Recently, inspired by the stunning breakthroughs that deep learning (DL) has achieved in computer vision and natural language processing, DL has also been harnessed in spectrum prediction [3, 4, 5]. DL can directly extract nonlinear usage patterns from spectrum measurements and integrate various types of networks such as convolutional neural networks (CNN) and recurrent neural networks (RNN) to capture multidimensional correlations (see [5] for specific correlations). In cross-band prediction, DL enhances the performance of target band prediction by increasing the number of samples in the target band using generative adversarial networks (GAN) and transferring useful knowledge from related bands.

Based on these motivations, this paper provides a thorough overview of DL-based spectrum prediction in CRN. Specifically, we first demonstrate the advantages of applying DL by conducting a comparative analysis with traditional spectrum prediction methods. Furthermore, we review existing works and summarize them into two routes: intra-band prediction [3, 4, 5, 6, 7, 8] and cross-band prediction [9, 10, 11, 12], with detailed information available in Table II. Secondly, to enhance spatiotemporal spectrum prediction performance, we propose a novel spatiotemporal spectrum prediction framework named ViTransLSTM by combining visual self-attention and long short-term memory (LSTM). This framework introduces a shifted-window-based visual self-attention mechanism into the LSTM gate structure and memory cells, enabling it to capture both local and global spatiotemporal dependencies by collaboratively accumulating memory with each gate structure. We validate the effectiveness of the proposed framework on a real-world spectrum dataset. Finally, we provide future challenges related to spectrum prediction in real-world wireless systems and provide DL-based research directions for these challenges. The basic concepts and technical terms mentioned in this paper are summarized in Table I.

Refer to caption — Figure 1: System model.

II Why DL for Spectrum Prediction?

As shown in Fig. 1, we consider a CRN consisting of several PUs and SUs. The spectrum sensors (SSs), which are sparsely distributed in the region of interest (RoI), transmit the received signal strength (RSS) over the Internet to the data storage center (DSC). Referencing the upper left corner of Fig. 1, spectrum measurement can be divided into time, frequency, and space domains. DL enables spectrum prediction by learning the underlying temporal/spectral/spatial correlations in historical spectrum measurement. Below, we give the reasons for using DL by comparing it with traditional methods.

AR Model-Based Methods. Influenced by user behavior and radio equipment activity, latent time patterns exist in the time dimension of spectrum, encompassing periodicity and trends. To explore these patterns, AR and moving average (MA) models use parameter estimation to analyze the influence of past spectrum series on the current moment, aiming to better capture temporal patterns for predicting future spectrum states [2]. The ARMA model, which combines AR and MA, is then used to further enhance analytical capabilities. Subsequently, a more advanced AR integrated MA (ARIMA) model is proposed to eliminate trends or seasonal effects by introducing differential operations [2]. However, these models assume that the relationship between the past and future values of the spectrum series is linear. In real-world scenarios, many spectrum series exhibit nonlinear behavior, and these models may struggle to capture such patterns. In contrast, DL does not require any prior assumptions and can automatically extract complex nonlinear patterns, thereby achieving accurate spectrum prediction.

Traditional Machine Learning (ML)-Based Methods. Traditional ML methods, including support vector machine (SVM), hidden Markov model (HMM), and Bayesian inference have been utilized for spectrum prediction [2]. Authors of [2] introduce that SVM predicts spectrum mobility based on design features, HMM predicts spectrum occupancy with state transition matrices, and Bayesian inference estimates channel quality using a posterior distribution. However, SVM usually requires manual selection and crafting of features. HMM comes with limitations in terms of context modeling, relying solely on previous observations and struggling to capture long-term dependencies. Bayesian methods face constraints due to their reliance on specific probabilistic distributions, limited access to prior knowledge, and sensitivity to the selection of prior distributions. Moreover, these models are only used to capture features in the temporal dimension of the spectrum, ignoring the frequency and spatial dimensions. In contrast, DL takes an end-to-end approach, eliminating the need for manual feature engineering. DL-models are highly flexible and can adapt to a wide range of tasks without substantial modifications. Moreover, DL-models can capture long-term spectrum patterns to support accurate prediction.

III Learning Nonlinearities for Intra-Band Spectrum Prediction

III-A Time-Frequency Spectrum Prediction

Various frequency bands are allocated for diverse spectrum services, encompassing applications such as broadcasting TV, GSM900 (both uplink and downlink), ISM, GSM1800 (both uplink and downlink), and similar services. The interplay of factors such as the number of spectrum devices, user mobility, and patterns of usage imparts unique temporal and spectral correlations to each spectrum service. The intricacies of this multifaceted time-frequency correlation are notably nonlinear, presenting a challenge for conventional methodologies. To overcome this challenge, the application of DL-models becomes pertinent. These models prove effective in capturing the nuanced time-frequency correlations by directly assimilating spectrum correlations from measurements across distinct frequency bands, subsequently encapsulating this information in a parameterized format within the neural network architecture.

TABLE I: Introduction to Basic Concepts and Technical Terminology.

Concepts and Terms	Introduction
CRN	An intelligent wireless communication system that dynamically adapts its operations to the surrounding environment by sensing and utilizing available spectrum resources efficiently.
Intra-band Prediction	It involves inferring the future spectrum evolution of a given band based on its past spectrum measurements over a certain period.
Cross-band Prediction	It involves inferring the future spectrum evolution of a related but different band using the past spectrum measurements of both the given band and the related band.
Deep Learning	A subset of machine learning that utilizes neural networks with multiple layers to automatically extract and learn complex patterns and representations from large amounts of data.
Temporal patterns	The potential trends and cycles in spectrum data over time (such as proximity and diurnal cycles).
Tensor Completion	It estimates the missing or unobserved entries in a partially observed tensor (a multi-dimensional array).
Deep TL	It combines DL with TL techniques to leverage pre-trained neural networks for efficiently adapting knowledge from one domain to improve performance in a different but related domain.
Multimodal Learning	An approach that integrates and processes information from multiple data modalities, such as text, images, and audio, to achieve a more comprehensive understanding and enhanced predictive performance.
Knowledge Drift	The phenomenon where the relevance or applicability of knowledge transferred from a source domain diminishes or becomes inconsistent when applied to a dynamically changing target domain.

As an attempt, LSTM was employed to learn the temporal correlations of multiple spectrum channels. Gao et al. [3] utilized LSTM and attention mechanisms to design a sequence-to-sequence model for achieving multi-channel, multi-step spectrum prediction. Supervised learning is adopted for the model optimization. Once the model is adequately trained, supervised learning transitions to the prediction phase. In this stage, only a specific length of historical spectrum data is input into the prediction model to predict the future spectrum state without any prior information. To further improve predictive performance, Yu et al. [4] combined CNN and gated recurrent unit (GRU) (a variant of the LSTM) networks to design a model called DCG for spectrum availability prediction. Recognizing the local correlations among different channels within the same time slot and regional correlations across multiple time slots, Yu et al. employed one-dimensional convolution to delve into the occupancy patterns of local channels and two-dimensional convolution to explore the occupancy patterns of regional channels. Subsequently, GRU was leveraged to encapsulate both short-term and long-term temporal dependencies within the data processed by the dual CNN.

Given the extensive range of frequency bands utilized by wireless devices, the amalgamation of vast spectrum data inherently results in high-dimensional data. Simultaneously, there is a heightened demand for the model’s proficiency in extracting features. To tackle this challenge, Pan et al. [5] initially employed stacked autoencoders (SAE) to diminish the dimensionality of the high-dimensional spectrum data while automatically extracting features without disrupting its internal temporal and frequency relationships. The extracted features were then fed into a CNN and bidirectional LSTM (Bi-LSTM) fusion network to grasp the frequency and time dependencies. However, due to the constraints of the Bi-LSTM gating structure, it demonstrated excellent short-term predictive performance but relatively weaker long-term predictive performance. To surmount this limitation, Pan et al. [6], drawing inspiration from the Transformer architecture, devised a long-term spectrum prediction method named Autoformer-CSA. Autoformer-CSA employs a self-attention mechanism, integrating series spatial and channel attention modules with autocorrelation mechanisms. This enables the model to allocate varying degrees of attention to different positions in the spectrum sequence, effectively capturing the long-term trends and seasonal characteristics of spectrum data.

Fig. 2 presents a comparison of the root mean square error (RMSE) performance between the DL-based approach and the traditional HMM prediction method across various prediction ranges. The dataset utilized for these comparative experiments is derived from real-world spectrum data collected by sensors deployed in the city center of Madrid, Spain (obtained via the Electrosense open API, https://electrosense.org/). The scanned frequency range spans 600-640 MHz with a 2 MHz sampling interval. Data collection took place from June 1, 2021, to June 8, 2021, with a sampling interval of 1 mins. The division of the training set, validation set, and test set followed a 5:1:1 ratio. Fig. 2 shows that the RMSE errors of the DL-based prediction methods are significantly lower than those of the traditional HMM-based prediction method. For instance, at a prediction range of 96 mins, LSTM with attention, DCG, and Autoformer-CSA exhibit 14.66%, 30.58%, and 39.77% higher RMSE performance, respectively, compared to HMM.

TABLE II: DL for radio spectrum prediction: state-of-the-art.

Learning Nonlinearities for Intra-Band Spectrum Prediction
Research routes	Refs.	Learning methods	DL models	Key features
Time-frequency spectrum prediction	[3]	Supervised learning	LSTM with Attention	Learn multi-channel temporal correlations
	[4]	Supervised learning	CNN-GRU	Learn temporal-spectral correlations
	[5]	Unsupervised learning and supervised learning	SAE-Bi-LSTM	Dimensionality reduction, automatic feature extraction, and temporal-spectral correlation learning
	[6]	Supervised learning	Transformer	Learn the long-term dependence of spectrum series
Time-frequency-space spectrum prediction	[7]	Supervised learning	PredRNN	Learn temporal-spectral-spatial correlations
	[8]	Supervised learning	CNN-ResNet	Use the ResNet to reduce gradient disappearance and learn spatio-temporal correlations
Learning Enhancement for Cross-Band Spectrum Prediction
Research routes	Refs.	Learning methods	DL models	Key features
DGAN for cross-band spectrum prediction	[9]	Unsupervised learning and supervised learning	GAN	Improve cross-band prediction performance with data enhancement
	[10]	Unsupervised learning and transfer learning	DCGAN	Improve cross-band prediction performance with data enhancement and fine-tuning of model parameters
DTL for cross-band spectrum prediction	[11]	Transfer learning	DTL	Use naive parameter transfer to improve the performance of target band prediction
	[12]	Transfer learning	DTL	Use weighted parameter transfer to improve the performance of target band prediction

III-B Time-Frequency-Space Spectrum Prediction

Unlike traditional time-frequency prediction, DL-based spatiotemporal spectrum prediction faces two key challenges: reconstructing spectrum map data for large RoIs and capturing heterogeneous time-frequency-space correlations.

To overcome these challenges, Li et al. [7] initially delved into historical data collected from a network of sparsely distributed spectrum sensors. They employed an inverse-distance spatial interpolation method to extrapolate the spectrum map across the entire RoI. They utilized predictive recurrent neural network (PredRNN) to discern the spatial-temporal correlations embedded in the spectrum map. Employing three identical PredRNN components, they modeled the temporal proximity, daily cycle, and weekly trend of the spectrum image. Ultimately, Li et al. used parameter matrices to integrate the features extracted from these three components. This refined approach to feature extraction has proven effective in significantly enhancing the model’s predictive performance.

According to the path loss model in [8], it’s clear that wireless signals follow exponential decay with distance. As the average received power at nearby locations tends to be similar, there’s a higher spatial correlation. In simpler terms, only sensors close to the unsensed area can provide signal power information that has a certain correlation with the data in the unsensed area. Therefore, Ren et al. [8] went a step further and employed a neighboring spatial interpolation method to estimate the spectrum map for the entire RoI. This method is compared with a tensor completion, and the proposed approach achieves the minimum error rate at a low sparsity level. Note that in addition to the above methods there are Kriging, kernel-based, and improved tensor completion methods. Ren et al. proposed a spatiotemporal spectrum prediction method combining CNN and ResNet to capture spectrum matrix spatiotemporal features, using skip connections to improve performance and prevent gradient vanishing.

IV Learning Enhancement for Cross-Band Spectrum Prediction

In real-world spectrum services, certain frequency bands, like those used for cellular mobile communication, can gather a substantial amount of spectrum data. Conversely, other bands, such as military bands, may have only limited available spectrum data. For frequency bands with abundant labeled data, we can directly employ DL-models for training and prediction. However, in scenarios where data is scarce, training DL-models becomes challenging, resulting in less-than-optimal predictive performance. To tackle this challenge, there are currently two methods: one involves increasing the amount of trainable data for the target band to support model training, while the other involves transferring knowledge from related bands with ample available data to the target prediction network. Below, we detail how DL drives both methods.

IV-A DGAN for Cross-Band Spectrum Prediction

A deep GAN (DGAN) consists of a generator and a discriminator. The GAN’s aim is to train the generator network to produce realistic data, while the discriminator network distinguishes between generated and real data. These two networks engage in mutual adversarial training, propelling the learning and improvement of the model. Consequently, GAN can be employed to boost the quantity of trainable data for the target band.

Peng et al. [9] introduced a spectrum data conversion GAN to generate realistic data for the target band. Initially, they used Fréchet inception distance (FID) to measure differences between the target band and other bands, pinpointing the source band with the least difference from the target predictive band. Peng et al. then crafted a generator using a blend of CNN and LSTM to transform data from the source band to the target predictive band. Subsequently, the prediction model utilized both the transformed data and the original data from the target band for training and prediction.

It’s essential to note that this method doesn’t directly generate the target prediction data but leverages highly similar data from other bands. This is because traditional GANs have specific requirements for the quantity of target samples, and insufficient target samples can lead to GAN instability. Additionally, this approach reduces the dependency of the GAN on existing data from the target band. Lin et al. [10] similarly utilized FID to identify a source band with high similarity to the target band. They then devised a deep convolutional GAN (DCGAN) for pre-training using the source band. Subsequently, transfer learning was employed to fine-tune the pre-trained model onto the target band, generating data with minimal differences. Finally, the generated data for the target band, along with the original data, was used for training and prediction in a residual prediction model.

IV-B DTL for Cross-Band Spectrum Prediction

Transfer learning (TL) aims to distill knowledge from one or multiple source tasks and apply that knowledge to related but different target tasks. Building upon this TL concept, deep transfer learning (DTL) has emerged as a secondary approach to tackle this challenge. DTL efficiently achieves cross-band spectrum prediction by transferring pre-trained models from the source band to the target task.

Lin et al. [11] implemented cross-band prediction through a naive TL approach by transferring a prediction model comprised of LSTM units. To ensure positive TL, Lin et al. analyzed the similarity between source and target bands and used dynamic time warping to measure the similarity of each frequency point between source and target bands. Subsequently, Lin et al. employed a transfer component analysis method, utilizing maximum mean discrepancy as the distance measurement, based on the marginal distribution of features to obtain features with strong transferability.

For an additional performance enhancement, Li et al. proposed a transfer time-frequency fusion attention network (T-TF²AN) to achieve cross-band spectrum prediction. The primary challenge in applying TL for TF²AN stems from the gaps between source and target bands, resulting from variations in spatial, temporal, or spectral domains. Li et al. addressed these challenges through weighted TL, adaptively discovering and transferring shared knowledge while mitigating the negative impact of specific domain patterns from the source spectrum. The core components of the designed weighted TL are the shared pattern learner and the loss with adaptive weights. The shared pattern learner assigns higher weights to source spectrum data similar to the target spectrum, incorporating more shared patterns. Adaptive weights are determined by appropriately weighting the loss from the source domain, diminishing the risk of negative transfer and effectively enhancing the model’s performance on the target domain dataset. It’s noteworthy that the T-TF²AN model not only considers cross-band prediction but also takes into account spatial and temporal transferability.

V A ViTransLSTM Spatiotemporal Spectrum Prediction Framework

V-A Preblem Description

Assuming that there are $S$ sparsely distributed spectrum sensors in a RoI divided into $M(\text{rows})\times N(\text{columns})$ grids to measure the radio signal power in the area. Considering the hardware cost, it is not feasible to deploy sensors at every location. Therefore, we adopt the commonly used inverse-distance spatial interpolation method to fill in the missing signal power at unmeasured locations. These measurements at any given time step $t$ can be represented as a tensor $\mathcal{X}_{t}\in\mathbb{R}^{C\times M\times N}$ , where the measurements are mapped to three channels ( $C=3$ ) using a common Jet colormap. The spatiotemporal spectrum prediction problem involves predicting a spectrum sequence of the most probable length- $K$ time steps in the future based on a spectrum sequence of historical length- $J$ time steps, including measurements at the current time step, represented as

	$\displaystyle\hat{\mathcal{X}}_{t+1}$	$\displaystyle,\cdots,\hat{\mathcal{X}}_{t+K}=$		(1)
		$\displaystyle\mathop{\text{arg max}}\limits_{\mathcal{X}_{t+1},\cdots,\mathcal{X}_{t+K}}p(\mathcal{X}_{t+1},\cdots,\mathcal{X}_{t+K}\|\mathcal{X}_{t-J+1},\cdots,\mathcal{X}_{t}).$		(1)

V-B ViTransLSTM Framework

In spatiotemporal spectrum prediction, most works adopt convolutional structures to capture spatial correlations, such as PredRNN used in [7]. However, due to the limitations of convolutional kernel size, convolutional structures may struggle to capture global information. The fixed parameter-sharing mechanism in convolution operations makes it challenging for the model to handle spatiotemporal correlations at different locations. Compared to convolutional structures, vision Transformers (ViTs) excel at capturing spatial correlations due to their self-attention-based global learning patterns. Inspired by this, we integrate ViT and PredRNN to propose an extended spatiotemporal spectrum prediction framework called ViTransLSTM. The innovation of this framework lies in replacing the core component ST-LSTM in PredRNN with our designed ViT-LSTM component. The overall structure of the ViT-LSTM component is shown on the left of Fig. 3.

As a variant of ST-LSTM, spatiotemporal ViT-LSTM consists of two sets of gate structures: a standard temporal ViTransMemory and a spatiotemporal ViTransMemory. In the standard temporal ViTransMemory, all the inputs $\mathcal{X}_{t-J+1}$ , $\cdots$ , $\mathcal{X}_{t}$ , hidden states $\mathcal{H}_{t-J+1}$ , $\cdots$ , $\mathcal{H}_{t}$ , cell outputs $\mathcal{C}_{t-J+1}$ , $\cdots$ , $\mathcal{C}_{t}$ , and gates, i.e., input ViTransGate $i^{vit}_{t}$ , input-modulation ViTransGate $g^{vit}_{t}$ , forget ViTransGate $f^{vit}_{t}$ are 3D tensors in $\mathbb{R}^{C\times M\times N}$ . In the spatiotemporal ViTransMemory, all the inputs $\mathcal{X}_{t-J+1}$ , $\cdots$ , $\mathcal{X}_{t}$ , hidden states $\mathcal{M}^{l-1}_{t-J+1}$ , $\cdots$ , $\mathcal{M}^{l-1}_{t}$ , cell outputs $\mathcal{M}^{l}_{t-J+1}$ , $\cdots$ , $\mathcal{M}^{l}_{t}$ , and gates ${i^{{}^{\prime}}}^{vit}_{t}$ , ${g^{{}^{\prime}}}^{vit}_{t}$ , ${f^{{}^{\prime}}}^{vit}_{t}$ , $o^{vit}_{t}$ are also 3D tensors in $\mathbb{R}^{C\times M\times N}$ . The equations of ViT-LSTM in the $l$ -th layer are shown as follows:

		$\displaystyle g^{vit}_{t}=\text{tanh}(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{H}_{t-1}^{l})+b_{g})$		(2)
		$\displaystyle i^{vit}_{t}=\sigma(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{H}_{t-1}^{l})+b_{i})$
		$\displaystyle f^{vit}_{t}=\sigma(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{H}_{t-1}^{l})+b_{f})$
		$\displaystyle\mathcal{C}_{t}^{l}=f^{vit}_{t}\odot\mathcal{C}_{t-1}^{l}+i^{vit}_{t}\odot g^{vit}_{t}$
		$\displaystyle{g^{{}^{\prime}}}^{vit}_{t}=\text{tanh}(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{M}_{t}^{l-1})+b^{{}^{\prime}}_{g})$
		$\displaystyle{i^{{}^{\prime}}}^{vit}_{t}=\sigma(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{M}_{t}^{l-1})+b^{{}^{\prime}}_{i})$
		$\displaystyle{f^{{}^{\prime}}}^{vit}_{t}=\sigma(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{M}_{t}^{l-1})+b^{{}^{\prime}}_{f})$
		$\displaystyle\mathcal{M}_{t}^{l}={f^{{}^{\prime}}}^{vit}_{t}\odot\mathcal{M}_{t}^{l-1}+{i^{{}^{\prime}}}^{vit}_{t}\odot{g^{{}^{\prime}}}^{vit}_{t}$
		$\displaystyle o^{vit}_{t}=\sigma(\text{vit}(\mathcal{X}_{t})+\text{vit}(\mathcal{H}_{t-1}^{l})+\text{vit}(\mathcal{C}_{t}^{l})+\text{vit}(\mathcal{M}_{t}^{l})+b_{o})$
		$\displaystyle\mathcal{H}_{t}^{l}=o^{vit}_{t}\odot\text{tanh}(\text{linear}([\mathcal{C}_{t}^{l},\mathcal{M}_{t}^{l}])),$

where $\sigma$ is the sigmoid activation function, $\ast$ is the convolution operator, $\odot$ is the Hadamard product, $[\cdot,\cdot]$ is the operation of concatenating two tensors, and $\text{vit}(\cdot)$ is two consecutive Swin Transformer [13] blocks (see Fig. 3, right), which includes a standard multi-head self-attention module with non-overlapping window (W-MSA), a MSA module with non-overlapping shifted window (SW-MSA), and a feed-forward network (FFN, is a 2-layer multilayer perceptron (MLP), with Gaussian error linear unit (GELU) non-linearity in between) following each MSA module. Layer normalization (LN) is applied before each MSA module and FFN, and a residual connection is applied after each module.

As shown in (2), unlike ST-LSTM, ViT-LSTM introduces two new designs: ViTransGate and ViTransMemory. Specifically, 1) ViTransGate: This design replaces the convolutional networks in all gate structures of the original ST-LSTM with ViT (i.e., Swin Transformer). Taking the input ViTransGate $i^{vit}_{t}$ (see Fig. 3, right) as an example, it is obtained by feeding $\mathcal{X}_{t}$ and $\mathcal{H}_{t-1}^{l}$ into the $\text{vit}(\cdot)$ module, followed by the activation function. 2) ViTransMemory: From (2), this design feeds the tamporal memory unit $\mathcal{C}_{t}^{l}$ and the spatiotemporal memory unit $\mathcal{M}_{t}^{l}$ separately into the $\text{vit}(\cdot)$ .

Compared to the original ST-LSTM, the advantages of ViT-LSTM include: 1) The ViTransGates in ViT-LSTM capture local and global spectral pattern features of the spectrum map by computing self-attention within the standard window and the shifted window, respectively. In contrast, the original ST-LSTM can only capture local features through convolution operations, which may lead to the loss of critical spatial spectrum pattern features. 2) The ViTransMemory also employs the self-attention mechanism to integrate spectral information from different positions, capturing more diverse feature representations to enhance memory. Additionally, it works in conjunction with LSTM to store long-range spatiotemporal spectrum dependencies. In contrast, convolution operations may exhibit biases toward local patterns, potentially leading to the loss of specific memory features.

V-C Experimental Results

To validate the effectiveness of the proposed ViTransLSTM, we use spectrum data collected by four sensors via the Electrosense open API (https://electrosense.org/) as the dataset. This dataset is reconstructed into $64\times 64$ spectrum maps based on inverse distance spatial interpolation of the data from these four sensors (sensor ID: $\text{test}\_\text{yago}\_3$ , $\text{test}\_\text{yago}$ , $\text{rack}\_3$ , and $\text{rack}\_2$ ). The data collection period spans from Jun. 1 to Jun. 3, 2021, with a frequency band of 610 MHz and a sampling interval of 1 mins. The dataset (a total of 3200 samples) is divided into training, validation, and test sets in a ratio of 4:1:1. In the ViTransLSTM, the window size of the $\text{vit}(\cdot)$ network is $4\times 4$ , the patch size is $4\times 4$ , and the number of heads is 8. We select recently proposed spectrum prediction models, including ConvLSTM, PredRNN [7], and the upgraded version PredRNN V2 [14], as baseline comparisons. We run all the experiments on a PC with 3.30 GHz Intel Core i9-10940X CPU, NVIDIA GTX 3090Ti graphic, and 64 GB RAM using the Pytorch 1.8.0. All models are trained using the Adam optimizer with a starting learning rate of 0.001. The training process is stopped after 1000 iterations. Unless otherwise specified, batch size is 4.

Under the setup of input-10-predict-10, Fig. 4 presents the comparison results of the proposed ViTransLSTM and all baselines in terms of average MSE and PSNR. As shown in Fig. 4, the proposed ViTransLSTM outperforms all baselines across all evaluation metrics. For example, in Fig. 4(a), the average MSE of ViTransLSTM is 28.99%, 19.70%, and 15.58% lower than that of ConvLSTM, PredRNN, and PredRNN V2, respectively. Similarly, in Fig. 4(b), the average PSNR of ViTransLSTM is 5.29%, 3.13%, and 2.72% higher than that of ConvLSTM, PredRNN, and PredRNN V2, respectively. These results demonstrate that the proposed ViTransLSTM, which integrates visual self-attention with LSTM, can effectively capture deeper spatiotemporal correlation features of spectrum usage patterns, thereby achieving more accurate predictions.

VI Research Challenges and Opportunities

Although DL has achieved beneficial performance gains in spectrum prediction, there are still numerous open challenges for further study, as summarized in Fig. 5 at a glance.

VI-A Spectrum Prediction with Incomplete and Corrupted Data

Owing to interference from spectrum measuring equipment, signal propagation environments, and other variables, spectrum measurements may contain missing and abnormal data. This compromise in data quality amplifies the complexity of modeling. In an initial exploration, tensor completion is used to address this challenge. However, tensor completion algorithms struggle to estimate nonlinear missing data accurately. Fortunately, the diffusion model uses a stepwise de-noising approach to grasp the distribution of nonlinear spectrum data, resulting in the generation of high-quality spectrum data that is robust to deletions and anomalies.

VI-B Multi-Modal Integration Assisted Spectrum Prediction

Despite achieving impressive accuracy, existing DL-based spectrum prediction methods rely only on uni-modal data, highlighting the need for a robust DL-based multi-modal fusion framework to integrate features from each modality (such as spectrum occupancy series and spectrogram). To the best of our knowledge, deep multi-modal learning has yet to be applied in the field of spectrum prediction, despite its successful implementation in other domains such as weather and traffic predictions. However, a significant challenge involves the necessity to assign relative weights to different modalities. A viable approach could use a multi-objective optimization algorithm to balance the optimal weight of each modal by treating each weight as an optimization objective.

VI-C High Complexity with Multi-Tasking Spectrum Prediction

Multi-tasking spectrum prediction is often considered to meet the customization needs of massive spectrum users. However, when various DL-based spectrum prediction models are deployed in multiple spectrum prediction tasks, the intrinsic complexity of the models often necessitates substantial computational resources for training. To address this issue, it is recommended to customize lightweight DL models for spectrum prediction to achieve a flexible balance between performance and complexity through pruning and knowledge distillation techniques. Further, the lightweight model uses distributed learning [15] for each task node to improve computing efficiency.

VI-D Cross-RoI Spatiotemporal Spectrum Prediction

Unlike cross-band spectrum prediction, cross-region spatiotemporal spectrum prediction extends into the time-frequency-space dimension. TL addresses this challenge, with cross-region TL going beyond time-frequency knowledge to include spatial information. This introduces complexity, leading to a challenge known as knowledge drift. As more knowledge transfers, the problem of knowledge drift becomes more severe. Enhancing transfer learning efficiency and spectrum prediction accuracy in the target region is a significant challenge. To address this, importance weight TL reweights source domain knowledge to improve model performance. Furthermore, cross-region cross-band spectrum prediction based on DTL presents an even more formidable problem. The relevance of the target domain to the source domain is lower compared to the previous challenge, intensifying the knowledge drift problem in transfer learning. Refinement TL emerges as a solution, involving the classification of transfer knowledge to improve transfer efficiency.

VI-E Spectrum Prediction for Multi-Domain Information Fusion

The electromagnetic spectrum is evolving, integrating space, air, and ground communication modes like vehicle-to-vehicle (V2V), unmanned aerial vehicle (UAV)-assisted, air-to-ground, and satellite communication. These modes create complex spectrum usage patterns. Joint spectrum prediction for space, air, and ground is challenging due to multi-domain information fusion, addressed by spectrum information semanticization. Additionally, traditional-DL struggles in dynamic wireless systems, requiring frequent retraining for changing data distributions, consuming time and memory. Deep reinforcement learning offers a solution by training intelligent agents to perform tasks and derive optimal strategies from experience. In highly dynamic wireless systems, where the environment constantly changes, the agent receives feedback through a reward mechanism and makes rational decisions.

VII Conclusion

We have commenced by elaborating on the motivation of applying DL in spectrum prediction. Firstly, DL is eminently suitable for extracting the complex nonlinear features encountered. Secondly, the adaptive fitting capability of DL enables parameter knowledge fine-tuning required by the predicted band change. Based on these motivations, DL is proposed for extracting both the intrinsic nonlinear time-frequency- and time-frequency-space-domain features, thereby inspiring short/long-term spectrum predictions. Further, DL was employed for data enhancement of target prediction bands and knowledge transfer of source bands to achieve cross-band spectrum prediction. Then, a framework combining visual self-attention and LSTM, named ViTransLSTM, is proposed to achieve high-precision spatiotemporal spectrum prediction. We have validated the effectiveness of the proposed framework on a real-world spectrum dataset. Finally, we have provided the future associated challenges and potential opportunities of applying DL to spectrum prediction.

References

[1] H. Wang, Q. Wu, and W. Chen, “Movable antenna enabled interference network: Joint antenna position and beamforming design,” IEEE Wireless Commun. Lett., vol. 13, no. 9, pp. 2517–2521, 2024.
[2] G. Ding, Y. Jiao, J. Wang, Y. Zou, Q. Wu, Y.-D. Yao, and L. Hanzo, “Spectrum inference in cognitive radio networks: Algorithms and applications,” IEEE Commun. Surv. Tutor., vol. 20, no. 1, pp. 150–182, 2017.
[3] Y. Gao, C. Zhao, and N. Fu, “Joint multi-channel multi-step spectrum prediction algorithm,” in Proc. IEEE Veh. Technol. Conf., 2021, pp. 1–5.
[4] L. Yu, Y. Guo, Q. Wang, C. Luo, M. Li, W. Liao, and P. Li, “Spectrum availability prediction for cognitive radio communications: A DCG approach,” IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 2, pp. 476–485, 2020.
[5] G. Pan, Q. Wu, G. Ding, W. Wang, J. Li, F. Xu, and B. Zhou, “Deep stacked autoencoder based long-term spectrum prediction using real-world data,” IEEE Trans. Cogn. Commun. Netw., vol. 9, no. 3, pp. 534–548, 2023.
[6] G. Pan, Q. Wu, G. Ding, W. Wang, J. Li, and B. Zhou, “An autoformer-CSA approach for long-term spectrum prediction,” IEEE Wireless Commun. Lett., vol. 12, no. 10, pp. 1647–1651, 2023.
[7] X. Li, Z. Liu, G. Chen, Y. Xu, and T. Song, “Deep learning for spectrum prediction from spatial–temporal–spectral data,” IEEE Commun. Lett., vol. 25, no. 4, pp. 1216–1220, 2021.
[8] X. Ren, H. Mosavat-Jahromi, L. Cai, and D. Kidston, “Spatio-temporal spectrum load prediction using convolutional neural network and ResNet,” IEEE Trans. Cogn. Commun. Netw., vol. 8, no. 2, pp. 502–513, 2022.
[9] C. Peng, R. Zhu, M. Zhang, and L. Wang, “Cross-band spectrum prediction algorithm based on data conversion using generative adversarial networks,” China Commun., vol. 20, no. 10, pp. 136–152, 2023.
[10] F. Lin, J. Chen, G. Ding, Y. Jiao, J. Sun, and H. Wang, “Spectrum prediction based on GAN and deep transfer learning: A cross-band data augmentation framework,” China Commun., vol. 18, no. 1, pp. 18–32, 2021.
[11] F. Lin, J. Chen, J. Sun, G. Ding, and L. Yu, “Cross-band spectrum prediction based on deep transfer learning,” China Commun., vol. 17, no. 2, pp. 66–80, 2020.
[12] K. Li, C. Li, J. Chen, Q. Zhang, Z. Liu, and S. He, “Boost spectrum prediction with temporal-frequency fusion network via transfer learning,” IEEE Trans. Mobile Comput., vol. 22, no. 6, pp. 3209–3223, 2023.
[13] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2021, pp. 10 012–10 022.
[14] Y. Wang, H. Wu, J. Zhang, Z. Gao, J. Wang, P. S. Yu, and M. Long, “Predrnn: A recurrent neural network for spatiotemporal predictive learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2208–2225, 2023.
[15] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3579–3605, 2021.