Image Restoration under Semantic Communications
This research is funded by Hanoi University of Science and Technology (HUST) under project number T2021-SAHEP-002.
Abstract
Semantic communication has emerged as a paradigm that moves beyond Shannon’s classical framework by transmitting and receiving semantic information instead of data bits or symbols that are treated regardless of their content. This paper proposes a two-stage reconstruction process to boost the system’s performance. In the first stage, the image information is decoded from the noisy received data by exploiting the channel knowledge. In the second stage, the decoded image is enhanced by a post-filter that exploits image statistics. Different metrics are used to evaluate the image restoration quality of the considered model. Numerical results obtained with natural images verify the improvement of the proposed two-stage reconstruction process over the conventionally decoded data. Moreover, the different metrics, each assessing the system performance by its own criterion, can conflict with one another.
Index Terms:
Image transmission, semantic communications, bit error ratio, peak signal-to-noise ratio, structural similarity
I Introduction
Communication technologies have redefined the way people communicate and have become one of the main engines propelling the development of society [1]. Within this field, semantic communications have recently attracted a lot of interest. Up to and including the fifth-generation (5G) mobile networks, most communication systems have been designed around rate metrics, for instance, throughput, spectrum/energy efficiency, and latency. Claude Shannon treated the semantic aspect as irrelevant to the engineering problem, and this perspective has been kept until now [2]. Yet this point of view is about to change: as we approach the Shannon capacity limit with advanced encoding (decoding) and modulation techniques, semantic communications are expected to play an important role in sixth-generation (6G) networks [3, 4]. In 5G systems, Massive MIMO (multiple-input multiple-output) technology has become one of the core technologies, and it will also be a foundation for the next generations of mobile networks [5]. This technology focuses the signal on the user by using antenna arrays to beamform directly towards the user, which avoids wasting energy and increases transmission efficiency [6].
The transmission of images and videos has long been one of the main problems of telecommunications networks. Data traffic from image transfer, streaming, video conferencing, and online learning is growing exponentially; it consumes a large share of system resources and requires longer transmission and processing times, so a highly efficient communication paradigm is needed more than ever. Furthermore, image and video data are strongly correlated and require a completely different processing mechanism than random data [7]. As mobile data speeds increase, the amount of multimedia data transmitted will also increase in the following years.
I-A Related Works
Many studies in the literature have investigated how to apply semantic communication to enhance the transmission performance of multimedia signals. In [8], the authors discussed the principles, architecture, and application areas of semantic communication, as well as approaches to it based on machine learning (ML) and the Internet of Things (IoT). P. Jiang et al. [9] proposed a semantic-based video conferencing network that uses an incremental redundancy hybrid automatic repeat request (IR-HARQ) feedback framework to guarantee the transmission quality according to the signal-to-noise ratio (SNR) and channel state information (CSI). A deep learning enabled semantic communication system (DeepSC-S) for speech signal transmission was proposed in [10]; it learns and extracts speech signals at the transmitter and then recovers them at the receiver directly from the received features. The authors compared its performance with traditional approaches to verify the effectiveness of the system, especially in deeply faded environments. To the best of our knowledge, no related work has considered using semantic information at the receiver to enhance the reconstruction quality while evaluating the system performance with different evaluation metrics.
I-B Contributions
This paper studies an uplink massive multiple-input multiple-output (MIMO) semantic transmission system where a user transmits image data to the base station (BS). The semantic image data carries the spatial correlation between pixels, in contrast to random unstructured signals. The data is then modulated and transmitted over the wireless channel. At the BS, the alternating direction method of multipliers (ADMM) is first used to reconstruct the transmitted image signals by minimizing the transmission error under the random fluctuations of the fading channels. After that, we apply a post-filter to the output of the ADMM to improve the reconstruction quality by mitigating residual errors based on the image correlation structure [11]. This paper considers traditional filters, namely a median filter, a Gaussian filter, and a block-matching and 3D (BM3D) filter. Finally, the performance of image data recovery is assessed by three metrics commonly used in wireless communication (the physical layer) and multimedia (the application layer), namely the bit error rate (BER), the peak signal-to-noise ratio (PSNR), and the structural similarity (SSIM), over different signal-to-noise ratio (SNR) values. Numerical results demonstrate the importance of exploiting the prior image information to enhance the reconstruction quality. Besides, a performance metric commonly used in the physical layer may contradict one used in the application layer.
II System Model and Problem Formulation
The semantic communication system involves two stages: the decoding stage and the post-processing restoration stage. Specifically, we consider a user communicating with the BS, both equipped with multiple antennas. After receiving the transmitted signals, the BS decodes the image data. In the post-transmission restoration stage, the reconstructed image is enhanced by a low-pass filter that exploits the image structure.

II-A System Model
We consider the uplink of a Massive MIMO system where the BS, equipped with $M$ antennas, communicates with a user having $K$ antennas. Let us suppose that this user selects the semantic content of the image data and sends it to the BS. The image content $\mathbf{X}$ is simply stacked into column vectors. These column vectors are further modulated so that each vector is mapped to constellation points of the alphabet set $\mathcal{A}$. Each value in a vector corresponds to one or several constellation points. This mapping process is represented as $\mathbf{S} = \mathcal{M}(\mathbf{X})$, where $\mathbf{S}$ contains the modulated symbols that are transmitted over the MIMO channel and received at the BS, and $\mathcal{M}(\cdot)$ indicates the mapping or the modulation. In particular, let us define the received signal at the BS after propagating over the environment as
$$\mathbf{Y} = \sqrt{\rho}\,\mathbf{H}\mathbf{S} + \mathbf{N}, \tag{1}$$
where $\mathbf{H} \in \mathbb{C}^{M \times K}$ is the channel matrix between the user and the BS, and $\mathbf{N}$ represents the additive white Gaussian noise (AWGN), where each element is a circularly symmetric complex Gaussian variable with zero mean and variance $\sigma_{n}^{2}$, i.e., distributed as $\mathcal{CN}(0, \sigma_{n}^{2})$. Here, $\mathcal{CN}(\cdot, \cdot)$ denotes the circularly symmetric complex Gaussian distribution and $\rho$ is the transmit power allocated to each modulated data symbol. The signal-to-noise ratio (SNR) is defined by $\rho / \sigma_{n}^{2}$. Once the BS has received all the noisy signals $\mathbf{Y}$, it decodes them and exploits the inverse projection to reconstruct the image, denoted by $\hat{\mathbf{X}}$, as $\hat{\mathbf{X}} = \mathcal{M}^{-1}(\hat{\mathbf{S}})$, where $\hat{\mathbf{S}}$ contains the decoded symbols and $\mathcal{M}^{-1}(\cdot)$ represents the demodulation or the inverse mapping. Let us simply denote . We emphasize that, from the signal received in (1) and the given instantaneous channels, the BS decodes the image data information by the recovery model presented hereafter.
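The uplink model of this subsection can be simulated with a short NumPy sketch. Everything concrete here is an assumption made for illustration: 4-QAM mapping, 64 BS antennas, unit noise variance (so the SNR equals the transmit power), and a zero-forcing decoder used only as a simple baseline in place of the recovery model of the paper. The function names are hypothetical.

```python
import numpy as np

def qpsk_modulate(bits):
    """Map bit pairs to unit-energy 4-QAM symbols (bit 0 -> +, bit 1 -> -)."""
    pairs = np.asarray(bits, dtype=np.int64).reshape(-1, 2)
    return ((1 - 2 * pairs[:, 0]) + 1j * (1 - 2 * pairs[:, 1])) / np.sqrt(2)

def qpsk_demodulate(symbols):
    """Hard-decision inverse mapping of qpsk_modulate."""
    symbols = np.asarray(symbols)
    pairs = np.stack([symbols.real < 0, symbols.imag < 0], axis=1)
    return pairs.astype(np.uint8).ravel()

def complex_gaussian(shape, rng):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def transmit(image, n_rx=64, n_tx=4, snr_db=10.0, seed=0):
    """Send an 8-bit image through Y = sqrt(rho) * H @ S + N (assumed sizes)."""
    rng = np.random.default_rng(seed)
    bits = np.unpackbits(np.asarray(image, dtype=np.uint8))
    symbols = qpsk_modulate(bits).reshape(n_tx, -1)  # one column per channel use
    rho = 10 ** (snr_db / 10)                        # SNR = rho since noise variance is 1
    h = complex_gaussian((n_rx, n_tx), rng)          # Rayleigh fading channel
    noise = complex_gaussian((n_rx, symbols.shape[1]), rng)
    y = np.sqrt(rho) * h @ symbols + noise
    return y, h, rho

def zero_forcing_decode(y, h, rho, shape):
    """Baseline decoder (an assumption): invert the channel, demodulate, rebuild."""
    s_hat = np.linalg.pinv(h) @ y / np.sqrt(rho)
    bits = qpsk_demodulate(s_hat.ravel())
    return np.packbits(bits).reshape(shape)
```

Calling `transmit` on a small `uint8` image and passing the result to `zero_forcing_decode` gives a decoded image whose errors grow as `snr_db` is lowered.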
II-B Problem Formulation
The measurement data received at the BS contains noise, artifacts, and losses caused by the environment. The contamination may be extremely severe, especially in the low-SNR regime. Our goal is to reconstruct the image so that it retains as much of the desired information in the original semantic image data as possible. Consequently, the optimization problem that we would like to solve is formulated as follows
(2) | ||||
subject to | ||||
where $\|\cdot\|_{F}$ is the Frobenius norm. The objective of problem (2) is to recover semantic image data that is close to the original. The mathematical operation in the first constraint reconstructs the entire image from the partial image information. This optimization problem is non-convex from the mathematical point of view. Besides, the received measurements are contaminated by the fading channels during transmission and by noise at the receiver, so useful information may be lost. In addition, problem (2) is based only on features of the physical layer. In other words, it does not consider the image structure as a constraint, and therefore there is room to achieve a better reconstruction quality than that of its locally optimal solution.
III Image Recovery and Restoration
This section considers the two-stage image reconstruction process. In the first stage, the ADMM algorithm is used to obtain an initial decoded image. Then, in the second stage, the image structure is exploited by a low-pass filter to enhance the reconstruction quality.
III-A Image Recovery
By utilizing the ADMM algorithm in [12, 13], the first stage obtains a local solution to problem (2). This algorithm handles the inherent nonconvexity of (2) by approximating it with a convex optimization problem. From the received measurement data and the channel matrix, the ADMM algorithm iteratively decodes the constellation points contained in the received measurements. Ideally, the decoded points coincide with the modulated symbols sent by the user. From the set of decoded points, one can obtain the decoded image information, which is then reshaped to restore the image. As presented in the previous section, the decoded image may still not be good in terms of, for example, the BER, PSNR, and SSIM due to the locality of the solution. Therefore, after the first stage, the decoded image is further processed to improve the image quality, with the expectation that the texture and fine details of the image will be better captured. This step is motivated by the fact that natural images are strongly correlated. Low-pass filters exploit exactly this correlation, so we can borrow them for our application.
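To make the first stage more concrete, the following is a minimal ADMM sketch for symbol detection. It is not the algorithm of [12, 13]: as a simplifying assumption it uses a real-valued channel, symbols in {-1, +1}, and a box relaxation of the constellation constraint, and the names `admm_box_detector`, `rho`, and `n_iter` are hypothetical.

```python
import numpy as np

def admm_box_detector(y, h, rho=1.0, n_iter=50):
    """Approximately solve min ||y - h x||^2 s.t. x in [-1, 1]^n, then round.

    The problem is split as x = z, with z constrained to the box, and solved
    with scaled-form ADMM. This is a convex relaxation of detecting +/-1 symbols.
    """
    n = h.shape[1]
    gram_inv = np.linalg.inv(h.T @ h + rho * np.eye(n))  # reused in every x-update
    hty = h.T @ y
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    for _ in range(n_iter):
        x = gram_inv @ (hty + rho * (z - u))  # regularized least-squares step
        z = np.clip(x + u, -1.0, 1.0)         # projection onto the box
        u = u + x - z                         # dual update on the residual x - z
    return np.where(z >= 0, 1.0, -1.0)        # hard decision to the constellation
```

The splitting keeps each step cheap: the x-update is a linear solve with a fixed matrix, and the z-update is an elementwise clip.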
III-B Image Restoration
In the second stage, the noise and residual errors are removed by a filtering process with an effective kernel. The kernels may be classified into two types: pixel-based kernels, e.g., the Gaussian filter and the median filter, and patch-based kernels, e.g., the block-matching and 3D (BM3D) filter. This section reviews the basics of these techniques.
III-B1 Median filter
The median filter is a non-linear filter that is popular in image restoration because it removes noise effectively while preserving strong edges. It is particularly effective against impulsive “salt and pepper” noise. Specifically, the pixel values in a neighborhood surrounding the considered pixel are sorted into numerical order, and the median value replaces the considered pixel.
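The median filtering step described above can be sketched in a few lines of NumPy. The window size of 3 and the edge-replicating padding are assumptions chosen for illustration, and `median_filter` is a hypothetical helper name.

```python
import numpy as np

def median_filter(img, size=3):
    """Replace each pixel by the median of its size x size neighborhood (size odd)."""
    pad = size // 2
    padded = np.pad(np.asarray(img, dtype=float), pad, mode="edge")
    # windows[i, j] is the size x size neighborhood centered on pixel (i, j)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))
```

An isolated bright pixel on a flat background is an outlier in every window that contains it, so it never becomes the median and is removed.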
III-B2 Gaussian filter
The Gaussian filter is utilized to blur images and therefore remove noise. This filter uses the bivariate Gaussian distribution as a point-spread function. The goal is achieved by convolving a noisy image with a Gaussian kernel. Mathematically, the filtered image $\tilde{\mathbf{X}}$ is obtained as
$$\tilde{\mathbf{X}} = \mathbf{G}_{k} \ast \hat{\mathbf{X}}, \tag{3}$$
where $\ast$ is the convolution operator, $\mathbf{G}_{k}$ is the Gaussian kernel with $k$ being the kernel’s size, and $\hat{\mathbf{X}}$ is the noisy image. The Gaussian kernel is defined as
$$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right), \tag{4}$$
where each pair $(x, y)$ stands for a coordinate in the kernel, measured from the kernel center, and $\sigma$ is the standard deviation of the distribution. We stress that the smoothness level of a Gaussian filter is controlled by changing the variance $\sigma^{2}$.
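A minimal sketch of Gaussian filtering follows. It samples the Gaussian kernel on an odd-sized grid centered at zero and, as an assumption that differs slightly from the continuous formula, normalizes the kernel to sum to one so that the mean brightness is preserved. The reflective padding and the function names are also illustrative choices.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sampled 2D Gaussian on a size x size grid (size odd), normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2          # coordinates relative to the center
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_filter(img, size=5, sigma=1.0):
    """Convolve img with the Gaussian kernel; the output has the same shape as img."""
    kernel = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(np.asarray(img, dtype=float), pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    # The kernel is symmetric, so this correlation equals the convolution.
    return np.einsum("ijkl,kl->ij", windows, kernel)
```

A larger `sigma` spreads the weights over more neighbors and therefore smooths more strongly, at the cost of blurring edges.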
III-B3 BM3D filter
The BM3D filter is the patch-based noise reduction filter proposed by Dabov et al. in [14]. Patch-based filtering algorithms have become popular in image processing. The underlying idea of the noise reduction is that when patches with a similar structure are averaged, the noise cancels out, which is particularly effective when the noise is symmetric with zero mean. The advantage of this method is that it smooths out flat areas while preserving fine details and sharp edges. However, this type of filter often has high computational complexity because similar patches must be searched for nonlocally, which results in longer processing time, and its penalty parameters are nontrivial to select.
The algorithm proposed in [14] is based on a sparse representation in the transform (frequency) domain. The noisy image is first processed by extracting reference blocks from it; for each reference block, similar blocks are found by block matching and stacked together to form a three-dimensional array (group). Collaborative filtering is then performed on the group, and the 2D estimates of all grouped blocks are returned to their original locations. These estimated blocks can overlap each other; thus, in the end, the authors aggregate all the estimates to form the final filtered image. As a result, this filter reduces the noise of the image while also preserving the details of the original image.
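The grouping and aggregation idea can be illustrated with a heavily simplified sketch. This is not BM3D: the collaborative transform-domain filtering of [14] is replaced by a plain average over each group, and the patch size, search radius, and group size are arbitrary assumptions. The name `grouped_patch_average` is hypothetical.

```python
import numpy as np

def grouped_patch_average(img, patch=4, step=4, search=6, n_similar=8):
    """Block matching, group averaging, and aggregation of overlapping estimates."""
    img = np.asarray(img, dtype=float)
    rows, cols = img.shape
    acc = np.zeros_like(img)     # sum of estimates per pixel
    weight = np.zeros_like(img)  # number of estimates per pixel
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            ref = img[r:r + patch, c:c + patch]
            # Block matching: squared distance to every patch in the search window.
            candidates = []
            for rr in range(max(0, r - search), min(rows - patch, r + search) + 1):
                for cc in range(max(0, c - search), min(cols - patch, c + search) + 1):
                    block = img[rr:rr + patch, cc:cc + patch]
                    candidates.append((np.sum((block - ref) ** 2), rr, cc))
            candidates.sort(key=lambda item: item[0])
            group = candidates[:n_similar]
            # Stand-in for collaborative filtering: average the grouped patches.
            estimate = np.mean(
                [img[rr:rr + patch, cc:cc + patch] for _, rr, cc in group], axis=0
            )
            # Return the estimate to every grouped location.
            for _, rr, cc in group:
                acc[rr:rr + patch, cc:cc + patch] += estimate
                weight[rr:rr + patch, cc:cc + patch] += 1
    out = img.copy()             # pixels never covered keep their original value
    covered = weight > 0
    out[covered] = acc[covered] / weight[covered]
    return out
```

The nested search makes the cost grow with the square of the search radius, which reflects the complexity drawback mentioned above.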
Remark 1
This paper focuses on enhancing the reconstruction quality of images transmitted over fading channels. The semantic information, e.g., the spatial correlation among pixels and the noise information, is exploited to mitigate the remaining noise and artifacts.
IV Evaluation Metrics
In this section, we look at the metrics used to assess the performance of a telecommunications system and to evaluate the quality of an image. They give an intuitive view of the transmission efficiency and of the influence of the post-filtering process on the performance of a wireless network. In addition, we analyze the relationship between the evaluation metrics used for wireless systems and those used for image signals.
IV-A BER Metric
When data is transmitted from one point to another, the key quantity to evaluate is how many errors appear in the data received at the remote end, which makes the BER an important metric for measuring the performance of a communication system or a data channel. As the name implies, the BER is defined as the rate at which bit errors occur in a transmission system. The definition translates to the following formula
$$\mathrm{BER} = \frac{N_{\mathrm{err}}}{N_{\mathrm{bits}}}, \tag{5}$$
where $N_{\mathrm{err}}$ and $N_{\mathrm{bits}}$ are the total number of erroneous bits and the total number of transmitted bits, respectively. Based on the BER, we can evaluate the quality of the transmission process as well as the effectiveness of using digital filters in restoring the original image. In a communication system, the BER at the receiver side may be affected by different transmission factors, for instance, channel noise, interference, attenuation, and multipath fading.
The BER is a metric that is mainly used at the physical layer, so it only considers how accurately the received data matches the original data, without considering the actual quality of the image or the effect of noise. Therefore, we need to look at other metrics used to evaluate image quality.
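For image data, the BER can be computed by comparing the bit representations of the two images. The sketch below assumes 8-bit pixels; `bit_error_rate` is a hypothetical helper name.

```python
import numpy as np

def bit_error_rate(reference, received):
    """Fraction of differing bits between two 8-bit images of equal shape."""
    ref_bits = np.unpackbits(np.asarray(reference, dtype=np.uint8))
    rec_bits = np.unpackbits(np.asarray(received, dtype=np.uint8))
    return np.count_nonzero(ref_bits != rec_bits) / ref_bits.size
```

This also shows why the BER ignores image quality: changing a pixel from 127 to 128 is visually negligible, yet it flips all eight bits of that pixel.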
IV-B PSNR Metric
The PSNR expresses the ratio of the maximum possible value (power) of a signal to the power of the distortion noise that affects the quality of the signal. Because many signals have a very wide dynamic range (the ratio between the largest and smallest possible values of a changeable quantity), the PSNR is usually expressed on the logarithmic decibel scale.
Assume that the data under consideration is a 2D digital image signal. Let $\mathbf{f}$ be the original image data matrix of size $m \times n$, and let $\mathbf{g}$ be the image data matrix to be compared, which must have the same size as the original. Mathematically, the PSNR is defined as follows:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right), \tag{6}$$
where MAX is the maximum pixel value, which depends on the representation of the original image data. For example, if the input image has a double-precision floating-point representation, then MAX is 1. If it has an 8-bit representation, MAX is 255. The remaining parameter is the MSE, the mean squared error between the two images, which is calculated as follows:
$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \big(f(i,j) - g(i,j)\big)^{2}. \tag{7}$$
The MSE and the PSNR are both used to compare the quality of two images. The MSE represents the cumulative squared error between the restored and the original image, whereas the PSNR represents a measure of the peak error. The lower the MSE, the lower the error; the higher the PSNR, the more similar the restored image is to the original image, and thus the better the result. The main limitation of this metric is that two degraded images with the same PSNR value could contain very different types of errors, some of which are easier to detect than others, resulting in different perceived image quality.
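The two definitions translate directly into code. The sketch below assumes 8-bit images by default (MAX = 255) and returns infinity for identical images, where the MSE is zero; the function names are illustrative.

```python
import numpy as np

def mse(f, g):
    """Mean squared error between two images of the same shape."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.mean((f - g) ** 2)

def psnr(f, g, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(f, g)
    if err == 0:
        return float("inf")
    return 10 * np.log10(max_value ** 2 / err)
```

For images stored as floating-point values in [0, 1], `max_value=1.0` should be passed instead.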
IV-C SSIM Metric
Unlike the PSNR, the SSIM metric is used to evaluate the structural similarity between the noisy signal and the original data. It was first proposed in [15] by Zhou Wang et al., based on the assumption that human visual perception is highly adapted to extracting structural information from an image. This assessment models any image data using three factors: luminance, contrast, and structure [16].
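To calculate the SSIM value of two images $\mathbf{f}$ and $\mathbf{g}$, we first calculate the luminance value of each image:
$$\mu_{f} = \frac{1}{N} \sum_{i=1}^{N} f_{i}, \tag{8}$$
where $N$ is the number of pixels in the image $\mathbf{f}$, i.e., $N = mn$ for an image of size $m \times n$, and $f_{i}$ is the $i$-th pixel, which makes the luminance the average of all the pixels. After measuring the luminance, we calculate the contrast, which is the standard deviation of all pixels, as follows
$$\sigma_{f} = \left(\frac{1}{N-1} \sum_{i=1}^{N} (f_{i} - \mu_{f})^{2}\right)^{1/2}. \tag{9}$$
The SSIM formula is based on three comparison measurements between the input images: luminance, contrast, and structure. However, from these two values, computed for both images, we can calculate the SSIM index based on the reduced formula as follows
$$\mathrm{SSIM}(\mathbf{f}, \mathbf{g}) = \frac{(2\mu_{f}\mu_{g} + C_{1})(2\sigma_{fg} + C_{2})}{(\mu_{f}^{2} + \mu_{g}^{2} + C_{1})(\sigma_{f}^{2} + \sigma_{g}^{2} + C_{2})}, \tag{10}$$
in which $C_{1} = (K_{1}L)^{2}$ and $C_{2} = (K_{2}L)^{2}$ are constants, with $L$ being the dynamic range of pixel values and the constants $K_{1}, K_{2} \ll 1$. The covariance between the two images $\mathbf{f}$ and $\mathbf{g}$ (reshaped into vectors of $N$ elements), denoted by $\sigma_{fg}$, is defined as
$$\sigma_{fg} = \frac{1}{N-1} \sum_{i=1}^{N} (f_{i} - \mu_{f})(g_{i} - \mu_{g}). \tag{11}$$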
The resultant SSIM index is a decimal value that typically lies between 0 and 1 (it can become negative when the two images are negatively correlated). The value 1 is only reachable in the case of two identical sets of data and therefore indicates perfect structural similarity, whereas a value of 0 indicates no structural similarity. Unlike techniques such as the MSE or PSNR, the SSIM accounts for structural information, i.e., the fact that pixels have strong inter-dependencies, especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
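A minimal sketch of the reduced SSIM formula, evaluated once over the whole image, is given below. Practical implementations usually compute the index in local windows and average the results; the single global window, the stabilizing constants 0.01 and 0.03, and the name `ssim_global` are assumptions for illustration.

```python
import numpy as np

def ssim_global(f, g, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Reduced SSIM formula computed over all pixels of two same-sized images."""
    f = np.asarray(f, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    mu_f, mu_g = f.mean(), g.mean()                      # luminance terms
    var_f, var_g = f.var(ddof=1), g.var(ddof=1)          # squared contrast terms
    cov_fg = np.sum((f - mu_f) * (g - mu_g)) / (f.size - 1)
    c1 = (k1 * dynamic_range) ** 2                       # stabilizing constants
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_f * mu_g + c1) * (2 * cov_fg + c2)
    denominator = (mu_f ** 2 + mu_g ** 2 + c1) * (var_f + var_g + c2)
    return numerator / denominator
```

Comparing an image with itself makes the numerator and the denominator equal, which gives the maximum value of 1.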
V Experimental Evaluation
This section provides simulation results of the natural image recovery process for the uplink data transmission of a massive MIMO communication system. By varying the SNR, the numerical results evaluate the reconstruction quality of the system using the BER, PSNR, and SSIM metrics presented in Section IV. The base station is equipped with antennas, while the user terminal has 4 antennas. We use -QAM to modulate and map the image data onto the corresponding constellation points before transmitting over the wireless channels. In order to scrutinize the communication reliability, the SNR value is considered in the range from [dB] to [dB]. The filtering parameters are selected to attain the best system performance. In particular, at the SNR dB, the standard deviation of the Gaussian filter is and that of the BM3D filter is . The filtering parameter is decreased as the SNR gets higher because fewer residual errors remain. The propagation channels follow the Rayleigh fading distribution and the noise follows a circularly symmetric complex Gaussian distribution.
V-A Visualization Comparison
Fig. 2 shows the reconstructed images at an SNR of 10 [dB]. The images in the first column are the original images, i.e., the ground truth. The second column contains the decoded images after applying the ADMM algorithm. The remaining columns show the refined images obtained with the Gaussian, median, and BM3D filters, respectively. We can observe that the image decoded by the ADMM algorithm still contains a lot of noise and artifacts. This is because the ADMM algorithm considered in this paper decodes the image data based on the channel information only. By applying a low-pass filter to smooth the decoded image, we observe a much better restoration quality. Noise and artifacts are greatly reduced by the Gaussian and median filters, but edges and details are also lost. The more advanced BM3D filter is able to preserve the edges and details of the image, but it cannot completely remove the noise; in particular, the noise in the Monarch image is still clearly visible. Median filters are very good at handling salt-and-pepper noise due to their characteristics. They retain details relatively well, comparable to the BM3D filter, as with the tripod in the Cameraman image. In the following, we investigate the system performance with the different metrics mentioned above.
V-B BER Performance
In Fig. 4, we visualize the simulation results of the average BER over the 12 natural images in the Set12 dataset [17]. The blue line, corresponding to the case where the image is restored without a filter, gives the lowest BER, and it decreases gradually as the SNR of the channel increases; that is, when the channel state gets better, the BER decreases. However, as the SNR increases, the BER of the filtered images stays very high and does not seem to improve, which leads to the conclusion that the filters are not able to restore the recovered image to its original bit-exact version. Instead, they sacrifice image details to increase the visual quality, as can be seen in Fig. 2. This seems contradictory because, in the previous subsection, we noticed that the filters improve the image quality significantly compared with the contaminated image obtained after transmission and recovery by the ADMM algorithm. To resolve this disagreement, recall that the purpose of the ADMM algorithm is to solve the optimization problem and minimize the transmission error. Therefore, any subsequent action by a filter changes the data bits further, resulting in a higher BER. This is why the average BER with a filter is not as good as the BER without a filter. It leads to the conclusion that, for structured data such as digital image signals, the BER is no longer a decisive criterion for evaluating the quality of the recovered data.
V-C PSNR Performance
As mentioned above, the BER is not suitable in this case for assessing the quality of the recovered image. Therefore, we need to use metrics commonly used in image processing. Figure 4 depicts the average PSNR for the different cases mentioned earlier. The results show that, in terms of PSNR, the image quality after filtering is significantly higher than that of the image at the output of the ADMM algorithm. The median filter and the BM3D filter give approximately the same results, but the weak point of the median filter is that it does not have a control parameter, which leads to a drop in image quality at higher channel SNR.
V-D SSIM Performance
Another metric commonly used in image processing to evaluate quality is the SSIM index. Fig. 5 shows the average SSIM of the images in the Set12 dataset. In general, the filtered images give significantly better SSIM results than the unfiltered decoded image, especially in the low and medium SNR regions. The median filter performs best in this region, but as the channel quality gets better, similar to the PSNR, the output image no longer improves. The BM3D filter, being tuned for the highest PSNR, does not perform well in terms of the SSIM metric.
VI Conclusion
This paper investigated content-based image transmission over wireless networks. We focused on image recovery at the BS with a two-stage reconstruction process. First, the BS exploits the channel state information to decode the image data from the received noisy measurements. After that, the semantic information, such as spatial correlation, is explicitly exploited to refine the reconstructed image. We have demonstrated by numerical results that the image content can be effectively exploited to refine the reconstructed images. Moreover, the traditional evaluation metrics may conflict with each other. Consequently, defining a new evaluation metric that covers multiple features of semantic communications is a potential research direction for future work.
References
- [1] T. H. Nguyen, W.-S. Jung, L. T. Tu, T. Van Chien, D. Yoo, and S. Ro, “Performance analysis and optimization of the coverage probability in dual hop LoRa networks with different fading channels,” IEEE Access, vol. 8, pp. 107087–107102, 2020.
- [2] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- [3] E. C. Strinati and S. Barbarossa, “6G networks: Beyond Shannon towards semantic and goal-oriented communications,” 2020. [Online]. Available: https://arxiv.org/abs/2011.14844
- [4] X. Luo, H.-H. Chen, and Q. Guo, “Semantic communications: Overview, open issues, and future research directions,” IEEE Wireless Communications, vol. 29, no. 1, pp. 210–219, 2022.
- [5] T. H. Nguyen, T. Van Chien, H. Q. Ngo, X. N. Tran, and E. Björnson, “Pilot assignment for joint uplink-downlink spectral efficiency enhancement in massive MIMO systems with spatial correlation,” IEEE Transactions on Vehicular Technology, vol. 70, no. 8, pp. 8292–8297, 2021.
- [6] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,” Foundations and Trends® in Signal Processing, vol. 11, pp. 154–655, 2017.
- [7] S. Sun, T. He, and Z. Chen, “Semantic structured image coding framework for multiple intelligent applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3631–3642, 2020.
- [8] Q. Lan, D. Wen, Z. Zhang, Q. Zeng, X. Chen, P. Popovski, and K. Huang, “What is semantic communication? A view on conveying meaning in the era of machine intelligence,” 2021. [Online]. Available: https://arxiv.org/abs/2110.00196
- [9] P. Jiang, C.-K. Wen, S. Jin, and G. Y. Li, “Wireless semantic communications for video conferencing,” 2022. [Online]. Available: https://arxiv.org/abs/2204.07790
- [10] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” 2021. [Online]. Available: https://arxiv.org/abs/2102.12605
- [11] C. Van Trinh, K. Q. Dinh, and B. Jeon, “Edge-preserving block compressive sensing with projected Landweber,” in 2013 20th International Conference on Systems, Signals and Image Processing (IWSSIP). IEEE, 2013, pp. 71–74.
- [12] P. T. K. Chinh, T. Van Chien, T. M. Hoang, and N. T. Hoa, “On the performance of image recovery in massive MIMO communications,” in 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE). IEEE, 2021, pp. 487–491.
- [13] T. Van Chien, K. Q. Dinh, B. Jeon, and M. Burger, “Block compressive sensing of image and video with nonlocal Lagrangian multiplier and patch-based sparse representation,” Signal Processing: Image Communication, vol. 54, pp. 93–106, 2017.
- [14] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
- [15] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
- [16] A. Horé and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in 2010 20th International Conference on Pattern Recognition, 2010, pp. 2366–2369.
- [17] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.