SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer
* Corresponding author.
Abstract
Convolutional Neural Network (CNN) and Transformer have attracted much attention recently for video post-processing (VPP). However, the interaction between CNN and Transformer in existing VPP methods is not fully explored, leading to inefficient communication between the local and global extracted features. In this paper, we explore the interaction between CNN and Transformer in the task of VPP, and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which can blend the deep representations at the channel dimension dynamically. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for Y, U, and V components in the VTM-11.0-NNVC RA configuration.
Index Terms:
Video compression, CNN, Transformer, VVC, Post-processing, Quality enhancement
I Introduction
With the boom in Ultra High Definition (UHD) video, there is a growing demand for video services with higher resolution and lower storage requirements, and efficient video compression has become a hot topic in computer vision. Accordingly, several new video coding standards have been released for video compression and transmission, including Versatile Video Coding (VVC) [2], High Efficiency Video Coding (HEVC) [18], Advanced Video Coding (AVC) [21], and Essential Video Coding (EVC) [16]. Compared to HEVC/H.265, VVC/H.266 better supports high spatial resolution and high dynamic range content. Many novel coding techniques are integrated into the latest VVC standard, which greatly improves coding performance.
To improve video reconstruction quality and reduce noticeable visual artifacts, CNN- and Transformer-based approaches have been widely proposed. Recently, many video post-processing methods [14, 6] have been presented, which can be roughly categorized into three groups: CNN-based methods, Transformer-based methods, and CNN-Transformer fusion-based methods. Specifically, 1) CNN-based methods aim to improve the video compression performance of existing coding modules. Among them, some algorithms [3, 12] focus on residual learning and dense connections, which yields considerable coding gains. Furthermore, to address the issue of content adaptation, a series of algorithms [5, 7] have been proposed to alleviate the gap between training and test data. However, the receptive field of a CNN is limited by its convolution kernel size, which makes it difficult to model global information and limits performance. 2) Transformer-based methods focus on global interaction through the self-attention mechanism, achieving promising performance in computer vision tasks [22]. Specifically, the Swin Transformer [10] has been introduced into video tasks to promote compression performance to a certain extent. Nevertheless, Transformers face the obstacle of being insensitive to position when calculating global attention and of ignoring the details of local features. To cope with the above issues,

3) CNN-Transformer fusion-based methods [13, 9] are devised to take advantage of both CNN and Transformer, thereby enhancing the representation capability of local features and global representations. Nonetheless, the integration strategies of current fusion-based methods remain relatively simple. Thus, the essential issue is to explore the interaction between CNN and Transformer.
To address these issues, in this paper, we explore the interaction between CNN and Transformer in the task of video post-processing, and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains. In the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which can blend the deep representations at the channel dimension dynamically. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality in the VTM-11.0-NNVC RA configuration.
Compared with existing VPP methods, the contributions of this paper are summarized as follows:
1) We propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains.
2) In the spatial domain, a Spatial Attention Fusion Module (SAFM) is presented to collaboratively fuse the local and global representations by the respective attention weights.
3) In the channel domain, a Channel Attention Fusion Module (CAFM) is developed to dynamically blend the deep representations at the channel dimension.
4) Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with an average reduction in bitrate of 5.29%, 12.42%, and 13.09% for Y, U, and V components in the VTM-11.0-NNVC RA configuration.
II THE PROPOSED METHOD
II-A Overview of SC-HVPPNet
As illustrated in Fig. 1, given the lossy video frame $X$ (where $W$ and $H$ are the frame width and height, and $C_{in}$ is the number of input channels) and the corresponding QP value, a convolutional layer is employed to fuse them initially. Meanwhile, the channel dimension is increased and the shallow feature $F_{0}$ of the frame (with $C$ feature channels) is extracted. Then, several Hybrid Fusion Blocks (HFB) perform deep feature extraction on $F_{0}$. Concretely, each HFB first preprocesses the shallow feature using several Residual Blocks (RB) and subsequently, for feature restoration, employs a Hybrid-Attention Fusion Module (HAFM) that considers feature interaction in both the spatial and channel domains. A convolutional layer is then applied to recover the residual map output by these HFBs. Ultimately, the restored residual is added to the original lossy video frame through a skip connection to obtain the final reconstructed video frame $\hat{X}$.
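To make the data flow concrete, here is a minimal PyTorch sketch of this pipeline. The class and argument names (SCHVPPNetSketch, HybridFusionBlock, feat_ch, num_hfb) are placeholders, the HFB internals are reduced to a residual block, and the QP is injected as a constant extra input plane; this is a sketch of the described structure, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class HybridFusionBlock(nn.Module):
    """Placeholder HFB: residual convolutions stand in for the RBs; the
    HAFM internals are sketched separately in the next subsections."""

    def __init__(self, ch):
        super().__init__()
        self.rb = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.rb(x)  # HAFM omitted here for brevity


class SCHVPPNetSketch(nn.Module):
    """Sketch of the SC-HVPPNet pipeline: shallow conv fusion of the lossy
    frame and its QP plane, stacked HFBs, residual reconstruction, and a
    global skip connection."""

    def __init__(self, in_ch=3, feat_ch=64, num_hfb=4):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch + 1, feat_ch, kernel_size=3, padding=1)
        self.hfbs = nn.Sequential(*[HybridFusionBlock(feat_ch) for _ in range(num_hfb)])
        self.recon = nn.Conv2d(feat_ch, in_ch, kernel_size=3, padding=1)

    def forward(self, lossy_frame, qp):
        n, _, h, w = lossy_frame.shape
        # Broadcast the scalar QP to a constant plane (an assumed injection scheme).
        qp_map = torch.full((n, 1, h, w), float(qp), device=lossy_frame.device)
        f0 = self.shallow(torch.cat([lossy_frame, qp_map], dim=1))
        residual = self.recon(self.hfbs(f0))
        return lossy_frame + residual  # global skip connection
```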
II-B The Network Architecture
To clearly describe, we first introduce the individual feature interactions within the spatial and channel domains, followed by the cooperation between spatial and channel features. Finally, we detail the specifics of the network architecture.
Feature interaction in the spatial domain: In this regard, two modules are involved: the Feature Extraction Module (FEM) and the Spatial Attention Fusion Module (SAFM). The FEM is responsible for initially processing $F_{0}$, from which the local representation $F_{L}$ and the global representation $F_{G}$ are extracted. The specific structure of the FEM is discussed later in this section. As for the SAFM, the polarized attention and data enhancement concepts of [8] are employed to explore the weights of $F_{L}$ and $F_{G}$ in the spatial domain (denoted as $W_{L}^{s}$ and $W_{G}^{s}$). As shown in Fig. 2, SAFM is a three-branch fusion network based on the non-local self-similar structure. Two convolutional layers are utilized to obtain the query feature maps $Q_{L}$ and $Q_{G}$ from $F_{L}$ and $F_{G}$. Moreover, to obtain the key feature map $K$, $F_{L}$ and $F_{G}$ are first coarsely aggregated into a fused feature, which is then passed through a convolutional layer and flattened for the subsequent weight calculations. After that, two Polarized Self-Attention Blocks (PSAB) are applied to reduce the computational complexity of the spatial similarity matrix, each applying a global pooling to reduce the dimension of $Q_{L}$ and $Q_{G}$. Then, we reshape the pooled features and employ a Softmax function to intensify these two query feature maps. Crucially, we calculate the correlation of $Q_{L}$ and $Q_{G}$ with $K$ correspondingly, which represents the weight of $F_{L}$ and $F_{G}$ in the fused feature. Finally, we apply the View and Softmax2D functions to normalize the two obtained weights, yielding the final weights $W_{L}^{s}$ and $W_{G}^{s}$. Consequently, the process of SAFM can be expressed in (1):
$W_{L}^{s} = \sigma_{2D}\big(V\big(\mathrm{PSAB}(Q_{L}) \cdot \Phi(K)\big)\big), \quad W_{G}^{s} = \sigma_{2D}\big(V\big(\mathrm{PSAB}(Q_{G}) \cdot \Phi(K)\big)\big) \qquad (1)$
where $\sigma_{2D}(\cdot)$ is the Softmax2D function, $V(\cdot)$ is the View function, $\Phi(\cdot)$ is the Flatten function, "$\cdot$" is the matrix dot-product operation, and $\mathrm{PSAB}(\cdot)$ denotes the pooled and Softmax-intensified query produced by the Polarized Self-Attention Block.
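Below is a hedged PyTorch sketch of the spatial-weight computation in Eq. (1), using the reconstructed notation above. The 1×1 kernel sizes, the use of addition for the coarse aggregation, and the name SAFMSketch are assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAFMSketch(nn.Module):
    """Sketch of the Spatial Attention Fusion Module: one spatial weight map
    per branch (local / global) computed against a shared key."""

    def __init__(self, ch):
        super().__init__()
        self.q_local = nn.Conv2d(ch, ch, kernel_size=1)   # query from F_L
        self.q_global = nn.Conv2d(ch, ch, kernel_size=1)  # query from F_G
        self.k_conv = nn.Conv2d(ch, ch, kernel_size=1)    # key from the fused feature

    def _spatial_weight(self, q, k_flat, h, w):
        # PSAB-style reduction: pool the query to a channel descriptor,
        # then sharpen it with a softmax over channels.
        q_vec = F.adaptive_avg_pool2d(q, 1).flatten(1)      # (N, C)
        q_vec = torch.softmax(q_vec, dim=1).unsqueeze(1)    # (N, 1, C)
        corr = torch.bmm(q_vec, k_flat)                     # (N, 1, H*W) correlation with K
        weight = torch.softmax(corr, dim=-1)                # Softmax2D over spatial positions
        return weight.view(-1, 1, h, w)                     # View back to a weight map

    def forward(self, f_local, f_global):
        n, c, h, w = f_local.shape
        k_flat = self.k_conv(f_local + f_global).flatten(2)  # coarse fusion -> key (N, C, H*W)
        w_l = self._spatial_weight(self.q_local(f_local), k_flat, h, w)
        w_g = self._spatial_weight(self.q_global(f_global), k_flat, h, w)
        return w_l, w_g   # spatial weights for F_L and F_G
```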
Feature interaction in the channel domain: Similar to the spatial domain, the feature interaction in the channel domain also contains the same FEM and a corresponding Channel Attention Fusion Module (CAFM). The CAFM is applied to excavate the weights of $F_{L}$ and $F_{G}$ in the channel domain (denoted as $W_{L}^{c}$ and $W_{G}^{c}$). As shown in Fig. 2, an initial fusion feature is first aggregated from $F_{L}$ and $F_{G}$. To shift information from the recovery domain to the weight domain, this fused feature is passed to a global average pooling followed by a fully connected layer. Finally, after passing through a separate fully connected layer and a Softmax function each, $W_{L}^{c}$ and $W_{G}^{c}$ are computed. Therefore, the mechanism of CAFM is formulated in (2):
$W_{i}^{c} = \sigma_{1D}\big(FC_{i}\big(FC\big(GP(F_{L} + F_{G})\big)\big)\big), \quad i \in \{L, G\} \qquad (2)$
where $\sigma_{1D}(\cdot)$ represents the Softmax1D function, $FC(\cdot)$ and $FC_{i}(\cdot)$ represent fully connected layers, and $GP(\cdot)$ performs the GlobalPool operation.
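A corresponding sketch of the CAFM weight computation in Eq. (2) is given below; the shared squeeze layer, the reduction ratio, and the use of addition for the initial aggregation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CAFMSketch(nn.Module):
    """Sketch of the Channel Attention Fusion Module: squeeze the fused
    feature to a channel descriptor, then predict one channel weight vector
    per branch with a Softmax over channels."""

    def __init__(self, ch, reduction=4):
        super().__init__()
        self.squeeze = nn.Linear(ch, ch // reduction)    # shared FC after global pooling
        self.fc_local = nn.Linear(ch // reduction, ch)   # branch FC for F_L
        self.fc_global = nn.Linear(ch // reduction, ch)  # branch FC for F_G

    def forward(self, f_local, f_global):
        n, c, _, _ = f_local.shape
        # GlobalPool: move information from the recovery domain to the weight domain.
        z = F.adaptive_avg_pool2d(f_local + f_global, 1).flatten(1)  # (N, C)
        z = self.squeeze(z)
        # One channel weight vector per branch (Softmax1D over channels).
        w_l = torch.softmax(self.fc_local(z), dim=1).view(n, c, 1, 1)
        w_g = torch.softmax(self.fc_global(z), dim=1).view(n, c, 1, 1)
        return w_l, w_g
```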
Feature interaction of the spatial and channel domains: As enclosed by the orange dashed line in Fig. 1, $F_{L}$ and $F_{G}$ are sequentially weighted by the channel fusion weights and the spatial fusion weights, ultimately realizing feature interaction in both the spatial and channel domains. Notably, the blue and green lines represent the local and global information flows, respectively. Hence, the deep fusion feature $F_{fuse}$ can be formulated in (3):
$W_{L} = W_{L}^{s} \otimes W_{L}^{c}, \quad W_{G} = W_{G}^{s} \otimes W_{G}^{c}$
$F_{fuse} = W_{L} \otimes F_{L} + W_{G} \otimes F_{G} \qquad (3)$
where $W_{L}$ and $W_{G}$ are the deep fusion weights that take both the spatial and channel domains into account, and "$\otimes$" is the multiplication operation.
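The helper below shows one plausible reading of Eq. (3): each branch is modulated by the product of its spatial and channel weights before the two branches are summed. It reuses the SAFMSketch and CAFMSketch modules defined above and relies on broadcasting between the (N, 1, H, W) spatial and (N, C, 1, 1) channel weights.

```python
def hafm_fuse(f_local, f_global, safm, cafm):
    """Hybrid-attention fusion sketch (Eq. (3)): combine per-branch spatial and
    channel weights, then re-weight and sum the local and global features."""
    w_l_s, w_g_s = safm(f_local, f_global)   # spatial weights, shape (N, 1, H, W)
    w_l_c, w_g_c = cafm(f_local, f_global)   # channel weights, shape (N, C, 1, 1)
    w_l = w_l_s * w_l_c                      # joint spatial-channel weight for F_L
    w_g = w_g_s * w_g_c                      # joint spatial-channel weight for F_G
    return w_l * f_local + w_g * f_global    # deep fusion feature F_fuse
```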
Other network details: The RB in the HFB involves a skip connection around two convolutional layers with one PReLU. Moreover, the FEM in both the spatial and channel domains is composed of a Local Feature Extraction Module (LFEM) based on CNN and a Global Feature Extraction Module (GFEM) based on the Swin Transformer. As detailed in Fig. 1, LFEM applies several cascaded convolutional layers with PReLU. To bridge the gap between the 3D feature maps of the CNN and the 2D token format of the Transformer, GFEM uses a strided convolutional layer to downsample the input feature, and the resulting feature is flattened into 2D tokens for the subsequent Swin Blocks. Specifically, the Swin Blocks use MSA and MLP to compute the global similarity and the nonlinear transformation. To align the shape of the global feature with that of the local feature, we reshape the 2D tokens back to a 3D feature map with the View function, and a PixelShuffle layer is then utilized to upsample the feature to the original resolution. Finally, $F_{G}$ is obtained.
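A sketch of this FEM is shown below. For brevity it substitutes a plain nn.TransformerEncoderLayer for the Swin Blocks (so there is no shifted-window partitioning), and the stride-2 downsampling factor, layer widths, and head count are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn


class FEMSketch(nn.Module):
    """Sketch of the Feature Extraction Module: a CNN branch (LFEM) for the
    local feature and a Transformer branch (GFEM) that downsamples, flattens
    to tokens, applies attention blocks, and returns to a 3D feature map."""

    def __init__(self, ch=64, num_layers=3, num_blocks=2, heads=4):
        super().__init__()
        # LFEM: cascaded conv + PReLU layers.
        lfem = []
        for _ in range(num_layers):
            lfem += [nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU()]
        self.lfem = nn.Sequential(*lfem)
        # GFEM: stride-2 conv shrinks the token grid; generic attention blocks
        # stand in for Swin Blocks; PixelShuffle restores the resolution.
        self.down = nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=ch, nhead=heads,
                                       dim_feedforward=2 * ch, batch_first=True)
            for _ in range(num_blocks)
        ])
        self.up = nn.Sequential(nn.Conv2d(ch, 4 * ch, kernel_size=1), nn.PixelShuffle(2))

    def forward(self, f0):
        f_local = self.lfem(f0)                       # 3D CNN feature (N, C, H, W)
        x = self.down(f0)                             # (N, C, H/2, W/2)
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # 2D token format (N, H*W/4, C)
        for blk in self.blocks:
            tokens = blk(tokens)                      # MSA + MLP
        x = tokens.transpose(1, 2).view(n, c, h, w)   # View back to a 3D feature
        f_global = self.up(x)                         # upsample to (N, C, H, W)
        return f_local, f_global
```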

Class | Y-PSNR | U-PSNR | V-PSNR | Y-MSIM | U-MSIM | V-MSIM
---|---|---|---|---|---|---
Class A1 | -5.10% | -8.66% | -9.46% | -5.94% | -12.83% | -13.15% |
Class A2 | -6.12% | -13.89% | -8.66% | -5.72% | -14.16% | -6.91% |
Class B | -3.00% | -11.85% | -12.16% | -4.64% | -15.77% | -15.31% |
Class C | -5.49% | -16.90% | -17.57% | -4.93% | -17.71% | -16.74% |
Class D | -8.66% | -18.74% | -21.61% | -4.88% | -17.83% | -17.82% |
Overall | -5.54% | -14.18% | -14.31% | -5.13% | -15.89% | -14.47% |
Class | Sequence | Y (Low QP 22-37) | U (Low QP 22-37) | V (Low QP 22-37) | Y (High QP 27-42) | U (High QP 27-42) | V (High QP 27-42)
---|---|---|---|---|---|---|---
A1 | Tango2 | -5.83% | -23.45% | -16.19% | -6.31% | -24.55% | -16.53%
 | FoodMarket4 | -4.68% | -5.51% | -5.17% | -5.36% | -6.87% | -6.72%
 | Campfire | -3.50% | 6.28% | -4.86% | -4.82% | 0.62% | -7.61%
A2 | CatRobot | -7.12% | -19.21% | -13.93% | -7.37% | -21.58% | -15.29%
 | DaylightRoad2 | -9.36% | -18.12% | -9.85% | -8.46% | -21.31% | -10.23%
 | ParkRunning3 | -2.14% | 0.36% | 0.46% | -2.51% | -2.71% | -2.97%
B | MarketPlace | -4.25% | -19.80% | -16.81% | -4.50% | -23.14% | -19.95%
 | RitualDance | -5.60% | -12.32% | -17.50% | -5.49% | -14.46% | -20.09%
 | Cactus | -1.81% | 5.70% | 1.90% | -3.71% | -3.78% | -4.55%
 | BasketballDrive | -4.68% | -11.66% | -13.23% | -5.17% | -13.56% | -15.02%
 | BQTerrace | 5.23% | -9.75% | -5.47% | -0.73% | -12.68% | -11.02%
C | BasketballDrill | -5.33% | -15.65% | -14.58% | -5.41% | -20.41% | -18.15%
 | BQMall | -5.93% | -16.98% | -19.81% | -6.56% | -20.93% | -23.52%
 | PartyScene | -7.57% | -11.56% | -11.93% | -7.71% | -16.19% | -15.59%
 | RaceHorses | -2.52% | -14.99% | -17.96% | -3.88% | -20.36% | -21.83%
D | BasketballPass | -8.20% | -22.08% | -23.58% | -8.47% | -25.90% | -24.94%
 | BQSquare | -14.05% | -10.95% | -22.96% | -13.88% | -12.35% | -25.29%
 | BlowingBubbles | -6.54% | -13.82% | -12.98% | -6.89% | -17.12% | -15.55%
 | RaceHorses | -6.57% | -22.50% | -24.23% | -6.46% | -27.21% | -26.98%
Average | | -5.29% | -12.42% | -13.09% | -5.98% | -16.03% | -15.89%
II-C Loss Function
We employ the Charbonnier loss [4] as the loss function. Given the original lossless frame $X^{gt}$ and the final recovered frame $\hat{X}$, the loss is defined in (4):
$\mathcal{L}_{char} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\left\|\hat{X}_{i} - X_{i}^{gt}\right\|^{2} + \varepsilon^{2}} \qquad (4)$
where $N$ is the number of samples used in each training iteration and $\varepsilon$ is a small positive constant. Considering that the human eye is less sensitive to color details and more focused on brightness, the Y, U, and V components are weighted separately, and the final loss function is given in (5):
$\mathcal{L} = w_{Y}\mathcal{L}_{Y} + w_{U}\mathcal{L}_{U} + w_{V}\mathcal{L}_{V} \qquad (5)$
with the specific weighting ratio being $w_{Y} : w_{U} : w_{V} = 10 : 1 : 1$.
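A minimal sketch of the loss in Eqs. (4) and (5) is shown below, assuming per-component Charbonnier terms combined with the 10:1:1 weights; the value of the constant eps and whether the weights are normalized are not specified in the text and are chosen here for illustration only.

```python
import torch


def charbonnier(pred, target, eps=1e-6):
    """Charbonnier loss sketch (Eq. (4)): a differentiable, L1-like penalty."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()


def yuv_weighted_loss(pred_yuv, target_yuv, weights=(10.0, 1.0, 1.0)):
    """Eq. (5) sketch: weight the per-component Charbonnier losses with
    Y : U : V = 10 : 1 : 1 (tensors assumed to be (N, 3, H, W) in YUV order)."""
    total = 0.0
    for c, w in enumerate(weights):
        total = total + w * charbonnier(pred_yuv[:, c], target_yuv[:, c])
    return total
```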
III EXPERIMENTS AND RESULTS
III-A Implementation and Training Details
In this paper, we select 191 video sequences in YCbCr 4:2:0 format at each resolution of BVI-DVC [11] for training, in accordance with the common test conditions (CTC). All sequences are encoded with VTM-11.0-NNVC following the JVET-CTC [1] under the RA configuration with five QP values: 22, 27, 32, 37, and 42.

After that, the output bitstream is decoded to obtain the compressed video sequences. Our SC-HVPPNet is trained on an NVIDIA GeForce RTX 3060 GPU with PyTorch. We train the network for 31 epochs using the Adam optimizer, and the learning rate is reduced by half at fixed iteration intervals during training. We set the number of CNN layers in LFEM to 3.
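A hedged training-loop skeleton consistent with this setup is given below; the learning rate, Adam betas, halving interval, and the data loader (train_loader) are placeholders, and the model and loss reuse the sketches above rather than the authors' exact implementation.

```python
import torch

# Assumed hyperparameters: the base learning rate, betas, and halving
# interval below are illustrative, not values stated in the paper.
model = SCHVPPNetSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)

for epoch in range(31):
    for lossy, target, qp in train_loader:   # hypothetical data loader
        optimizer.zero_grad()
        pred = model(lossy, qp)
        loss = yuv_weighted_loss(pred, target)
        loss.backward()
        optimizer.step()
        scheduler.step()                     # per-iteration learning-rate schedule
```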
Method | A1 | A2 | B | C | D | Overall |
---|---|---|---|---|---|---|
VTM-PP (VTM-4.0) [23] | -2.41% | -4.22% | -2.57% | -3.89% | -5.80% | -3.76% |
JVET-O0079 (VTM-5.0) [19] | -0.87% | -1.68% | -1.47% | -3.34% | -4.97% | -2.47% |
JVET-T0079 (VTM-10.0) [20] | -2.86% | -2.98% | -2.92% | -2.96% | -3.48% | -3.04% |
WCDANN (VTM-11.0-NNVC) [24] | -2.23% | -2.70% | -2.73% | -3.43% | -4.76% | -2.81% |
JVET-Z0144 (VTM-11.0-NNVC) [15] | -2.14% | -2.56% | -2.54% | -3.33% | -4.54% | -3.07% |
JVET-Z0082 (VTM-11.0-NNVC) [17] | -6.88% | -4.52% | -5.56% | -3.94% | -4.96% | -5.14% |
SC-HVPPNet (VTM-11.0-NNVC) (ours) | -4.67% | -6.21% | -2.22% | -5.34% | -8.84% | -5.29% |
Method | A1 | A2 | B | C | D | Overall |
---|---|---|---|---|---|---|
S-HVPPNet | -2.17% | -2.96% | -1.63% | -3.44% | -6.61% | -3.35% |
C-HVPPNet | -0.63% | -1.51% | -0.03% | -1.89% | -4.17% | -1.62% |
SC-HVPPNet | -4.67% | -6.21% | -2.22% | -5.34% | -8.84% | -5.29% |
Method | SQ-HVPPNet | P-HVPPNet | SC-HVPPNet
---|---|---|---
Overall | -5.02% | -4.87% | -5.29%
Note that only one network model is trained for all QP modes. During training, the video frames are randomly segmented into image blocks and converted to YCbCr 4:4:4 format. For testing, we use the standard test sequences from the JVET-CTC.
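For illustration, a possible data-preparation sketch is shown below: upsampling the 4:2:0 chroma planes to 4:4:4 and cropping aligned training blocks. The interpolation filter and the block size are assumptions; the paper does not state them here.

```python
import torch
import torch.nn.functional as F


def yuv420_to_444(y, u, v):
    """Upsample 4:2:0 chroma planes to luma resolution (bilinear here; the
    actual resampling filter used in the paper is not specified)."""
    size = y.shape[-2:]
    u_full = F.interpolate(u, size=size, mode="bilinear", align_corners=False)
    v_full = F.interpolate(v, size=size, mode="bilinear", align_corners=False)
    return torch.cat([y, u_full, v_full], dim=1)  # (N, 3, H, W) in YUV order


def random_block(frame, gt, block=128):
    """Randomly crop aligned training blocks from the compressed frame and its
    lossless reference (the block size 128 is an assumption)."""
    _, _, h, w = frame.shape
    top = torch.randint(0, h - block + 1, (1,)).item()
    left = torch.randint(0, w - block + 1, (1,)).item()
    return (frame[..., top:top + block, left:left + block],
            gt[..., top:top + block, left:left + block])
```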
III-B Comparison with Other VPP Methods
According to the JVET-CTC, Table I illustrates the overall bitrate savings of the proposed SC-HVPPNet in terms of PSNR and MS-SSIM (MSIM) compared to VTM-11.0-NNVC under five QP values (22, 27, 32, 37, 42). It can be observed that SC-HVPPNet greatly reduces the required bitrate at the same quality, with average BD-rates of -5.54%, -14.18%, and -14.31% on PSNR for the Y, U, and V components, respectively; the gains on the chroma components are much larger than those on the luma component. As shown in Table II, the coding gains in the higher QP range consistently outperform those in the lower QP range, especially for the chroma components, where the gains at higher QPs are 3.61% and 2.80% greater than those in the low QP range. As depicted in Fig. 3, the ringing effect around the hand fringe areas in RaceHorses and the blocking artifacts and uneven edge pixels around the head in BasketballDrill are visibly improved. Moreover, we compare SC-HVPPNet with six state-of-the-art VPP methods developed for the VVC RA configuration: VTM-PP [23], JVET-O0079 [19], JVET-T0079 [20], WCDANN [24], JVET-Z0144 [15], and JVET-Z0082 [17]. As shown in Table III, the proposed SC-HVPPNet offers the best overall performance compared with the other six methods, especially on lower-resolution videos such as Class C and Class D.
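The BD-rate numbers above follow the standard Bjøntegaard metric; a compact NumPy sketch of that computation (cubic fit of log-rate against PSNR, integrated over the overlapping quality range) is included below for reference. It is a generic implementation, not the exact script used for the reported results.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate sketch: average bitrate change (%) at equal
    quality, computed from four or five (rate, PSNR) points per codec."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)        # cubic fit: log-rate vs. PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))    # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0       # negative values mean bitrate savings
```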
Furthermore, as SC-HVPPNet trains a single model for all sequences and codec scenarios, it eliminates the requirement for model selection or extra coding parameters.
III-C Ablation Study
To analyze the impact of the interaction manner between spatial and channel features, we explore three fusion manners for the spatial and channel modules of SC-HVPPNet: a) sequential fusion, b) parallel fusion, and c) hybrid-interaction fusion, as shown in Fig. 4. The spatial and channel modules correspond to SAFM and CAFM in Fig. 1, respectively. In the sequential approach, the fused features of $F_{L}$ and $F_{G}$ pass through the channel and spatial modules in order, whereas in the parallel approach the spatial and channel attentions are computed independently and then fused. We argue that both spatial and channel attention are crucial for restoring local and global features. Hence, we fuse $F_{L}$ and $F_{G}$ in a more fine-grained hybrid-interaction manner across the spatial and channel domains, which is deployed in our proposed SC-HVPPNet. Moreover, we integrate the first two fusion methods into SC-HVPPNet, named SQ-HVPPNet and P-HVPPNet, respectively; a sketch contrasting the three orderings is given after this paragraph. The overall BD-rate results for PSNR on the Y component in the low QP range are shown in Table V. It can be observed that SC-HVPPNet offers the best performance compared with SQ-HVPPNet and P-HVPPNet. In addition, we retrain SC-HVPPNet using pure spatial (SAFM) and pure channel (CAFM) attention fusion, respectively (denoted as S-HVPPNet and C-HVPPNet). As shown in Table IV, SC-HVPPNet increases the gain by an overall BD-rate of 1.93% and 3.66% compared with S-HVPPNet and C-HVPPNet. We conclude that SC-HVPPNet performs attractively by combining the spatial and channel attention mechanisms.
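The exact wiring of the sequential and parallel variants is not spelled out in the text, so the sketch below is only one plausible reading of Fig. 4, reusing the SAFMSketch and CAFMSketch modules from Section II; only the hybrid variant corresponds to Eq. (3).

```python
def sequential_fusion(f_l, f_g, safm, cafm):
    """SQ-HVPPNet-style ordering (illustrative): fuse first, then apply the
    channel and spatial attention modules in sequence."""
    fused = f_l + f_g
    w_c, _ = cafm(fused, fused)
    fused = fused * w_c
    w_s, _ = safm(fused, fused)
    return fused * w_s


def parallel_fusion(f_l, f_g, safm, cafm):
    """P-HVPPNet-style ordering (illustrative): compute spatial and channel
    attention independently on the fused feature, then merge the two results."""
    fused = f_l + f_g
    w_s, _ = safm(fused, fused)
    w_c, _ = cafm(fused, fused)
    return fused * w_s + fused * w_c


def hybrid_fusion(f_l, f_g, safm, cafm):
    """SC-HVPPNet ordering (Eq. (3)): per-branch spatial and channel weights
    are combined before re-weighting the local and global features."""
    w_l_s, w_g_s = safm(f_l, f_g)
    w_l_c, w_g_c = cafm(f_l, f_g)
    return (w_l_s * w_l_c) * f_l + (w_g_s * w_g_c) * f_g
```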
IV CONCLUSION
In this paper, we explore the interaction between CNN and Transformer in the task of video post-processing, and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains. The designed hybrid-attention structures enhance both local and global modeling capabilities in an efficient feature interaction way. Extensive experiments show that our SC-HVPPNet notably boosts video restoration quality in the VTM-11.0-NNVC RA configuration.
Acknowledgment
This work was supported in part by the Jiangsu Provincial Science and Technology Plan (BE2021086), the National Natural Science Foundation of China (NSFC) under grants 62302128 and 62076080, and the Postdoctoral Science Foundation of Heilongjiang Province of China (LBH-Z22175).
References
- [1] Jill Boyce, Karsten Suehring, Xiang Li, and Vadim Seregin. Jvet-j1010: Jvet common test conditions and software reference configurations, 2018.
- [2] B. Bross, J. Chen, S. Liu, and Y.-K. Wang. Versatile video coding (draft 10). in JVET-S2001. ITU-T ISO/IEC, 2020.
- [3] Hang Cao, Xiaowen Liu, Yida Wang, Yuting Li, and Weimin Lei. Enhanced down/up-sampling-based video coding using the residual compensation. In 2020 5th International Conference on Computer and Communication Systems (ICCCS), pages 286–290, 2020.
- [4] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, volume 2, pages 168–172 vol.2, 1994.
- [5] Chen Hui, Shengping Zhang, Wenxue Cui, Shaohui Liu, Feng Jiang, and Debin Zhao. Rate-adaptive neural network for image compressive sensing. IEEE Transactions on Multimedia, pages 1–15, 2023.
- [6] Jingyun Liang, Jie Cao, Guolei Sun, K. Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1833–1844, 2021.
- [7] ChihHsuan Lin, YiHsin Chen, and WenHsiao Peng. Content-adaptive motion rate adaption for learned video compression. In 2022 Picture Coding Symposium (PCS), pages 163–167, 2022.
- [8] Huajun Liu, Fuqiang Liu, Xinyi Fan, and Dong Huang. Polarized self-attention: Towards high-quality pixel-wise regression. ArXiv, abs/2107.00782, 2021.
- [9] TsungJung Liu, PoWei Chen, and KuanHsien Liu. Lightweight image inpainting by stripe window transformer with joint attention to cnn. 2023.
- [10] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
- [11] Di Ma, Fan Zhang, and David R. Bull. Bvi-dvc: A training database for deep video compression. IEEE Transactions on Multimedia, 24:3847–3858, 2022.
- [12] P. Merkle, M. Winken, J. Pfaff, H. Schwarz, D. Marpe, and T. Wiegand. Intra-inter prediction for versatile video coding using a residual convolutional neural network. In 2022 IEEE International Conference on Image Processing (ICIP), pages 1711–1715, 2022.
- [13] Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin Jiao, and Qixiang Ye. Conformer: Local features coupling global representations for visual recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 357–366, 2021.
- [14] Zhanyuan Qi, Cheolkon Jung, Yang Liu, and Ming Li. Cnn-based post-processing filter for video compression with multi-scale feature representation. In 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), pages 1–5, 2022.
- [15] Zhanyuan Qi, Cheolkon Jung, Yang Liu, and Ming Li. Cnn-based post-processing filter for video compression with multi-scale feature representation. in JVET-Z0144, Teleconference, 2022.
- [16] Jonatan Samuelsson, Kiho Choi, Jianle Chen, and Dmytro Rusanovskyy. Mpeg-5 part 1: Essential video coding. SMPTE Motion Imaging Journal, 129(7):10–16, 2020.
- [17] M. Santamaria, R. Yang, F. Cricri, J. Lainema, R.G. Youvalari, H. Zhang, G. Rangu, H.R. Tavakoli, H. Afrabandpey, and M.M. Hannuksela. Content-adaptive neural network post-filter. in JVET-Z0082, Teleconference, 2022.
- [18] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
- [19] MingZe Wang, Shuai Wan, Hao Gong, and MingYang Ma. Attention-based dual-scale cnn in-loop filter for versatile video coding. IEEE Access, 7:145214–145226, 2019.
- [20] Yonghua Wang, Jingchi Zhang, Zhengang Li, Xing Zeng, Zhen Zhang, Diankai Zhang, Yunlin Long, and Ning Wang. Neural network-based in-loop filter for clic 2022. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1773–1776, 2022.
- [21] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 2003.
- [22] Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Zhongjiang He, and Peng Gao. Referred by multi-modality: A unified temporal transformer for video object segmentation. 2023.
- [23] Fan Zhang, Chen Feng, and David R. Bull. Enhancing vvc through cnn-based post-processing. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2020.
- [24] Hao Zhang, Cheolkon Jung, Dan Zou, and Ming Li. Wcdann: A lightweight cnn post-processing filter for vvc-based video compression. IEEE Access, 11:83400–83413, 2023.