Email: mhshe@smail.nju.edu.cn, wdmao@smail.nju.edu.cn, shihh@smail.nju.edu.cn, zfwang@nju.edu.cn
S2R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution
Abstract
Nowadays, deep learning based methods have demonstrated impressive performance on ideal super-resolution (SR) datasets, but most of these methods suffer dramatic performance drops when directly applied to real-world SR reconstruction tasks with unpredictable blur kernels. To tackle this issue, blind SR methods have been proposed to improve the visual results under random blur kernels, yet they in turn produce unsatisfactory reconstructions on ideal low-resolution images. In this paper, we propose a double-win framework for ideal and blind SR tasks, named S2R, which includes a light-weight transformer-based SR model (S2R transformer) and a novel coarse-to-fine training strategy, and achieves excellent visual results under both ideal and random fuzzy conditions. At the algorithm level, the S2R transformer smartly combines several efficient and light-weight blocks to enhance the representation ability of the extracted features with a relatively low number of parameters. For the training strategy, a coarse-level learning process is first performed to improve the generalization of the network with the help of a large-scale external dataset, and then a fast fine-tune process is developed to transfer the pre-trained model to real-world SR tasks by mining the internal features of the image. Experimental results show that the proposed S2R outperforms other single-image SR models under ideal SR conditions with only 578K parameters. Meanwhile, it achieves better visual results than regular blind SR models under blind fuzzy conditions with only 10 gradient updates, improving convergence speed by 300 times and significantly accelerating the transfer-learning process in real-world situations. Code is available at https://github.com/berumotto-vermouth/S2R.
This work was supported in part by the National Key R&D Program of China under Grant 2022YFB4400604.
Corresponding authors: Wendong Mao and Zhongfeng Wang.
Keywords: Super-Resolution, Image Processing, Blind Super-Resolution, Transformer
1 Introduction
Super-resolution (SR) aims to increase the resolution of low-quality images and enhance their clarity. As a fundamental low-level vision task, single image super-resolution (SISR), which aims to recover plausible high-resolution (HR) images from their low-resolution (LR) counterparts, has attracted increasing attention. With the remarkable success of convolutional neural networks (CNNs), various deep learning based methods with different network architectures and training strategies have been proposed for SISR and achieve prominent visual results. Most of them are optimized on large-scale external training datasets, where HR images are downsampled with a fixed bicubic operation to obtain LR images and construct paired training data. To imitate low-resolution images under ideal conditions, the LR images in regular tasks are generated by down-sampling under the noise-free bicubic condition. In this way, several previous deep learning based methods [23, 41, 8, 30, 20] show excellent performance under ideal conditions.
In real-world situations, noise and blur are inevitable and depend on the actual equipment, such as the camera lens and shooting conditions, which often differ from the ideal ones, and the resulting blur kernels are unpredictable in most cases. Thus, to imitate the actual degradation of real-world images, most blind SR methods [2, 12, 35] assume that the LR input image is obtained by down-sampling the corresponding HR image with an unpredictable blur kernel. As revealed in [9], learning-based methods suffer severe performance drops when the blur kernels in the test phase and the training phase are inconsistent. This kind of kernel mismatch introduces undesired artifacts into the output images. Thus, the problem of unpredictable blur kernels, also known as blind SR, has limited the application of some deep learning based SR methods in real-world situations. To address this limitation, several novel methods have been proposed. For instance, some blind SR methods [2, 12, 35] are model-based; they introduce prior knowledge into the deep learning area and usually involve complicated optimization procedures. These methods do not calculate the SR kernel directly, but assume that SR networks are robust and transferable to variations in the downsampling kernels. [29] exploited the recurrence of small image patches across scales of a single image to estimate the unknown SR kernel directly from the LR input image. However, it fails when the SR scale is larger than 2, and the runtime is very long. ZSSR [32] does not adopt the regular training process on external datasets and instead trains from scratch for each test image, so that the image-specific model pays more attention to the information inside the picture. Nevertheless, the runtime is greatly increased, limiting its application in real scenes. The current challenge is that none of the above SR models performs well on both SISR and blind SR tasks simultaneously; it is therefore urgent to propose an SR method that solves this challenge.
In this paper, we propose a novel framework, S2R. Inspired by the great success of transformers in computer vision tasks, a light-weight transformer-based model is proposed to achieve excellent performance under ideal LR conditions, and a novel coarse-to-fine training strategy is designed to boost scalability, guaranteeing real-world applicability and achieving fast convergence with extremely few gradient updates. The detailed contributions are as follows:
• To achieve excellent performance under both ideal and blind SR conditions, we propose a new framework, S2R, containing a light-weight transformer-based model that achieves excellent visual results under ideal conditions, and a novel coarse-to-fine training strategy that extends the proposed model to real-world applications.
• For the network architecture, we propose a light-weight transformer-based model for super-resolution, which enhances the representational power by combining several light-weight and efficient blocks, achieving good performance on ideal SISR tasks.
• To boost the scalability of the proposed model in real-world situations, we further propose a new coarse-to-fine training strategy: the proposed model is pre-trained on a large-scale external dataset, and then transfer learning is performed on a new internal real-world dataset to take advantage of both external and internal learning, realizing better performance and fast adaptation for real-world applications.
• The experimental results show the superiority of the proposed framework. For SISR tasks, compared to state-of-the-art (SOTA) transformer-based models, our method achieves comparable performance with the minimum number of parameters. For blind SR tasks, our method achieves better performance and clearer outputs than other blind SR models. Moreover, the proposed framework reduces the number of backpropagation gradient updates by 300 times.
2 Related Work
2.1 Single Image Super Resolution
SISR is based on the image degradation model as
$$I_{LR} = (I_{HR} \otimes k_s)\downarrow_s, \qquad (1)$$

where $I_{HR}$, $I_{LR}$, $k_s$, and $s$ denote the HR image, the LR image, the SR kernel, and the scaling factor, respectively, and $\downarrow_s$ denotes down-sampling with scaling factor $s$.

Recently, CNN-based methods [8, 30, 20, 18, 33] have demonstrated impressive performance in SR tasks. SRGAN [21] first introduced residual blocks into SR networks. EDSR [24] also used a deep residual network to train the SR model but removed the unnecessary batch normalization layers in the residual blocks. Zhang et al. [42] achieved better performance than EDSR by introducing channel attention into the residual block to form a very deep network. Haris et al. [11] proposed deep back-projection networks that exploit iterative up- and down-sampling layers, providing an error feedback mechanism for projection errors at each stage. SRMD [40] proposed a stretching strategy to integrate kernel and noise information to cope with multiple degradation kernels in a single SR network. The breakthrough of transformer networks in natural language processing (NLP) has inspired researchers to use self-attention (SA) in vision tasks, and transformer-based SR models have come into being. SwinIR [23] adapted the Swin Transformer [25] to image restoration, combining the advantages of both CNNs and transformers. ELAN [41], built upon SwinIR [23], removed redundant designs such as the masking strategy and relative position encoding to make the network slimmer. Reducing the number of parameters of transformer-based methods while alleviating the resulting performance drop is a feasible direction. Based on this, we introduce some light-weight blocks to further reduce parameters and boost the representational power, achieving good visual results.
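To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the degradation operator (blur with $k_s$, then direct subsampling by $s$); the helper name and the reflect padding are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def degrade(i_hr: torch.Tensor, kernel: torch.Tensor, scale: int) -> torch.Tensor:
    """Blur an HR image with kernel k_s, then subsample by the scaling factor s.

    i_hr:   HR image of shape (1, C, H, W)
    kernel: 2-D blur kernel of shape (kh, kw) with odd size, e.g. bicubic or Gaussian
    scale:  integer scaling factor s
    """
    c = i_hr.shape[1]
    # Apply the same 2-D kernel to every channel (depthwise convolution).
    weight = kernel.view(1, 1, *kernel.shape).repeat(c, 1, 1, 1)
    kh, kw = kernel.shape
    padded = F.pad(i_hr, (kw // 2, kw // 2, kh // 2, kh // 2), mode="reflect")
    blurred = F.conv2d(padded, weight, groups=c)
    return blurred[..., ::scale, ::scale]  # direct subsampling: keep every s-th pixel
```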
2.2 Blind Super-Resolution
Compared to SISR, blind SR assumes that the degradation blur kernels are unpredictable; in real-world applications, the SR kernel of an image is influenced by the sensor optics. In recent years, some blind SR methods [35, 12] have introduced prior knowledge into the deep learning area. However, these methods aim to make their models more robust to variations rather than explicitly calculating the SR kernel. In contrast, [29] estimates the kernel based on the recurrence of small image patches (e.g., 5×5, 7×7) across scales of a single image, but fails for SR scale factors larger than 2. KernelGAN [3], based on InGAN [31], estimates the SR kernel that best preserves the distribution of patches across scales of the LR image. ZSSR [32] trains a small fully convolutional network for each image from scratch to learn the image-specific internal structure, rather than adapting the training process to big data. However, it drastically increases the runtime by requiring thousands of backpropagation gradient updates at test time. In contrast, the proposed method significantly reduces the runtime with the help of a model pre-trained on a large-scale external dataset.
3 The Proposed S2R Framework
In this section, we introduce a double-win framework for ideal and blind SR (S2R), which consists of a light-weight transformer-based model (S2R transformer) and a novel coarse-to-fine training strategy, as shown in Fig. 1. With the proposed training strategy, the model achieves significant performance with only a few backpropagation gradient updates and offers great scalability for real-world applications. The learning process of the framework is divided into two stages: coarse-level learning on the general dataset and fast fine-tune on real-world images. The first stage pre-trains the proposed model on a large-scale external dataset to guarantee good visual results under ideal LR conditions and fast adaptation when the model is extended to real-world situations. The second stage boosts the scalability of our model to extend it to real-world applications.
(Fig. 1: Overview of the proposed S2R framework: (a) the architecture of the S2R transformer; (b) the coarse-to-fine training strategy.)
3.1 Network Architecture
As shown in Fig. 1(a), the proposed S2R transformer contains three modules: a shallow feature extraction module (SF), a deep feature extraction module (DF), and a high-quality image reconstruction module (RC). Specifically, given a low-resolution (LR) input $I_{LR}$, where $H$ and $W$ are the height and width of the LR image, respectively, we first use the shallow feature extraction module $H_{SF}$, which consists of only a single 3×3 convolution, to extract the local feature $I_{local}$:

$$I_{local} = H_{SF}(I_{LR}) \in \mathbb{R}^{H \times W \times C}, \qquad (2)$$

where $C$ is the feature channel number. The deep feature extraction module consists of a mobile bottleneck convolution (MBConv), denoted by $H_{mbconv}$, $N$ cascaded efficient long-range attention blocks (ELAB), denoted by $H_{ELAB}$, a deformable convolution [6] (Deform Conv), denoted by $H_{deform}$, and a 3×3 convolution.
$I_{local}$ then goes through the deep feature extraction module, denoted by $H_{DF}$. That is,

$$I_{deep} = H_{DF}(I_{local}), \qquad (3)$$

where $I_{deep}$ denotes the output of the deep feature extraction module. More specifically, the intermediate features $I_{mbconv}$, $I_1, I_2, \ldots, I_N$, $I_{deform}$, and the output deep feature $I_{deep}$ are extracted block by block as
$$I_{mbconv} = H_{mbconv}(I_{local}),\quad I_1 = H_{ELAB_1}(I_{mbconv}),\quad I_i = H_{ELAB_i}(I_{i-1}),\ i = 2, \ldots, N,\quad I_{deform} = H_{deform}(I_N),\quad I_{deep} = H_{CONV}(I_{deform}), \qquad (4)$$
where $H_{ELAB_i}$ denotes the $i$-th ELAB and $H_{CONV}$ is the last convolutional layer. Finally, taking $I_{deep}$ and $I_{local}$ as inputs, the HR image $I_{HQ}$ is reconstructed as

$$I_{HQ} = H_{RC}(I_{deep} + I_{local}), \qquad (5)$$

where $H_{RC}$ is the reconstruction module. Here we choose a sub-pixel convolution layer [30] to upsample the feature. The proposed model can be optimized with the loss functions commonly used for SR, such as L2 [7] and L1 [20, 24, 43]. For simplicity, we choose L1 as our loss function and minimize the L1 pixel loss.
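To make the data flow of Eqs. (2)-(5) concrete, the following is a minimal PyTorch sketch of the forward pass, assuming ×2 SR with 60 channels and 10 ELABs as used later in the experiments; `elab`, `mbconv`, and `deform_conv` are stand-ins for the blocks described in the next subsections, and the class is an illustration rather than the authors' implementation.

```python
import torch.nn as nn

class S2RTransformerSketch(nn.Module):
    def __init__(self, elab, mbconv, deform_conv, channels=60, n_elab=10, scale=2):
        super().__init__()
        self.h_sf = nn.Conv2d(3, channels, 3, padding=1)               # shallow features, Eq. (2)
        self.h_mbconv = mbconv                                         # MBConv block
        self.h_elab = nn.ModuleList([elab() for _ in range(n_elab)])   # N cascaded ELABs
        self.h_deform = deform_conv                                    # deformable convolution
        self.h_conv = nn.Conv2d(channels, channels, 3, padding=1)      # last conv of the DF module
        self.h_rc = nn.Sequential(                                     # reconstruction, Eq. (5)
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                                    # sub-pixel convolution [30]
        )

    def forward(self, i_lr):
        i_local = self.h_sf(i_lr)                  # Eq. (2)
        x = self.h_mbconv(i_local)                 # Eq. (4): MBConv first
        for blk in self.h_elab:                    # Eq. (4): ELAB_1 ... ELAB_N
            x = blk(x)
        i_deep = self.h_conv(self.h_deform(x))     # Eq. (4): Deform Conv + last conv
        return self.h_rc(i_deep + i_local)         # Eq. (5): residual connection + upsampling
```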
3.1.1 ELAB
Despite the great success of transformers in computer vision tasks, they suffer from explosive parameter counts and high computational complexity. Thus, inspired by [41], we introduce ELAB to achieve good performance with a relatively low number of parameters. In ELAB, the regular multi-layer perceptron is replaced with two shift-convs [36] and a simple ReLU activation, which helps enlarge the receptive field of the proposed model while sharing the same arithmetic complexity as two cascaded 1×1 convolutions. Moreover, to significantly reduce parameters, the accelerated self-attention mechanism (ASA) is utilized. In the regular self-attention calculation, three independent 1×1 convolutions are employed to map the input feature X into the query, key, and value feature maps; in ELAB, the key projection shares its weights with the query projection, which saves one 1×1 convolution in each self-attention. To compensate for the performance drop caused by the reduced number of parameters, group-wise multi-scale self-attention (GMSA) is used instead of the regular self-attention. The computational complexity of regular window-based self-attention [25] is determined by the window size M; in GMSA, the input feature map is divided into three groups, and a flexible window size is set for each group. In this way, the relative position bias used in regular transformer-based SISR models, which makes them fragile to resolution changes [38], can be removed. Based on this, the proposed model can also extract information at different scales and is more flexible with respect to the input, which makes it feasible to extend the model to real-world applications. Finally, batch normalization (BN) [16] is utilized to replace layer normalization (LN), accelerating the calculation by avoiding fragmenting it into many inefficient element-wise operations.
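The following is a minimal sketch of the GMSA/ASA ideas described above, assuming PyTorch: the channels are split into three groups with different window sizes, and the query and key share a single 1×1 convolution. It omits the shift-conv branch, batch normalization, and residual connections, and is an illustration of the mechanism rather than the authors' ELAB implementation.

```python
import torch
import torch.nn as nn

class GMSASketch(nn.Module):
    def __init__(self, channels=60, window_sizes=(4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.group_ch = channels // len(window_sizes)
        # ASA: one shared 1x1 conv produces both query and key, a second one the value.
        self.to_qk = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        qk, v = self.to_qk(x), self.to_v(x)
        outs = []
        for i, m in enumerate(self.window_sizes):       # one window size per channel group
            sl = slice(i * self.group_ch, (i + 1) * self.group_ch)
            q = k = self._windows(qk[:, sl], m)         # query and key share the same projection
            val = self._windows(v[:, sl], m)
            attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
            out = attn.softmax(dim=-1) @ val
            outs.append(self._unwindows(out, m, h, w))
        return self.proj(torch.cat(outs, dim=1))

    def _windows(self, x, m):
        b, c, h, w = x.shape                             # assumes h and w divisible by m
        x = x.view(b, c, h // m, m, w // m, m)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, m * m, c)

    def _unwindows(self, x, m, h, w):
        c = x.shape[-1]
        x = x.view(-1, h // m, w // m, m, m, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(-1, c, h, w)
```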
3.1.2 MBConv
For further performance improvement, inspired by the intuition that the bottleneck actually contains all the necessary information, MBConv is introduced to enhance the representational power. It is based on depthwise separable convolution [5], which inherently has fewer parameters than a full convolution, so it does not introduce many additional parameters. Furthermore, a squeeze-and-excitation optimization [13] is added to enlarge the receptive field further. Thanks to its inverted design, it is more memory efficient, which further decreases the running time and parameter count of the bottleneck convolution. Therefore, this light-weight block helps the proposed model obtain a larger receptive field and better performance. Recent studies [10, 37] suggest that using convolutions in the early stages benefits the performance of Vision Transformers, so we follow this design and place MBConv in the early stage of the deep feature extraction module.
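Below is a minimal sketch of an MBConv-style block (1×1 expansion, depthwise 3×3 convolution, squeeze-and-excitation, and 1×1 projection with an inverted residual), assuming PyTorch; the expansion ratio, activation, and SE ratio are illustrative assumptions, not the paper's exact settings.

```python
import torch.nn as nn

class MBConvSketch(nn.Module):
    def __init__(self, channels=60, expand=4, se_ratio=0.25):
        super().__init__()
        mid = channels * expand
        self.expand = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.GELU())
        # Depthwise separable convolution keeps the parameter count low.
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.GELU())
        # Squeeze-and-excitation enlarges the effective receptive field.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, int(mid * se_ratio), 1), nn.GELU(),
            nn.Conv2d(int(mid * se_ratio), mid, 1), nn.Sigmoid())
        self.project = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        y = self.dwconv(self.expand(x))
        y = y * self.se(y)            # channel-wise re-weighting
        return x + self.project(y)    # inverted residual connection
```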
3.1.3 Deform Conv
To extract irregular structural features, a Deform Conv is added. It is based on the idea of augmenting the spatial sampling locations in the module with additional offsets and learning these offsets from the target task without introducing additional supervision. Thus, it is light-weight on its own and does not sacrifice computing resources to trade off performance. Furthermore, as the receptive field is further enlarged, it facilitates generalization to new tasks with unpredictable geometric transformations, improving the performance on SISR tasks and supporting the subsequent migration of the proposed model to real-world applications.
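As a sketch of how such a layer can be realized with learned offsets, the snippet below uses torchvision.ops.DeformConv2d; predicting the offsets with a plain 3×3 convolution is a common pattern and an assumption here, not necessarily the authors' exact design.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlockSketch(nn.Module):
    def __init__(self, channels=60, kernel_size=3):
        super().__init__()
        # Two offsets (dx, dy) per kernel sampling location.
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform = DeformConv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)   # offsets are learned from the task itself
        return self.deform(x, offsets)
```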
3.2 The Proposed Coarse-to-Fine Training Strategy
To guarantee impressive performance in ideal SR conditions while extending the proposed model to applications in the real world, we propose a coarse-to-fine training strategy. Fig. 1(b) shows the pipeline of the proposed fast training strategy. It consists of two stages, coarse-level learning on the general dataset and fast fine-tune on real-world images.
3.2.1 Coarse-Level Learning on the General Dataset
Most blind SR methods train directly on the input test image, without fully exploiting large-scale external datasets. Thus, as shown on the left side of the dotted line in Fig. 1(b), we first pre-train the S2R transformer on an external dataset (a noise-free "bicubic" dataset). This procedure guarantees good performance on SISR tasks, even after the proposed model is migrated to real-world applications.
3.2.2 Fast Fine-Tune on Real-World Images
To avoid deviating from the goal of explicitly calculating the blur kernel for blind SR, rather than merely making the proposed model robust to variations in the LR degradation kernel, it is necessary to estimate the SR kernel that best preserves the distribution of patches across scales of the LR image, which compensates for the performance drop caused by blur kernel mismatch. Since recent studies [3, 29, 44] have explored kernel estimation well, we estimate the blur kernel with the method in [3], where a generator is trained to produce a downscaled version of the LR test image such that a discriminator cannot distinguish the patch distribution of the downscaled image from that of the original.
The diversity of the LR-HR relations within a single image is significantly smaller and closely related to its noise and blur kernel. Thus, as shown on the right side of the dotted line in Fig. 1(b), given an LR real-world test image $I_{LR}$, we use the estimated blur kernel to generate a lower-resolution image $I_{LRson}$ and synthesize them as a data pair LR-LRson. Then, we use the pre-trained model obtained from the coarse-level learning stage to generate the corresponding high-resolution version $I_{SR}$ of $I_{LRson}$, and optimize the model to learn the residual between $I_{SR}$ and $I_{LR}$. With the help of training on the external dataset and the proposed light-weight transformer-based model, the runtime can be drastically reduced, so only 10 iterations are needed. Furthermore, a learning-rate update policy is applied: we start with a learning rate of 2×10⁻², reduce it to 1×10⁻² after the fourth iteration, and finish with 5×10⁻³. Finally, the method in [24], where 8 different outputs are generated for several rotations and flips of the test image $I_{LR}$ and then combined, is utilized to improve the performance on real-world images, with the mean replaced by the median of these 8 outputs. We further combine it with the back-projection of [17], so that each of the 8 output images goes through multiple iterations of back-projection, and the final correction of the median image is also done through back-projection.
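A minimal sketch of the internal data-pair construction and the eight-output geometric self-ensemble described above, assuming PyTorch; `degrade` reuses the Eq. (1) helper sketched in Section 2.1, the estimated kernel is assumed to come from the KernelGAN-style estimator of [3], and the back-projection refinement is omitted.

```python
import torch

def make_internal_pair(i_lr, estimated_kernel, scale=2):
    """Build the LR / LR-son pair from a single real-world test image."""
    i_lr_son = degrade(i_lr, estimated_kernel, scale)  # degrade LR with the estimated kernel
    return i_lr_son, i_lr                              # fine-tune input / target

def self_ensemble(model, i_lr):
    """Median of 8 outputs obtained from rotations and horizontal flips."""
    outs = []
    for k in range(4):                                 # 0, 90, 180, 270 degree rotations
        for flip in (False, True):
            x = torch.rot90(i_lr, k, dims=(-2, -1))
            if flip:
                x = torch.flip(x, dims=(-1,))
            y = model(x)
            if flip:                                   # undo flip, then undo rotation
                y = torch.flip(y, dims=(-1,))
            outs.append(torch.rot90(y, -k, dims=(-2, -1)))
    return torch.median(torch.stack(outs), dim=0).values
```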
Compared to conventional blind SR methods, such as ZSSR [32], and conventional SISR methods, such as SwinIR [23], the advantages of the proposed framework are twofold:
• In terms of network architecture, compared to ZSSR, whose backbone is a fully convolutional network, our model is built on transformer blocks, which have more powerful learning and representation capabilities, and thus achieves better performance than ZSSR.
• In terms of training strategy, ZSSR trains from scratch on a small internal dataset for each SR task, while we pre-train the proposed model on an external dataset and then perform transfer learning to extend it to a new SR task. This enables fast adaptation to downstream tasks and real-world applications and further boosts the performance of the proposed model.
4 Experimental Results
4.1 Training Details
For the pre-training stage, we employ the DIV2K dataset [34] with 800 training images to train our model, which consists of 10 ELABs with 60 channels, an MBConv, and a Deform Conv. The model is trained using the ADAM [19] optimizer with a learning rate of 2×10⁻⁴. Following ELAN [41], the multi-scale window sizes are set to 4×4, 8×8, and 16×16. Training is conducted with PyTorch on NVIDIA 2080Ti GPUs. For the transfer-learning stage, the model is fine-tuned for 10 iterations using the ADAM [19] optimizer, starting with a learning rate of 2×10⁻², reducing it to 1×10⁻² after the fourth iteration, and finishing with 5×10⁻³.
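A minimal sketch of the 10-iteration fine-tune loop with the learning-rate schedule stated above (2×10⁻² → 1×10⁻² → 5×10⁻³), assuming PyTorch; the exact iteration at which the final drop happens is not specified in the text, so the split below is an assumption.

```python
import torch
import torch.nn.functional as F

def fast_fine_tune(model, i_lr_son, i_lr, n_iters=10):
    """Run the fast fine-tune on the internal LR / LR-son pair."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-2)
    for it in range(n_iters):
        # Piecewise schedule: 2e-2 for the first 4 iterations, then 1e-2,
        # then 5e-3 for the last iterations (the exact split is assumed).
        lr = 2e-2 if it < 4 else (1e-2 if it < 7 else 5e-3)
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = F.l1_loss(model(i_lr_son), i_lr)  # L1 pixel loss, as in pre-training
        loss.backward()
        optimizer.step()
    return model
```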
4.2 Evaluations on Ideal Super-Resolution Dataset
We compare our model with several popular SISR models, including the CNN-based models CARN [1], IMDN [15], LAPAR-A [22], and LatticeNet [26], and the transformer-based models SwinIR-light [23] and ELAN-light [41], on popular benchmarks: Set5 [4], Set14 [39], BSD100 [27], Urban100 [14], and Manga109 [28]. Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity (SSIM) are used as evaluation metrics, calculated on the Y channel after converting RGB to YCbCr format.
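For reference, a minimal sketch of the evaluation protocol (PSNR computed on the Y channel of YCbCr), assuming RGB tensors in [0, 1]; the ITU-R BT.601 conversion coefficients are a standard choice rather than a detail taken from the paper.

```python
import torch

def rgb_to_y(img):
    """BT.601 luma from an RGB tensor of shape (..., 3, H, W) with values in [0, 1]."""
    r, g, b = img[..., 0, :, :], img[..., 1, :, :], img[..., 2, :, :]
    return (65.481 * r + 128.553 * g + 24.966 * b + 16.0) / 255.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of the SR output and the HR ground truth."""
    mse = torch.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * torch.log10(1.0 / mse)
```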
4.2.1 Quantitative comparison
The quantitative results of different methods are reported in Table 1. The transformer-based methods outperform many CNN-based methods by exploiting the self-similarity of images. However, among them, SwinIR-light [23] suffers from its high number of parameters, which places a heavy burden on deployment and limits its application in real-world situations. Compared to it, ELAN-light [41] improves the performance and accelerates the inference for real-world applications. Among the light-weight CNN-based SR models, the proposed model has slightly more parameters than LAPAR-A [22], while its PSNR/SSIM outperforms all of them by a clear margin. Among the light-weight transformer-based SR models, our model has the lowest number of parameters, 578K. Compared to the SOTA model ELAN [41], the PSNR/SSIM of the proposed model is comparable, with almost identical visual results as shown in Fig. 2(e) and Fig. 2(f). It is worth mentioning that our model outperforms ELAN significantly on real-world tasks; details can be found in Section 4.3.
Table 1: Quantitative comparison (PSNR/SSIM) with light-weight SISR models for ×2 SR on five benchmark datasets.

Scale | Model | #Params (K) | Set5 [4] PSNR/SSIM | Set14 [39] PSNR/SSIM | BSD100 [27] PSNR/SSIM | Urban100 [14] PSNR/SSIM | Manga109 [28] PSNR/SSIM
---|---|---|---|---|---|---|---
×2 | CARN [1] | 1592 | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | 38.36/0.9765
×2 | IMDN [15] | 694 | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 | 38.88/0.9774
×2 | LAPAR-A [22] | 548 | 38.01/0.9605 | 33.62/0.9183 | 32.19/0.8999 | 32.10/0.9283 | 38.67/0.9772
×2 | LatticeNet [26] | 756 | 38.06/0.9607 | 33.70/0.9187 | 32.20/0.8999 | 32.25/0.9288 | — / —
×2 | SwinIR-light [23] | 878 | 38.14/0.9611 | 33.86/0.9206 | 32.31/0.9012 | 32.76/0.9340 | 39.12/0.9783
×2 | ELAN-light [41] | 582 | 38.17/0.9611 | 33.94/0.9207 | 32.30/0.9012 | 32.76/0.9340 | 39.11/0.9782
×2 | S2R transformer (Ours) | 578 | 38.15/0.9611 | 33.88/0.9206 | 32.30/0.9013 | 32.61/0.9333 | 39.01/0.9780
4.2.2 Qualitative comparison
We then qualitatively compare the SR quality of different light-weight models. For a more prominent comparison, we show ×4 SR results on the example image instead of ×2 SR, as shown in Fig. 2. From Fig. 2(b), Fig. 2(c), and Fig. 2(d), we can see that SwinIR-light [23] and all the CNN-based models produce blurry and distorted edges under ideal SR conditions. In contrast, compared to Fig. 2(g), Fig. 2(e) and Fig. 2(f) show that the transformer-based models restore more accurate and clear structures and have the potential to recover sharp edges under ideal SR conditions, demonstrating the effectiveness of introducing self-attention. It is worth mentioning that the visual results of the proposed model are almost comparable to ELAN-light [41], while its parameter count is the lowest among the light-weight transformer-based SR models.
(Fig. 2: Qualitative comparison of ×4 SR results produced by different light-weight SR models on an example image.)
4.3 Evaluations on Real-World Images with Unpredictable Blur Kernels
To imitate real-world images, we randomly generate various blur kernels to characterize different fuzzy conditions caused by various unpredictable factors, and then evaluate the proposed framework under these blur kernel conditions. We consider four scenarios, listed below: severe aliasing, an isotropic Gaussian, an anisotropic Gaussian, and an isotropic Gaussian followed by bicubic subsampling (a kernel-generation sketch follows the list). PSNR and SSIM are used as evaluation metrics, calculated on the Y channel after converting RGB to YCbCr format.
• $g_{0.2}$: isotropic Gaussian blur kernel with width $\sigma = 0.2$, followed by direct subsampling.
• $g_{2.0}$: isotropic Gaussian blur kernel with width $\sigma = 2.0$, followed by direct subsampling.
• $g_{ani}$: anisotropic Gaussian with widths $\lambda_1 = 4.0$ and $\lambda_2 = 1.0$ and rotation angle $\theta = -0.5$, with covariance
$$\Sigma = R_\theta \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} R_\theta^{\top}, \qquad (6)$$
followed by direct subsampling.
• $g_{1.3}^{b}$: isotropic Gaussian blur kernel with width $\sigma = 1.3$, followed by bicubic subsampling.
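A minimal sketch of how the Gaussian kernels above can be generated from the rotated covariance of Eq. (6), using NumPy; the kernel support size, and whether the listed widths enter as standard deviations or as covariance eigenvalues, are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(size=15, lam1=4.0, lam2=1.0, theta=-0.5):
    """Rotated anisotropic Gaussian kernel; use lam1 == lam2 for the isotropic cases."""
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    sigma = rot @ np.diag([lam1, lam2]) @ rot.T           # covariance of Eq. (6)
    inv = np.linalg.inv(sigma)
    coords = np.arange(size) - size // 2
    xx, yy = np.meshgrid(coords, coords)
    pts = np.stack([xx, yy], axis=-1)                     # (size, size, 2) grid offsets
    kernel = np.exp(-0.5 * np.einsum("hwi,ij,hwj->hw", pts, inv, pts))
    return kernel / kernel.sum()                          # normalize to sum to 1
```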
4.3.1 Quantitative comparison
The results for various kernels are shown in Table 2. ZSSR [32] is a classical blind SR model: for each LR input image, it trains from scratch to recover the counterpart HR image. However, it requires thousands of gradient updates, which increases the runtime significantly. KernelGAN [3] estimates the blur kernel and then generates the HR outputs with the help of ZSSR [32]; the performance drop caused by inaccurate blur kernel estimation is further amplified by ZSSR [32]. Our method shows better PSNR/SSIM than the regular blind SR methods, with the help of the proposed pre-trained transformer-based model, which enhances the representational power and accelerates the inference. Moreover, with the proposed fast coarse-to-fine training strategy, the number of gradient updates can be reduced from the original 3000 in ZSSR [32] and KernelGAN [3] to only 10, greatly reducing the running time.
Table 2: Quantitative comparison (PSNR/SSIM) on Set5 (×2) under different blur kernels.

Scale | Kernel | Dataset | ZSSR [32] | KernelGAN [3] | ELAN [41] | SwinIR [23] | S2R (Ours)
---|---|---|---|---|---|---|---
×2 | $g_{0.2}$ | Set5 | 28.45/0.8592 | 21.11/0.5903 | 28.51/0.8694 | 28.46/0.8686 | 28.47/0.8684
×2 | $g_{2.0}$ | Set5 | 29.16/0.8602 | 26.17/0.8074 | 29.17/0.8612 | 29.17/0.8613 | 30.06/0.8890
×2 | $g_{ani}$ | Set5 | 28.41/0.8374 | 25.74/0.7802 | 28.44/0.8392 | 28.43/0.8390 | 28.93/0.8583
×2 | $g_{1.3}^{b}$ | Set5 | —/0.8991 | 29.17/0.8811 | 31.54/0.9000 | 31.58/0.9005 | 33.37/0.9239
Furthermore, from Table 3, we find that transformer-based models also show relatively good performance on real-world tasks, further proving the importance of introducing self-attention for feature extraction. Under the $g_{0.2}$ condition, the blur kernel width is relatively small, resulting in a relatively inaccurate blur kernel estimate that is subsequently amplified by the proposed coarse-to-fine training strategy, leading to a result slightly worse than ELAN [41]. Apart from this case, the proposed framework shows better PSNR/SSIM than the other methods.
4.3.2 Qualitative comparison
We then qualitatively compare the SR quality of different blind SR methods under real-world conditions with the kernel g. As shown in Fig. 3, our visual results outperform those of the other methods. Our method better mines the internal features of the image, obtaining sharper edges and higher-contrast results. Compared with Fig. 3(a), Fig. 3(b), Fig. 3(c), and Fig. 3(d), our results are far clearer and closest to the GT.
(Fig. 3: Qualitative comparison of different blind SR methods on a real-world example.)
4.4 Ablation Study
In order to further verify the necessity of the modules in our network and our proposed training recipe, we conduct ablation experiments on the corresponding dataset.
4.4.1 For Network Architecture
• This setting only contains ELAB in the deep feature extraction module, so it can be used to observe the effect of ELAB.
• This setting contains ELAB and MBConv in the deep feature extraction module, so it can be used to observe the effect of MBConv.
• This setting is the full model, containing ELAB, MBConv, and Deform Conv, so it can be used to verify the effect of Deform Conv.
The results are shown in Table 4. Benefiting from the efficient design of ELAB, our model achieves performance comparable to conventional SR models such as SwinIR [23] with fewer parameters. Moreover, based on the intuition that the bottleneck contains all the necessary information, MBConv compensates for the performance drop caused by the reduced number of parameters. Finally, Deform Conv enlarges the receptive field, further improving the task performance effectively.
Table 4: Ablation study on the network architecture (×2 SR).

Scale | Model | ELAB | MBConv | Deform Conv | #Params (K) | Set5 [4] PSNR/SSIM | Set14 [39] PSNR/SSIM | BSD100 [27] PSNR/SSIM | Urban100 [14] PSNR/SSIM | Manga109 [28] PSNR/SSIM
---|---|---|---|---|---|---|---|---|---|---
×2 | S2R transformer | ✓ | | | 523.45 | 38.12/0.9609 | 33.80/0.9203 | 32.27/0.9009 | 32.45/0.9325 | 38.85/0.9777
×2 | S2R transformer | ✓ | ✓ | | 562.94 | 38.14/0.9610 | 33.83/0.9205 | 32.28/0.9010 | 32.50/0.9330 | 38.95/0.9780
×2 | S2R transformer | ✓ | ✓ | ✓ | 578 | 38.15/0.9611 | 33.88/0.9206 | 32.30/0.9013 | 32.61/0.9333 | 39.01/0.9780
4.4.2 For Training Strategy
• In this setting, we directly use the proposed model to deal with real-world tasks, without the proposed fast training strategy.
• In this setting, the proposed model is pre-trained, and then fine-level transfer learning is performed to adapt it to the real-world blur kernel. This setting is used to verify the effect of the proposed coarse-to-fine training strategy.
The key point of our training recipe is to fine-tune on several other blur kernels after pre-training our model, with the purpose of broadening the scalability of our model while reducing the runtime. With the help of self-similarity, transformer-based models benefit the performance on blind SR tasks. Moreover, the pre-trained model avoids up to 3000 backpropagation gradient updates, reducing the runtime significantly.
Table 5: Ablation study on the training strategy (×2 SR on Set5 under different blur kernels).

Scale | Model | Training Strategy | Set5 ($g_{0.2}$) PSNR/SSIM | Set5 ($g_{2.0}$) PSNR/SSIM | Set5 ($g_{ani}$) PSNR/SSIM | Set5 ($g_{1.3}^{b}$) PSNR/SSIM
---|---|---|---|---|---|---
×2 | S2R transformer | | 28.36/0.8664 | 29.17/0.8614 | 28.43/0.8391 | 31.58/0.9007
×2 | S2R transformer | ✓ | 28.47/0.8684 | 30.06/0.8890 | 28.93/0.8583 | 33.37/0.9239
5 Conclusion
In this paper, we proposed a novel framework, S2R, which consists of a light-weight transformer-based model and a novel coarse-to-fine training strategy, achieving excellent performance and good visual results on both SISR and blind SR tasks. On one hand, light-weight blocks are utilized in the proposed model to further reduce the number of parameters while maintaining great performance. On the other hand, the proposed training strategy broadens the scalability of the proposed model in real-world situations. Extensive experiments show that, whether in real-world applications or under ideal conditions, the proposed model outperforms previous methods in terms of visual results with the lowest number of parameters, 578K. Moreover, the proposed framework accelerates the transfer-learning process in real-world situations by drastically reducing the number of backpropagation gradient updates; compared to ZSSR [32], the reduction is as high as 300 times.
References
- [1] Ahn, N., Kang, B., Sohn, K.A.: Fast, accurate, and lightweight super-resolution with cascading residual network. In: Proceedings of the European conference on computer vision (ECCV). pp. 252–268 (2018)
- [2] Begin, I., Ferrie, F.: Blind super-resolution using a learning-based approach. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. vol. 2, pp. 85–89. IEEE (2004)
- [3] Bell-Kligler, S., Shocher, A., Irani, M.: Blind super-resolution kernel estimation using an internal-gan. Advances in Neural Information Processing Systems 32 (2019)
- [4] Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding (2012)
- [5] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1251–1258 (2017)
- [6] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 764–773 (2017)
- [7] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2015)
- [8] Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. pp. 391–407. Springer (2016)
- [9] Efrat, N., Glasner, D., Apartsin, A., Nadler, B., Levin, A.: Accurate blur models vs. image priors in single image super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2832–2839 (2013)
- [10] Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M.: Levit: a vision transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12259–12269 (2021)
- [11] Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1664–1673 (2018)
- [12] He, H., Siu, W.C.: Single image super-resolution using gaussian process regression. In: CVPR 2011. pp. 449–456. IEEE (2011)
- [13] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
- [14] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5197–5206 (2015)
- [15] Hui, Z., Gao, X., Yang, Y., Wang, X.: Lightweight image super-resolution with information multi-distillation network. In: Proceedings of the 27th acm international conference on multimedia. pp. 2024–2032 (2019)
- [16] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp. 448–456. PMLR (2015)
- [17] Irani, M., Peleg, S.: Improving resolution by image registration. CVGIP: Graphical models and image processing 53(3), 231–239 (1991)
- [18] Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1637–1645 (2016)
- [19] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [20] Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 624–632 (2017)
- [21] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4681–4690 (2017)
- [22] Li, W., Zhou, K., Qi, L., Jiang, N., Lu, J., Jia, J.: Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. Advances in Neural Information Processing Systems 33, 20343–20355 (2020)
- [23] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1833–1844 (2021)
- [24] Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 136–144 (2017)
- [25] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
- [26] Luo, X., Xie, Y., Zhang, Y., Qu, Y., Li, C., Fu, Y.: Latticenet: Towards lightweight image super-resolution with lattice block. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. pp. 272–289. Springer (2020)
- [27] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 416–423. IEEE (2001)
- [28] Matsui, Y., Ito, K., Aramaki, Y., Fujimoto, A., Ogawa, T., Yamasaki, T., Aizawa, K.: Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76, 21811–21838 (2017)
- [29] Michaeli, T., Irani, M.: Nonparametric blind super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 945–952 (2013)
- [30] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1874–1883 (2016)
- [31] Shocher, A., Bagon, S., Isola, P., Irani, M.: Ingan: Capturing and retargeting the "DNA" of a natural image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4492–4501 (2019)
- [32] Shocher, A., Cohen, N., Irani, M.: “zero-shot” super-resolution using deep internal learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3118–3126 (2018)
- [33] Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3147–3155 (2017)
- [34] Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 114–125 (2017)
- [35] Wang, Q., Tang, X., Shum, H.: Patch based blind image super resolution. In: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1. vol. 1, pp. 709–716. IEEE (2005)
- [36] Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J., Keutzer, K.: Shift: A zero flop, zero parameter alternative to spatial convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 9127–9135 (2018)
- [37] Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help transformers see better. Advances in Neural Information Processing Systems 34, 30392–30400 (2021)
- [38] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34, 12077–12090 (2021)
- [39] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Curves and Surfaces: 7th International Conference, Avignon, France, June 24-30, 2010, Revised Selected Papers 7. pp. 711–730. Springer (2012)
- [40] Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3262–3271 (2018)
- [41] Zhang, X., Zeng, H., Guo, S., Zhang, L.: Efficient long-range attention network for image super-resolution. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVII. pp. 649–667. Springer (2022)
- [42] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European conference on computer vision (ECCV). pp. 286–301 (2018)
- [43] Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2472–2481 (2018)
- [44] Zontak, M., Irani, M.: Internal statistics of a single natural image. In: CVPR 2011. pp. 977–984. IEEE (2011)