
Video Super-Resolution with Long-Term Self-Exemplars

Guotao Meng gmeng@connect.ust.hk HKUST Yue Wu ywudg@connect.ust.hk HKUST Sijin Li sijin.li@dji.com DJI  and  Qifeng Chen cqf@ust.hk HKUST
Abstract.

Existing video super-resolution methods often utilize a few neighboring frames to generate a higher-resolution image for each frame. However, the redundant information between distant frames has not been fully exploited in these methods: corresponding patches of the same instance appear across distant frames at different scales. Based on this observation, we propose a video super-resolution method with long-term cross-scale aggregation that leverages similar patches (self-exemplars) across distant frames. Our model also includes a multi-reference alignment module to fuse the features derived from similar patches: we fuse the features of distant references to perform high-quality super-resolution. We also propose a novel and practical training strategy for reference-based super-resolution. To evaluate the performance of our proposed method, we conduct extensive experiments on our collected CarCam dataset and the Waymo Open dataset, and the results demonstrate that our method outperforms state-of-the-art methods. Our source code will be publicly available.

Figure 1. Visual comparison between our approach and state-of-the-art video super-resolution methods: Bicubic, RBPN (Haris et al., 2019), TDAN (Tian et al., 2020), TGA (Isobe et al., 2020b), MuCAN (Li et al., 2020), RSDN (Isobe et al., 2020a), ours, and ground truth.

1. Introduction

Super-resolution (SR) is a fundamental problem in computational photography, which aims to reconstruct a high-resolution (HR) image from a single or a sequence of low-resolution (LR) images. While image super-resolution (Haris et al., 2018; Wang et al., 2018b; Dong et al., 2016b; Ledig et al., 2017; Kim et al., 2016; Lim et al., 2017; Huang et al., 2015a; Dong et al., 2014, 2016a) exploits spatial information to recover missing details, video super-resolution (VSR) (Ma et al., 2015; Sajjadi et al., 2018; Wang et al., 2019; Haris et al., 2019; Tian et al., 2020; Chu et al., 2020; Jo et al., 2018; Yi et al., 2019; Tao et al., 2017; Wang et al., 2018a; Huang et al., 2015b; Caballero et al., 2017) needs to utilize additional temporal information from other frames to reconstruct clear images. Currently, VSR is widely adopted in video surveillance, satellite imagery, computational photography, etc.

A typical limitation of existing VSR methods is that only a few neighboring frames are utilized (usually $3\sim 7$ frames). Much of the information in long-term frames is rarely exploited, based on the assumption that nearby frames probably contain similar content while distant frames do not. However, long videos whose distant frames share relevant content are ubiquitous, especially in driving scenarios.

A large portion of previous methods relies on motion compensation (Liao et al., 2015; Shi et al., 2016; Xue et al., 2019; Wang et al., 2019; Tian et al., 2020). These methods first perform optical flow estimation or deformable convolution (Dai et al., 2017) to align the frames and then use the aligned frames to reconstruct the image. However, estimating dense optical flow between two distant frames is difficult, and imperfect flow estimation often leads to unsatisfactory artifacts in the SR results of these flow-based VSR approaches. Thus, motion-based methods are not applicable when exploiting long-term information. Another stream of VSR methods (Haris et al., 2019; Isobe et al., 2020a; Sajjadi et al., 2018) uses recurrent models to store long-term information. However, these methods usually employ a fixed-size feature vector to store previous content, so it is difficult to memorize high-frequency details. In contrast, our method exploits long-term self-exemplars to reconstruct videos better.

According to our observation, patch recurrence is abundant in a video: an object may appear small and blurry in one frame but large and clear in another, because its distance to the camera changes over time. A patch of a high-resolution object in another frame can therefore be used as a reference for super-resolution.

We introduce a long-term non-local aggregation method to leverage the most similar patches across frames, as shown in Fig. 2. Because the information contained in the whole sequence is redundant and difficult for a network to process directly, we propose a self-exemplar retrieval module to search for self-exemplars across frames. For each frame, we first uniformly divide the frame into several patches. Each patch is then used as a query to search for higher-resolution global self-exemplars and for local self-exemplars at the same resolution. For the global self-exemplars, we propose a global texture aggregation module to select useful references and initially align them with the query. We then propose a feature alignment module that aligns the feature maps of these self-exemplars by affine transformation, and a multi-reference fusion module to fuse the global features. Finally, a long-term and short-term feature aggregation module fuses long-term and short-term information to reconstruct details. We also propose a novel training strategy to address the imbalance in the data distribution.

Commonly used VSR datasets contain only a few frames or exhibit slight camera motion. To demonstrate the effectiveness of our method, we choose the driving scene as a classical application scenario because it involves large camera motion. We use two datasets to evaluate the performance of our method: our collected CarCam dataset, which contains 139 video sequences, and the public Waymo Open dataset (Sun et al., 2020). Our method outperforms state-of-the-art (SOTA) methods on both datasets.

Our contributions can be summarized as follows:

  • To better exploit redundant information in distant frames, we propose a novel long-term cross-scale aggregation method leveraging self-exemplars.

  • We propose several novel modules to enhance the reconstruction performance: self-exemplar retrieval, multi-reference selection and pre-alignment, feature alignment, and a novel and practical training strategy.

  • We collected the new CarCam dataset with 139 dashcam videos. The experiments show that our proposed method outperforms state-of-the-art VSR methods on the CarCam and the Waymo Open datasets.

Figure 2. The pipeline of our method. We first obtain global self-exemplars and local self-exemplars using the patch retrieval module. A reference selection and pre-alignment module then selects the top $k$ references and produces the similarity map and distance map of these references. We compute the affine transformation parameters from the distance map, extract a feature map for each reference, and apply the affine transformation to align the feature maps. A multi-reference fusion network produces the global feature map $f_{global}$. Feature extractors also extract the features of the LR patch and the local references, $f_{p}$ and $f_{local}$. These features are concatenated and fed into an SR network to obtain the output.

2. Related Work

Super-resolution is a classical task in computer vision. Traditional methods propose example-based strategies (Freeman et al., 2002; Glasner et al., 2009; Schulter et al., 2015; Timofte et al., 2013, 2014; Yang et al., 2010b), self-similarity (Huang et al., 2015a; Yang et al., 2010a), and dictionary learning (Pérez-Pellitero et al., 2016; Yang et al., 2012). With the rapid development of deep learning, super-resolution has made great progress. In this section, we discuss works related to video super-resolution and reference-based SR.

Video super-resolution The results of VSR should contain realistic visual content and maintain the temporal consistency of output frames. Most VSR methods try to utilize the information of a few neighboring frames. FRVSR (Sajjadi et al., 2018) recurrently generates an SR frame from the previously estimated SR frame. VSR-DUF (Jo et al., 2018) proposes to learn a dynamic upsampling filter to avoid explicit motion compensation. PFNL (Yi et al., 2019) introduces a non-local operation to fuse multiple LR images to generate the SR output. RBPN (Haris et al., 2019) uses an encoder-decoder module to integrate spatial and temporal contexts recurrently. TecoGAN (Chu et al., 2020) proposes to use adversarial neural networks to improve the quality of SR. EDVR (Wang et al., 2019), TDAN (Tian et al., 2020), and MuCAN (Li et al., 2020) utilize deformable convolution modules to align frames. RSDN (Isobe et al., 2020a) divides the input into structure and detail components and proposes two-stream structure-detail blocks. VSR-TGA (Isobe et al., 2020b) groups input frames into subgroups by frame rate.

In our framework, accurate motion estimation is not required. We can efficiently utilize information in distant frames, even across tens of frames.

Reference-based SR Reference-based SR (RefSR) aims to extract high-resolution details from a reference image provided by the user. Some existing RefSR methods (Zheng et al., 2018; Yue et al., 2013) align the LR and Ref images by transformation. Another stream of RefSR methods (Zhang et al., 2019; Yang et al., 2020) employs feature patch matching to transfer HR textures from a Ref image. SRNTT (Zhang et al., 2019) adopts a pre-trained classification model to align the patches, while TTSR (Yang et al., 2020) employs a learnable feature extraction network.

Difference between our method and RefSR. First, in RefSR, users are required to provide a high-resolution image as the reference, whereas in our setting all the information is extracted from the video itself. Second, RefSR needs to exploit beneficial content even if the reference is not visually related; thus, SOTA RefSR methods align features at the patch level, which can corrupt the content. In our setting, the reference content is similar to the LR image, so we use an affine transformation module to align the image.

3. Method

Our key observation is that, for low-resolution content, similar and clearer high-resolution details are highly likely to exist in other frames, which provide valuable information for SR. Based on this observation, we propose to exploit the information from other frames as references to super-resolve a video frame. For each patch in the frame to be super-resolved, we first employ a Self-exemplar Retrieval module to search for larger self-exemplars from all the video frames as the global self-exemplars. We also use this retrieval module to search for self-exemplars of the same resolution from neighboring frames. We then use a Global Self-exemplar Selection and Pre-alignment module to select the most valuable references, and a Multiple Reference Feature Alignment module to align the reference features to the LR patch feature. Finally, the features of the input patch, the local reference patches, and the global reference patches are fused and fed into a network to generate the high-resolution output.

3.1. Self-exemplar Retrieval

Global self-exemplars. Because the time interval between a patch and its self-exemplars can be several seconds, it is difficult to compute accurate motion compensation or align frames precisely. Furthermore, since the scale and view angle of an instance may vary significantly in a video, it is difficult to obtain correspondences across the whole video by optical flow or object tracking. Thus, we propose a patch retrieval strategy to search for larger self-exemplars without motion estimation or optical flow.

We first define an increasing image scale sequence $C=[c_{1},...,c_{m}]$, where each scale $c_{e}>1$ and $e$ is the index. For each $c_{e}$, we first apply bicubic up-sampling on frame $I_{t}$ with scale $c_{e}$ to obtain an up-sampled query image $I_{t}{\uparrow}$, where $t$ is the time stamp. We also sequentially apply bicubic down-sampling and up-sampling on the global frames $G=\{I_{1},...,I_{T}\}$ with scale $c_{e}$ to obtain blurry frames $G{\downarrow\uparrow}$ that match the frequency band of $I_{t}{\uparrow}$. Then, we uniformly divide $I_{t}{\uparrow}$ and $I_{t}$ into $n$ overlapping patches $\tilde{P}_{t}=\{\tilde{p}_{t}^{1},\tilde{p}_{t}^{2},...,\tilde{p}_{t}^{n}\}$ and $P_{t}=\{p_{t}^{1},p_{t}^{2},...,p_{t}^{n}\}$.

$\tilde{p}_{t}^{i}$ and $p_{t}^{i}$ denote the $i$-th patch on $I_{t}{\uparrow}$ and $I_{t}$, respectively, and $n$ is the total number of patches. We use each patch $\tilde{p}_{t}^{i}$ as a query to search for larger self-exemplars in each frame of $G{\downarrow\uparrow}$ using template matching (Brunelli, 2009). The similarity metric in the template matching is the cosine distance.

After this operation, we obtain the answer patch $q_{e}$ with the highest cosine score for the query patch $p_{t}^{i}$ at scale $c_{e}$.

The set of global self-exemplars of $p_{t}^{i}$ is constructed as $\{M_{g}\}_{t}^{i}=\{q_{1}^{g},...,q_{m}^{g}\}$. For simplicity, $\{M_{g}\}_{t}^{i}$ is denoted as $M_{g}$. The inaccurate patches in $M_{g}$ are then filtered out by our global self-exemplar selection module, which will be discussed in detail later.
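The retrieval step can be sketched as follows. This is a minimal illustration under simplifying assumptions (OpenCV normalized cross-correlation standing in for the cosine score, and the query patch up-sampled directly rather than cropped from the up-sampled frame); the function name `retrieve_global_exemplars` and its defaults are ours, not from a released implementation.

```python
import cv2
import numpy as np

def retrieve_global_exemplars(frames, patch,
                              scales=(1.2, 1.4, 1.7, 2.1, 2.5, 2.9, 3.5)):
    """For one LR patch, return its best larger match per scale."""
    exemplars = []
    ph, pw = patch.shape[:2]
    for c in scales:
        # Up-sample the query patch to the target scale (bicubic).
        query = cv2.resize(patch, (int(pw * c), int(ph * c)),
                           interpolation=cv2.INTER_CUBIC)
        best = None
        for frame in frames:
            # Down- then up-sample the candidate frame so its frequency band
            # matches that of the up-sampled query.
            h, w = frame.shape[:2]
            blur = cv2.resize(frame, (int(w / c), int(h / c)),
                              interpolation=cv2.INTER_CUBIC)
            blur = cv2.resize(blur, (w, h), interpolation=cv2.INTER_CUBIC)
            if blur.shape[0] < query.shape[0] or blur.shape[1] < query.shape[1]:
                continue
            # Normalized cross-correlation acts as the cosine similarity score.
            res = cv2.matchTemplate(blur, query, cv2.TM_CCORR_NORMED)
            _, score, _, (x, y) = cv2.minMaxLoc(res)
            if best is None or score > best[0]:
                # Keep the sharp crop from the original frame, not the blurred copy.
                best = (score, frame[y:y + query.shape[0], x:x + query.shape[1]])
        if best is not None:
            exemplars.append(best[1])
    return exemplars
```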

Figure 3. Left: the patch to be super-resolved. Right: samples from the global self-exemplars.

Local self-exemplars. We use the patch retrieval introduced above to search for the most similar patches of $p_{t}^{i}$ in its neighboring area over the neighboring frames $N=\{I_{t-2},I_{t-1},I_{t+1},I_{t+2}\}$ with scale factor $1$. The patch retrieval result is denoted as the local self-exemplars $M_{l}=\{q_{1}^{l},q_{2}^{l},...\}$.

3.2. Global Texture Aggregation

In this module, the features of the global self-exemplars are selected, aligned, and fused. For simplicity, in this section we denote the patch to be super-resolved as $p$ and the global self-exemplars of $p$ as $M$.

3.2.1. Global Self-exemplar Selection and Pre-alignment

To fuse the information in the self-exemplars of a patch, simply concatenating the feature of each patch in $M$ is not a viable fusion strategy, because some of the patches in $M$ may contain irrelevant content that provides misleading information. Moreover, directly computing the similarity between the patches in $M$ and $p$ leads to unsatisfactory results, because the smaller patches in $M$ often have higher similarity scores but provide fewer high-frequency details. Thus, we design a novel metric to measure the quality of a reference and select the most valuable patches from $M$.

For each patch $q$ in $M$, we first up-sample $p$ to $p{\uparrow}$ to match the spatial resolution of $q$, and we apply sequential down-sampling and up-sampling on $q$ to obtain $q{\downarrow\uparrow}$, which matches the frequency band of $p{\uparrow}$. We use a pretrained VGG-19 (Simonyan and Zisserman, 2015) network $\phi$ to extract features of $p{\uparrow}$ and $q{\downarrow\uparrow}$.

Like the feature swapping operation in (Zhang et al., 2019), we unfold the features of $p{\uparrow}$ and $q{\downarrow\uparrow}$ into $3\times 3$ feature blocks for feature matching. The cosine distance is used to measure the similarity between the feature blocks:

(1) $s_{g,h}=\left\langle\frac{B_{g}(\phi(p\uparrow))}{\left\|B_{g}(\phi(p\uparrow))\right\|},\frac{B_{h}(\phi(q\downarrow\uparrow))}{\left\|B_{h}(\phi(q\downarrow\uparrow))\right\|}\right\rangle,$

where $B_{g}(\cdot)$ denotes sampling the $g$-th $3\times 3$ feature block, and $s_{g,h}$ is the similarity between the $g$-th feature block of $p{\uparrow}$ and the $h$-th feature block of $q{\downarrow\uparrow}$. Then we search over all the reference feature blocks:

(2) $h^{*}=\underset{h}{\operatorname{argmax}}\ s_{g,h},$
(3) $S_{g}=s_{g,h^{*}},$

where $S_{g}$ is the similarity map between $p$ and the answer $q$ at position $g$. Unlike the feature swapping operation, we do not directly swap the feature blocks of $\phi(q)$ to form the aligned feature map for $p$. Feature swapping averages the swapped features in the regions where they overlap, which results in texture corruption, as shown in Fig. 4. Instead, we use a novel affine-transformation-based alignment module to align the reference features, which will be discussed in detail in Sec. 3.2.2.
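The block matching of Eqs. (1)-(3) can be sketched with unfold-based cosine similarity. The sketch assumes pre-computed VGG feature maps `feat_p` (for the up-sampled patch) and `feat_q` (for the blurred reference) of shape (C, H, W); the function name `match_feature_blocks` is illustrative.

```python
import torch
import torch.nn.functional as F

def match_feature_blocks(feat_p, feat_q, block=3):
    # Unfold into 3x3 feature blocks: columns of shape (C*block*block, N).
    p_blocks = F.unfold(feat_p.unsqueeze(0), kernel_size=block, padding=block // 2)[0]
    q_blocks = F.unfold(feat_q.unsqueeze(0), kernel_size=block, padding=block // 2)[0]
    # Cosine similarity between every query block g and reference block h.
    p_norm = F.normalize(p_blocks, dim=0)   # (D, Np)
    q_norm = F.normalize(q_blocks, dim=0)   # (D, Nq)
    sim = p_norm.t() @ q_norm               # (Np, Nq), i.e. s_{g,h}
    S, h_star = sim.max(dim=1)              # similarity map S_g and argmax index h*
    return S, h_star
```

S can be reshaped to the spatial size of `feat_p`, and `h_star` converted to (x, y) coordinates on `feat_q` for the distance map and the affine fitting described below.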

We define a novel distance map to measure the content match level of references for self-exemplar selection.

(4) $D_{g,h^{*}}=\left\|(x_{g},y_{g})-(x_{h^{*}},y_{h^{*}})\right\|_{2}^{2},$

where $(x_{g},y_{g})$ and $(x_{h^{*}},y_{h^{*}})$ are the spatial coordinates of the $g$-th feature block of $\phi(p{\uparrow})$ and the $h^{*}$-th feature block of $\phi(q{\downarrow\uparrow})$. If $q$ has content relevant to $p$, most of the matched feature block pairs from $\phi(p{\uparrow})$ and $\phi(q{\downarrow\uparrow})$ appear at similar positions. Based on this observation, we filter out the inaccurate patches in $M$ whose $mean(D)>\delta$ for a threshold $\delta$. The largest $k$ remaining patches in $M$ are then selected as $M^{\prime}=\{q^{\prime}_{1},q^{\prime}_{2},...,q^{\prime}_{k}\}$. If the number of remaining patches is smaller than $k$, the patches with the smallest $mean(D)$ are used to fill $M^{\prime}$. $\delta$ and $k$ are empirically set to 0.1 and 3, respectively.
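A possible realization of the distance map and the selection rule is sketched below, continuing from the matching sketch above. Coordinates are normalized to [0, 1], which is our assumption (the paper does not specify the coordinate scale for the threshold $\delta=0.1$); the helpers `distance_map` and `select_references` are ours.

```python
import torch

def distance_map(h_star, Hp, Wp, Hq, Wq):
    # Coordinates of each query block g on the query feature grid (normalized).
    g = torch.arange(Hp * Wp)
    gx, gy = (g % Wp).float() / Wp, (g // Wp).float() / Hp
    # Coordinates of the matched block h* on the reference feature grid.
    hx, hy = (h_star % Wq).float() / Wq, (h_star // Wq).float() / Hq
    return (gx - hx) ** 2 + (gy - hy) ** 2          # D_{g,h*}

def select_references(candidates, delta=0.1, k=3):
    """candidates: list of (patch, D) pairs; keep the k largest accepted patches."""
    accepted = [(p, D) for p, D in candidates if D.mean() <= delta]
    rejected = sorted((c for c in candidates if c[1].mean() > delta),
                      key=lambda c: c[1].mean())
    # Prefer accepted patches (largest first); back-fill with the best rejected ones.
    accepted.sort(key=lambda c: c[0].shape[-1] * c[0].shape[-2], reverse=True)
    selected = accepted[:k] + rejected[:max(0, k - len(accepted))]
    return [p for p, _ in selected]
```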

3.2.2. Multiple Reference Feature Alignment

After obtaining multiple references, we abandon the previously commonly used feature swapping alignment scheme, because it corrupts the information. Motion estimation cannot be used directly either, since the time interval between $p$ and $q$ may be several seconds and the viewpoint change may be large. Thus, we propose a new affine-transformation-based alignment module to align the patches:

(5) $\theta^{*}=\underset{\theta}{\operatorname{argmin}}\sum_{g}\left\|(x_{g},y_{g})-\mathcal{T}(x_{h^{*}},y_{h^{*}};\theta)\right\|_{2},$

where $\mathcal{T}$ represents the affine transformation and $\theta$ denotes its parameters. We use the RANSAC algorithm to obtain $\theta^{*}$, which minimizes the sum of Euclidean distances between the transformed spatial coordinates of $q^{\prime}$ and the target coordinates of $p$.
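A minimal sketch of this fit, assuming OpenCV's RANSAC-based affine estimator and matched block coordinates already converted to pixel coordinates `pts_q` (on the reference) and `pts_p` (on the query), both arrays of shape (N, 2); the function name `fit_affine` is ours.

```python
import cv2
import numpy as np

def fit_affine(pts_q, pts_p, thresh=3.0):
    # Estimate theta* mapping reference coordinates onto query coordinates,
    # with RANSAC rejecting mismatched block pairs.
    theta, inliers = cv2.estimateAffine2D(
        pts_q.astype(np.float32), pts_p.astype(np.float32),
        method=cv2.RANSAC, ransacReprojThreshold=thresh)
    return theta  # 2x3 affine matrix, or None if the fit fails
```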

We use a sequence of residual blocks (He et al., 2016) to extract the features $f_{q}$ of each $q^{\prime}$ and $f_{p}$ of $p$:

(6) $f_{q}^{*}=\mathcal{T}(f_{q};\theta^{*}).$

The feature maps $f_{q}^{*}$ are thus aligned to $p$ after the affine transformation $\mathcal{T}$.
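Applying the fitted transform to a reference feature map can be sketched as below, assuming a NumPy feature map of shape (C, H, W) warped channel by channel; a GPU implementation would use grid sampling instead, and the function name `warp_features` is ours.

```python
import cv2
import numpy as np

def warp_features(feat_q, theta, out_hw):
    """feat_q: (C, H, W) float32; theta: 2x3 affine matrix in pixel coordinates."""
    out_h, out_w = out_hw
    warped = np.stack([
        cv2.warpAffine(ch, theta, (out_w, out_h), flags=cv2.INTER_LINEAR)
        for ch in feat_q])
    return warped  # f_q^*, aligned to the query patch grid
```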

Comparison to feature swapping Feature swapping averages the transferred textures due to dense sampling, so the original information is corrupted. For visualization, we apply the two alignment schemes to images instead of feature maps. As shown in Fig. 4, with feature swapping the text is broken and blurry even though the reference patch contains clear and correct details.

Figure 4. Illustration of feature swapping versus affine transformation (ground truth, reference, feature swap, affine transform).

3.2.3. Multiple reference fusion

To fuse the transformed feature maps, we use a weight network constructed from residual blocks to predict a weight map for each $f_{q}^{*}$. This network takes the feature map $f_{p}$, the aligned feature maps $f_{q}^{*}$, the similarity map $S$, and the distance map $D$ as input, and a softmax function produces a weight map $w$ for each $f_{q}^{*}$:

(7) $f_{global}=\sum_{r=1}^{k}w_{r}\cdot f_{q,r}^{*},$

where $f^{*}_{q,r}$ is the aligned feature map of the $r$-th $q^{\prime}$.
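A sketch of Eq. (7) is given below. The weight network is abbreviated to a small convolutional stack; the module name `FusionNet`, channel widths, and layer counts are placeholders rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, feat_ch=64, k=3):
        super().__init__()
        # Input per reference: query feature, one aligned reference feature,
        # its similarity map and distance map, concatenated along channels.
        self.score = nn.Sequential(
            nn.Conv2d(2 * feat_ch + 2, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1))
        self.k = k

    def forward(self, f_p, f_q_list, S_list, D_list):
        logits = torch.cat([
            self.score(torch.cat([f_p, f_q, S, D], dim=1))
            for f_q, S, D in zip(f_q_list, S_list, D_list)], dim=1)  # (B, k, H, W)
        w = torch.softmax(logits, dim=1)                             # weight maps
        f_global = sum(w[:, r:r + 1] * f_q_list[r] for r in range(self.k))
        return f_global
```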

3.3. Implementation

Long-term and short-term feature aggregation We use a sequence of residual blocks to extract the features $f_{local}$ and $f_{p}$ of the local self-exemplars and $p$. Then $f_{global}$, $f_{local}$, and $f_{p}$ are concatenated and fed into an SR network to generate the SR output. The SR network contains 8 residual blocks.
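The aggregation step can be sketched as follows; the residual block count follows Sec. 4.2, while the channel width, the `ResBlock`/`SRNet` names, and the pixel-shuffle upsampling head are our assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class SRNet(nn.Module):
    def __init__(self, feat_ch=64, n_blocks=8, scale=4):
        super().__init__()
        # Fuse the concatenated query, local, and global features.
        self.fuse = nn.Conv2d(3 * feat_ch, feat_ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(feat_ch) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.Conv2d(feat_ch, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, f_p, f_local, f_global):
        x = self.fuse(torch.cat([f_p, f_local, f_global], dim=1))
        return self.up(self.blocks(x))
```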

During training, we adopt the Charbonnier Loss, which is defined as

(8) $L=\sqrt{\left\|\tilde{I}_{t}-I_{t}^{H}\right\|^{2}+\epsilon^{2}},$

where $\tilde{I}_{t}$ is the predicted result, $I_{t}^{H}$ is the ground truth, and $\epsilon$ is a small constant. $\tilde{I}_{t}$ is obtained by stitching together the SR results of the patches.
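A direct rendering of Eq. (8), written elementwise with a mean reduction (the reduction and the value of epsilon are our assumptions):

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    # Elementwise Charbonnier penalty, averaged over all pixels.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```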

Training strategy for data imbalance. The usefulness of the references depends on the spatial location of the LR patch. A region far from the camera is unlikely to have a very beneficial reference, since the distance between the background and the camera changes little: even though the camera is moving, the size of the background region varies only slightly. In contrast, regions near the camera or containing moving objects are likely to have good references. Because the background occupies a large portion of the image, the data distribution is imbalanced between regions with good references and regions with ordinary references: only $17\%$ of the patches have valid 2x reference patches in our dataset. If we train the network with the normal training strategy, it relies heavily on local references without utilizing long-term information.

Thus, we propose a special training strategy in which we randomly replace one reference patch with a slightly adjusted ground-truth patch. The ground-truth patch is perturbed by a small affine transformation for generalization. This balances the data distribution of the reference patches so that the network learns to draw important details from the long-term reference features. The probability of the replacement is empirically set to 0.3.
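The replacement step can be sketched as follows; the perturbation ranges and the helper name `maybe_inject_gt_reference` are illustrative assumptions.

```python
import random
import cv2
import numpy as np

def maybe_inject_gt_reference(references, gt_patch, p=0.3):
    if random.random() >= p:
        return references
    h, w = gt_patch.shape[:2]
    # Small random rotation/scale/shift so the network cannot rely on a
    # pixel-perfect reference.
    angle = random.uniform(-5, 5)
    scale = random.uniform(0.95, 1.05)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += np.random.uniform(-2, 2, size=2)
    warped_gt = cv2.warpAffine(gt_patch, M, (w, h), flags=cv2.INTER_CUBIC)
    references = list(references)
    references[random.randrange(len(references))] = warped_gt
    return references
```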

4. Experiments

4.1. Datasets

We use two datasets for evaluation: the CarCam dataset and the Waymo Open dataset (Sun et al., 2020).

CarCam dataset To demonstrate the generalization performance of our method, we collect a CarCam dataset that contains videos captured in different cities with different cameras. Our CarCam dataset contains 139 high-resolution video clips from YouTube. The videos are captured in multiple cities, including Hong Kong, Paris, Hollywood, and Chicago. Each sequence contains 60 frames, the frame size is $1920\times 1080$, and each clip is 10 seconds long, i.e., a frame rate of 6 fps. To evaluate the performance quantitatively, we downsample the videos by a factor of 4 using bicubic interpolation to obtain the LR videos.

Waymo Open dataset (Sun et al., 2020) We collect 100 sequences with rich textures from the Waymo Open dataset. Each sequence contains 50 frames, and the frame resolution is $1920\times 1280$. We use 70 sequences for training and 30 sequences for testing.

4.2. Implementation Details

Network settings In the patch retrieval process, the patch size and stride are $32\times 32$ and 24. All images are downsampled by $2\times$ for fast searching. The scale sequence $C$ is defined as $[1.2,1.4,1.7,2.1,2.5,2.9,3.5]$. The numbers of global references and local references are set to 3 and 2, respectively. The backbone of our network is a sequence of residual blocks; the feature extractor, global fusion net, local fusion net, and SR network contain 8, 4, 2, and 8 residual blocks, respectively.

Training details We train our network on 5 NVIDIA GeForce RTX 3090 GPUs with a batch size of 14 per GPU. Training takes 20 epochs for all datasets. We use Adam as the optimizer; the learning rate is initially set to 1e-4 and decays to 3e-5 after 15 epochs. The training data are augmented with random cropping, flipping, and rotation.

4.3. Evaluation

We compare our method with previous state-of-the-art methods, including PFNL (Yi et al., 2019), RBPN (Haris et al., 2019), TDAN (Tian et al., 2020), TGA (Isobe et al., 2020b), MuCAN (Li et al., 2020), and RSDN (Isobe et al., 2020a). For a fair comparison, we carefully re-train all the methods on the two datasets; for methods without public training code, we carefully re-implement them. The quantitative results are presented in Table 1. The evaluation metrics are SSIM, PSNR, and LPIPS (Zhang et al., 2018).
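For reference, the metric computation can be sketched with scikit-image and the `lpips` package (Zhang et al., 2018); the frame layout (HxWx3 uint8) and the AlexNet LPIPS backbone are assumptions of this sketch.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')

def evaluate(sr, hr):
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=2, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(sr), to_t(hr)).item()
    return psnr, ssim, lp
```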

On the CarCam and Waymo Open datasets, our method outperforms the other methods by at least 0.25 dB and 0.40 dB in PSNR, respectively. Our method also achieves better perceptual quality, surpassing the others by at least 0.002 and 0.011 in LPIPS. These results demonstrate the effectiveness of our method. Several examples are visualized in Figures 5 and 6: although the quality of the LR input is quite low, our method reconstructs high-frequency details by properly utilizing global self-exemplars, while previous methods cannot. More visual results are provided in the supplementary material.

Table 1. Quantitative evaluation of our approach and state-of-the-art video super-resolution methods.
Method | CarCam: SSIM\uparrow PSNR\uparrow LPIPS\downarrow | Waymo Open: SSIM\uparrow PSNR\uparrow LPIPS\downarrow
BICUBIC 0.790 25.78 0.393 0.890 31.70 0.303
PFNL (Yi et al., 2019) 0.882 28.50 0.164 0.934 34.81 0.155
RBPN (Haris et al., 2019) 0.886 28.64 0.158 0.937 35.10 0.149
TDAN (Tian et al., 2020) 0.870 28.02 0.178 0.926 34.10 0.167
TGA (Isobe et al., 2020b) 0.876 28.19 0.172 0.933 34.73 0.162
MUCAN (Li et al., 2020) 0.900 29.29 0.138 0.941 35.46 0.149
RSDN (Isobe et al., 2020a) 0.894 29.13 0.150 0.936 35.04 0.153
Ours 0.904 29.54 0.136 0.945 35.86 0.138
Figure 5. Visual comparisons among different super-resolution methods (BICUBIC, RBPN (Haris et al., 2019), TDAN (Tian et al., 2020), TGA (Isobe et al., 2020b), MuCAN (Li et al., 2020), RSDN (Isobe et al., 2020a), ours, and ground truth) on our CarCam dataset. More visual results are provided in the supplement.
Figure 6. Visual comparisons among different super-resolution methods (BICUBIC, RBPN (Haris et al., 2019), TDAN (Tian et al., 2020), TGA (Isobe et al., 2020b), MuCAN (Li et al., 2020), RSDN (Isobe et al., 2020a), ours, and ground truth) on the Waymo Open dataset.

4.4. Ablation study

We conduct an ablation study on several components to demonstrate the effectiveness of our proposed method.

Table 2. Ablation study conducted on the CarCam dataset. WG: without the global texture aggregation module; WT: without the proposed training strategy; WA: without affine-transformation-based alignment (feature swapping instead).
SSIM\uparrow PSNR\uparrow
WG 0.896 29.129
WT 0.898 29.247
WA 0.896 29.125
Full Model 0.904 29.542

Global Texture Aggregation Module. We build a baseline that does not utilize information in long-term frames (WG) by removing the global texture branch. As shown in Table 2, the baseline yields 29.129 dB in PSNR and 0.896 in SSIM. This baseline is slightly weaker than MuCAN (Li et al., 2020) because we adopt only a naive approach to process local self-exemplars and focus on how to exploit long-term self-exemplars. Utilizing global self-exemplars brings a 0.41 dB improvement in PSNR, which proves the effectiveness of the global texture aggregation module.

Then, we evaluate how the number of global self-exemplars $K$ and local self-exemplars $V$ affects performance, as shown in Table 3. For convenience, the number of training epochs is reduced to 3, and the number of local self-exemplars $V$ is fixed at 2. The performance of the global texture aggregation module rises at first and then drops as the number of global self-exemplars increases. This demonstrates that more global cues provide comprehensive and beneficial details; however, adding even more references introduces irrelevant examples, resulting in unwanted noise and performance degradation.

Table 3. Results with different numbers (K) of global self-exemplars on the CarCam dataset. The number (V) of local self-exemplars is fixed as 2.
K SSIM\uparrow PSNR\uparrow
1 0.892 28.922
2 0.892 28.941
3 0.893 28.994
4 0.893 28.965

We also conduct an experiment to investigate the balance between global self-exemplars and local self-exemplars, as shown in Table 4. This experiment is also conducted with the number of training epochs reduced to 3. For different values of $K$, we change $V$ from 2 to 4 to investigate the capability of our model. When the global cues are not adequate, adding local references helps recovery; however, when the larger-scale self-exemplars are sufficient, more local cues are redundant and unwanted.

Training Strategy We build a baseline (WT) that does not randomly use the ground truth as a global self-exemplar. As shown in Table 2, the proposed strategy improves the result by 0.295 dB. It resolves the data imbalance problem and helps the network exploit large-scale self-exemplars.

Affine transformation As indicated in Sec. 3.2.2, directly adopting the commonly used feature swapping (Zhang et al., 2019) results in content corruption, as shown in Fig. 4: since feature swapping unfolds a feature map into $3\times 3$ patches and overlaps the transferred textures, the clear content is broken. As shown in Table 2, using the affine transformation brings a 0.417 dB improvement.

4.5. Further Application

Because the short-term information fusion module in our method is relatively simpler than those in previous methods, combining the long-term information fusion module of our method with the local-information fusion modules of previous methods generates better super-resolution results. For convenience, we implement our method as a post-processing step on the results of a previous method. We conduct experiments with MuCAN (Li et al., 2020) on the proposed CarCam dataset, using the output of MuCAN as the input of our method. After our post-processing module, the PSNR increases from 29.29 dB to 29.42 dB, and the SSIM increases from 0.900 to 0.902. This shows that our method can be simply used as a refinement module for previous video super-resolution methods to improve their performance.

Table 4. Our results with different numbers (K) of global self-exemplars when changing the number (V) of local self-exemplars from 2 to 4 on the CarCam dataset. The numbers indicate the change in PSNR. \uparrow indicates a performance increase while \downarrow indicates performance degradation.
K PSNR change (V=2 → V=4)
1 0.044\uparrow
2 0.038\uparrow
3 -0.036\downarrow
Figure 7. Before: the SR result of MuCAN. After: the result after our post-processing.

5. Conclusion

The key contribution of our proposed method is the exploitation of long-term content across all the frames of a video, while previous methods focus on the fusion of short-term information. In this paper, we have proposed a novel video super-resolution method with long-term cross-scale aggregation that utilizes self-exemplars across distant frames. We propose a novel global texture aggregation module to select, align, and fuse features derived from similar patches, fuse the features of both long-term and short-term references, and propose a novel training strategy for the data imbalance problem. Extensive experiments demonstrate the effectiveness of our proposed method. Our method has many potential applications beyond the car camera scenario, such as hand-held videos, drone videos, and surveillance.

References

  • Brunelli (2009) Roberto Brunelli. 2009. Template Matching Techniques in Computer Vision: Theory and Practice. Wiley Publishing.
  • Caballero et al. (2017) Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR.
  • Chu et al. (2020) Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixe, and Nils Thuerey. 2020. Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation. In SIGGRAPH.
  • Dai et al. (2017) Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable Convolutional Networks. In ICCV.
  • Dong et al. (2014) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2014. Learning a Deep Convolutional Network for Image Super-Resolution. In ECCV.
  • Dong et al. (2016b) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2016b. Image Super-Resolution Using Deep Convolutional Networks. TPAMI (2016).
  • Dong et al. (2016a) Chao Dong, Chen Change Loy, and Xiaoou Tang. 2016a. Accelerating the Super-Resolution Convolutional Neural Network. In ECCV.
  • Freeman et al. (2002) William T. Freeman, Thouis R. Jones, and Egon C. Pasztor. 2002. Example-Based Super-Resolution. IEEE Computer Graphics and Applications 22, 2 (2002), 56–65. https://doi.org/10.1109/38.988747
  • Glasner et al. (2009) Daniel Glasner, Shai Bagon, and Michal Irani. 2009. Super-resolution from a single image. In ICCV.
  • Haris et al. (2018) Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. 2018. Deep Back-Projection Networks for Super-Resolution. In CVPR.
  • Haris et al. (2019) Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. 2019. Recurrent Back-Projection Network for Video Super-Resolution. In CVPR.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.
  • Huang et al. (2015a) Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. 2015a. Single image super-resolution from transformed self-exemplars. In CVPR.
  • Huang et al. (2015b) Yan Huang, Wei Wang, and Liang Wang. 2015b. Bidirectional recurrent convolutional networks for multi-frame super-resolution. In NeurIPS.
  • Isobe et al. (2020a) Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. 2020a. Video Super-Resolution with Recurrent Structure-Detail Network. In ECCV.
  • Isobe et al. (2020b) Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory G. Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, and Qi Tian. 2020b. Video Super-Resolution With Temporal Group Attention. In CVPR.
  • Jo et al. (2018) Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. 2018. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation. In CVPR.
  • Kim et al. (2016) Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Deeply-Recursive Convolutional Network for Image Super-Resolution. In CVPR.
  • Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In CVPR.
  • Li et al. (2020) Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, and Jiaya Jia. 2020. MuCAN: Multi-correspondence Aggregation Network for Video Super-Resolution. In ECCV.
  • Liao et al. (2015) Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. 2015. Video Super-Resolution via Deep Draft-Ensemble Learning. In ICCV.
  • Lim et al. (2017) Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. 2017. Enhanced Deep Residual Networks for Single Image Super-Resolution. In CVPRW.
  • Ma et al. (2015) Ziyang Ma, Renjie Liao, Xin Tao, Li Xu, Jiaya Jia, and Enhua Wu. 2015. Handling Motion Blur in Multi-Frame Super-Resolution. In CVPR.
  • Pérez-Pellitero et al. (2016) Eduardo Pérez-Pellitero, Jordi Salvador, Javier Ruiz Hidalgo, and Bodo Rosenhahn. 2016. PSyCo: Manifold Span Reduction for Super Resolution. In CVPR.
  • Sajjadi et al. (2018) Mehdi S. M. Sajjadi, Raviteja Vemulapalli, and Matthew Brown. 2018. Frame-Recurrent Video Super-Resolution. In CVPR.
  • Schulter et al. (2015) Samuel Schulter, Christian Leistner, and Horst Bischof. 2015. Fast and accurate image upscaling with super-resolution forests. In CVPR.
  • Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In CVPR.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
  • Sun et al. (2020) Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR.
  • Tao et al. (2017) Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. 2017. Detail-Revealing Deep Video Super-Resolution. In ICCV.
  • Tian et al. (2020) Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. 2020. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In CVPR.
  • Timofte et al. (2013) Radu Timofte, Vincent De Smet, and Luc Van Gool. 2013. Anchored Neighborhood Regression for Fast Example-Based Super-Resolution. In ICCV.
  • Timofte et al. (2014) Radu Timofte, Vincent De Smet, and Luc Van Gool. 2014. A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution. In ACCV.
  • Wang et al. (2018a) Longguang Wang, Yulan Guo, Zaiping Lin, Xinpu Deng, and Wei An. 2018a. Learning for Video Super-Resolution through HR Optical Flow Estimation. In ACCV.
  • Wang et al. (2019) Xintao Wang, Kelvin C. K. Chan, Ke Yu, Chao Dong, and Chen Change Loy. 2019. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In CVPRW.
  • Wang et al. (2018b) Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. 2018b. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In ECCV.
  • Xue et al. (2019) Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. 2019. Video Enhancement with Task-Oriented Flow. IJCV (2019).
  • Yang et al. (2010a) Chih-Yuan Yang, Jia-Bin Huang, and Ming-Hsuan Yang. 2010a. Exploiting Self-similarities for Single Frame Super-Resolution. In ACCV.
  • Yang et al. (2020) Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, and Baining Guo. 2020. Learning Texture Transformer Network for Image Super-Resolution. In CVPR.
  • Yang et al. (2012) Jianchao Yang, Zhaowen Wang, Zhe Lin, Scott Cohen, and Thomas S. Huang. 2012. Coupled Dictionary Training for Image Super-Resolution. IEEE Trans. Image Process. 21, 8 (2012), 3467–3478. https://doi.org/10.1109/TIP.2012.2192127
  • Yang et al. (2010b) Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. 2010b. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 19, 11 (2010), 2861–2873. https://doi.org/10.1109/TIP.2010.2050625
  • Yi et al. (2019) Peng Yi, Zhongyuan Wang, Kui Jiang, Junjun Jiang, and Jiayi Ma. 2019. Progressive Fusion Video Super-Resolution Network via Exploiting Non-Local Spatio-Temporal Correlations. In ICCV.
  • Yue et al. (2013) Huanjing Yue, Xiaoyan Sun, Jingyu Yang, and Feng Wu. 2013. Landmark Image Super-Resolution by Retrieving Web Images. IEEE Trans. Image Process. 22, 12 (2013), 4865–4878. https://doi.org/10.1109/TIP.2013.2279315
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR.
  • Zhang et al. (2019) Zhifei Zhang, Zhaowen Wang, Zhe L. Lin, and Hairong Qi. 2019. Image Super-Resolution by Neural Texture Transfer. In CVPR.
  • Zheng et al. (2018) Haitian Zheng, Mengqi Ji, Haoqian Wang, Yebin Liu, and Lu Fang. 2018. CrossNet: An End-to-End Reference-Based Super Resolution Network Using Cross-Scale Warping. In ECCV.