
Sub-Aperture Feature Adaptation in Single Image Super-resolution Model for Light Field Imaging

Abstract

With the availability of commercial Light Field (LF) cameras, LF imaging has emerged as a promising technology in computational photography. However, the spatial resolution of commercial micro-lens-based LF cameras is significantly constrained because of the inherent multiplexing of spatial and angular information, and this limitation becomes the main bottleneck for other applications of light field cameras. This paper proposes an adaptation module in a pre-trained Single Image Super-Resolution (SISR) network to leverage the powerful SISR model instead of using highly engineered, light-field-specific super-resolution models. The adaptation module consists of a Sub-Aperture Shift block and a fusion block; inserted into the SISR network, it further exploits the spatial and angular information in LF images to improve super-resolution performance. Experimental validation shows that the proposed method outperforms existing light field super-resolution algorithms. It also achieves PSNR gains of more than 1 dB across all datasets compared to the same pre-trained SISR models for scale factor 2, and PSNR gains of 0.6–1 dB for scale factor 4.

Index Terms—  Light field, sub-aperture feature, super-resolution

1 Introduction

A Light Field (LF) camera not only provides spatial information but also captures the angular information of the incoming light from a scene point. This enables the LF camera to improve performance in different applications such as depth estimation [1], image segmentation [2], image editing [3], and many more. These techniques can be further improved if an image of higher spatial resolution is available. In an LF camera, the multiplexing of angular and spatial information results in poor spatial resolution of Sub-Aperture (SA) images. For example, the Lytro Illum sensor resolution is 40 MP, but a sub-aperture image's spatial resolution is close to 0.1 MP. Therefore, it is necessary to achieve Super-Resolution (SR) in LF images by exploiting the additional angular information present in the LF data.

Fig. 1: Feature extraction and upscale module are from any pre-trained SISR model. The adaptation module aims to learn more information from multiple sub-aperture images. During training, the weights of the adaptation module are updated (shown as an ‘unlocked’ symbol), and the weights of two pre-trained modules are fixed (shown using ‘locked’ symbols).

Recently, Light Field Super-Resolution (LFSR) has been an area of active research, and a considerable amount of work has been done in the last few years. The earlier works were mainly based on Bayesian or variational optimization frameworks with different priors, such as a variational model [4], a Gaussian mixture model [5], and a PCA analysis model [6]. These methods are inefficient in exploiting the spatio-angular information of light field data. In contrast, learning-based methods have been proposed to achieve SR via cascaded convolutions and data-driven training. Single Image Super-Resolution (SISR) methods [7, 8, 9, 10] are becoming increasingly deep and complex, with improved capability to exploit spatial information. However, the angular information of LF images remains unexploited in SISR networks, resulting in limited performance. Inspired by learning-based methods in SISR and in pursuit of exploiting the angular information, recent LFSR methods [11, 12, 13, 14] adopted deep Convolutional Neural Networks (CNNs) to improve SR performance.

This paper proposes a novel Light Field Sub-Aperture Feature Adaptation (LFSAFA) module and inserts it into a pre-trained single image super-resolution model to achieve LFSR. LFSAFA exploits the angular information present in the SA images of LF data to improve the performance of LFSR. The proposed module consists of Sub-Aperture Shift (SAS) and feature fusion blocks. SAS blocks process the Sub-Aperture (SA) features, and the fusion block combines those features. The modulated SA features are then fed to the upscaling network to reconstruct high-resolution images. Our experimental validation shows that pre-trained SISR models with simple LF-specific modifications can perform better than highly engineered, light-field-specific super-resolution models. To summarize, the contributions of this work are as follows.

  • We propose a light-field domain adaptation module to achieve LFSR using SISR models. To the best of our knowledge, this is the first work in this direction.

  • We show that the proposed module can utilize angular information present in SA images to improve the performance, and ablation studies support our claims.

  • Our qualitative and quantitative analysis shows that the performance of our method is better than light-field domain-specific super-resolution solutions, and any SISR model can adopt our proposed modification to work for LFSR.

2 Related Work

Due to the advancement of deep learning architectures and algorithms, the LFSR domain has witnessed tremendous progress. Yoon et al. [15] super-resolved Sub-Aperture Images (SAIs) via CNN and then fine-tuned using angular information to enhance both spatial and angular resolutions. LF-DCNN [16] super-resolved each SAI via a more powerful SISR network and fine-tuned the initial results using an EPI-enhancement network. LFNet [17] proposed a bidirectional recurrent network by extending BRCNN [18]. Zhang et al. [11] proposed a multi-stream residual network (resLF) by stacking SAIs as inputs to super-resolve the center-view SAI. LFSSR [12] alternately shuffles LF features between the SAI pattern and the macro-pixel image pattern for convolution. Jin et al. [19] proposed an all-to-one geometry-aware method using structural consistency regularization that preserves the parallax structure among reconstructed views. LF-InterNet [13] used spatial-angular information interaction for LFSR. LF-DFnet [14] performed feature alignment using an angular deformable alignment module (ADAM). MEG-Net [20] considered multiple epipolar geometries, and all views are super-resolved simultaneously in an end-to-end manner. DPT [21] treated SAIs as a sequence and introduced a detail-preserving transformer to learn geometric dependencies among those sequences.

Fig. 2: The proposed sub-aperture feature adaptation module consists of $n$ SAS modules and 1 fusion module. $f_i$ is the feature extracted from a sub-aperture image using the pre-trained model $F_{feat}$, and $f_i'$ is the modulated feature enriched with information acquired from the other sub-aperture images. 'Conv, $a$, $b$, $k$' represents a 2D convolution with $a$ input channels, $b$ output channels, and kernel size $k$.

3 Methodology

3.1 Motivation and Problem Formulation

Well-developed research in the single-image super-resolution domain has led to extraordinary performance on single images. Images captured by an LF camera can be up-scaled by pre-trained SISR models by treating each sub-aperture image as a single image. However, this fails to exploit the angular and disparity information present across multiple SAIs. Our main objective is to propose an LF domain-specific module on top of a SISR model to achieve LFSR. Recently developed SISR models $F_{SR}$ can be divided into two parts: a feature extraction module $F_{feat}$ and an upscaling cum reconstruction module $F_{up}$. $F_{feat}$ extracts the salient features from a single image, which are then up-scaled by $F_{up}$. Our objective is to introduce a module that modulates the features extracted by $F_{feat}$ by exploiting angular information across SAIs.
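
As a concrete illustration, the PyTorch-style sketch below shows this two-part view of a pre-trained SISR model; the class and attribute names are hypothetical, and how $F_{feat}$ and $F_{up}$ are separated in a specific EDSR or RDN implementation depends on that implementation's layer layout.

```python
import torch
import torch.nn as nn

class DecomposedSISR(nn.Module):
    """Wraps a pre-trained SISR network as F_up(F_feat(x))."""
    def __init__(self, feature_extractor: nn.Module, upscaler: nn.Module):
        super().__init__()
        self.F_feat = feature_extractor   # deep feature extraction trunk
        self.F_up = upscaler              # upscaling + reconstruction head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.F_feat(x)    # per-image feature maps (e.g. 64 channels for EDSR/RDN)
        return self.F_up(f)   # high-resolution reconstruction
```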

3.2 Light-field Sub-aperture Feature Adaptation Module

Using our proposed adaptation module, the pre-trained SISR model adapts to the spatial and angular information present in LF images, which further improves super-resolution performance, as shown in Fig. 1. The feature extraction and upscale modules come from any pre-trained SISR model, and we place the adaptation module between them. Fig. 2 shows the proposed light-field domain information adaptation module. The light-field camera captures multiple SA images of the same scene, and in this work the rich information present in those angular SA images is utilized to enhance the quality of each SA image. $f_i$ is the feature set extracted from a sub-aperture image $I_i$ using the pre-trained feature extraction module $F_{feat}$ of a SISR model, and $f_1, f_2, \ldots, f_n$ are the features extracted from $n$ different sub-aperture images. Each sub-aperture feature set is processed through its corresponding Sub-Aperture Shift (SAS) module, which is expected to process the features in such a way that it improves performance on the sub-aperture image $I_i$ at hand. The term shift is used because each SAS module is expected to align the features from a different sub-aperture image towards the SA feature set at hand. The fusion block $F_s$ fuses the processed SA features and feeds the fused features $f_i'$ into the upscale module. During training, the weights of the pre-trained feature extraction and upscale modules are not updated during gradient back-propagation; we only update the weights of the adaptation module, as shown in Fig. 1.
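
A minimal sketch of this setup, assuming PyTorch and that F_feat, F_up (pre-trained, frozen), and lfsafa (the trainable adaptation module) are already instantiated module objects; the function names here are placeholders, not the authors' code.

```python
def freeze(module: nn.Module) -> None:
    """Lock pre-trained weights so back-propagation leaves them untouched."""
    for p in module.parameters():
        p.requires_grad = False

freeze(F_feat)   # 'locked' in Fig. 1
freeze(F_up)     # 'locked' in Fig. 1

def super_resolve(sub_aperture_images, i):
    """Super-resolve the i-th sub-aperture view using features from all views."""
    features = [F_feat(img) for img in sub_aperture_images]  # frozen per-view features
    f_i_adapted = lfsafa(features, i)                        # trainable angular adaptation
    return F_up(f_i_adapted)                                 # frozen upscaling/reconstruction
```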

SAS Module: If we consider a light-field image of $a \times a$ angular resolution, there are $n = a^2$ SA images. Therefore, we have $n$ SAS modules for the $n$ SA images. Each module takes its corresponding extracted SA features concatenated with a difference feature map. The difference feature map represents the difference between the SA feature set at hand and the SA feature set of that corresponding SAS module. It helps to shift or modify a SA feature set in such a way that it improves the performance of the SA feature set at hand; in effect, the difference map acts as a modulator that decides how much shift is required for a SA feature for pixel-wise mapping. The output of a SAS module can be mathematically represented as

$f_i^j = SAS_j\big([\,f_j,\ f_i - f_j\,]\big), \quad j \in \{1, 2, \ldots, n\}$   (1)

$f_i$ is the extracted feature of the $i$-th angular SA image, which will be super-resolved, and $f_j$ is the extracted feature of the $j$-th angular SA image. Both $f_j$ and $f_i - f_j$ are concatenated before being fed into the $SAS_j$ module. This module is expected to shift the features $f_j$ and align them with $f_i$ using the difference map $f_i - f_j$. All the SAS modules align their features with $f_i$ and feed them into the fusion module.
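
A possible PyTorch realization of Eq. (1) is sketched below, continuing the earlier sketch; the exact residual-block design and layer widths are assumptions (the paper's architecture details follow in Sec. 3.3), with $C_i$ input channels and $C_x$ internal channels.

```python
class ResBlock(nn.Module):
    """A plain residual block with two 3x3 convolutions (one plausible choice)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class SAS(nn.Module):
    """Sub-Aperture Shift block: maps [f_j, f_i - f_j] towards features aligned with f_i."""
    def __init__(self, c_i: int = 64, c_x: int = 32):
        super().__init__()
        self.head = nn.Conv2d(2 * c_i, c_x, kernel_size=3, padding=1)   # fuse feature + difference map
        self.body = nn.Sequential(*[ResBlock(c_x) for _ in range(3)])   # three residual blocks

    def forward(self, f_i: torch.Tensor, f_j: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_j, f_i - f_j], dim=1)   # Eq. (1): concatenate f_j with the difference map
        return self.body(self.head(x))
```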

Fusion Module: All the modulated SA features are fused together. It is mathematically expressed as

$f_i' = f_i + F_s\big([\,f_i^1, f_i^2, \ldots, f_i^n\,]\big)$   (2)

$F_s$ is the fusion module, and $f_i'$ is the sub-aperture informative fused feature corresponding to the $i$-th SA feature $f_i$.
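
Continuing the same sketch, the fusion block of Eq. (2) could look as follows; the channel counts and kernel sizes follow the description in Sec. 3.3 but remain assumptions.

```python
class Fusion(nn.Module):
    """Fusion block F_s: blends all shifted SA features and adds them back to f_i."""
    def __init__(self, n_views: int, c_x: int = 32, c_i: int = 64):
        super().__init__()
        self.blend = nn.Conv2d(n_views * c_x, c_i, kernel_size=1)      # 1x1 blending convolution
        self.refine = nn.Conv2d(c_i, c_i, kernel_size=3, padding=1)    # process the fused features

    def forward(self, f_i: torch.Tensor, shifted: list) -> torch.Tensor:
        x = torch.cat(shifted, dim=1)              # stack the n modulated SA feature sets
        return f_i + self.refine(self.blend(x))    # Eq. (2): residual connection around F_s
```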

3.3 Architecture Details

Each SAS module consists of one convolution block and three consecutive residual blocks, as shown in Fig. 2. The fusion block contains two convolution layers: the first blends all the SA features using a 1×1 convolution, and the second processes the fused features. We consider the popular RDN [9] and EDSR [8] SISR architectures for our experiments. In both cases, the feature set extracted from the feature extraction block contains 64 feature maps. Therefore, we set $C_i = 64$ and $C_x = 32$ in our experimental setup, as shown in Fig. 2.
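
Putting the pieces together, a hedged sketch of the complete adaptation module with $C_i = 64$, $C_x = 32$, and a 5×5 angular resolution ($n = 25$ views) might be:

```python
class LFSAFA(nn.Module):
    """Sub-aperture feature adaptation: one SAS block per view plus one fusion block."""
    def __init__(self, n_views: int = 25, c_i: int = 64, c_x: int = 32):
        super().__init__()
        self.sas = nn.ModuleList([SAS(c_i, c_x) for _ in range(n_views)])
        self.fusion = Fusion(n_views, c_x, c_i)

    def forward(self, features: list, i: int) -> torch.Tensor:
        f_i = features[i]                                                # view to be super-resolved
        shifted = [self.sas[j](f_i, f_j) for j, f_j in enumerate(features)]
        return self.fusion(f_i, shifted)                                 # f_i' fed to the upscale module

# Toy usage: adapt the centre view of a 5x5 light field with 64-channel features.
lfsafa = LFSAFA(n_views=25)
features = [torch.randn(1, 64, 32, 32) for _ in range(25)]   # placeholder per-view feature maps
f_centre = lfsafa(features, i=12)                            # shape: (1, 64, 32, 32)
```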

Table 1: PSNR/SSIM values achieved by different methods for 2× and 4× SR. Our results are those of LFSAFA-EDSR and LFSAFA-RDN.

| Method | Scale | EPFL [22] | INRIA [23] | STFgantry [24] |
| Bicubic | 2× | 29.50/0.935 | 31.10/0.956 | 30.82/0.947 |
| VDSR [7] | 2× | 32.50/0.960 | 34.43/0.974 | 35.54/0.979 |
| RCAN [10] | 2× | 33.16/0.964 | 35.01/0.977 | 36.33/0.983 |
| resLF [11] | 2× | 32.75/0.967 | 34.57/0.978 | 36.89/0.987 |
| LFSSR [12] | 2× | 33.69/0.975 | 35.27/0.983 | 38.07/0.990 |
| LF-InterNet [13] | 2× | 34.14/0.976 | 35.80/0.985 | 38.72/0.992 |
| LF-DFnet [14] | 2× | 34.44/0.977 | 36.36/0.984 | 39.61/0.994 |
| MEG-Net [20] | 2× | 34.31/0.977 | 36.10/0.985 | 38.77/0.992 |
| DPT [21] | 2× | 34.49/0.976 | 36.41/0.984 | 39.43/0.993 |
| LFSAFA-EDSR | 2× | 35.08/0.973 | 37.51/0.983 | 38.69/0.990 |
| LFSAFA-RDN | 2× | 35.19/0.974 | 37.64/0.983 | 39.02/0.991 |
| Bicubic | 4× | 25.14/0.831 | 26.82/0.886 | 25.93/0.843 |
| VDSR [7] | 4× | 27.25/0.878 | 29.19/0.921 | 28.51/0.901 |
| RCAN [10] | 4× | 27.88/0.886 | 29.76/0.927 | 28.90/0.911 |
| resLF [11] | 4× | 27.46/0.890 | 29.64/0.934 | 28.99/0.921 |
| LFSSR [12] | 4× | 28.27/0.908 | 30.31/0.945 | 30.15/0.939 |
| LF-InterNet [13] | 4× | 28.67/0.914 | 30.64/0.949 | 30.53/0.943 |
| LF-DFnet [14] | 4× | 28.77/0.917 | 30.83/0.950 | 31.15/0.949 |
| MEG-Net [20] | 4× | 28.75/0.916 | 30.67/0.949 | 30.77/0.945 |
| DPT [21] | 4× | 28.94/0.917 | 30.96/0.950 | 31.15/0.949 |
| LFSAFA-EDSR | 4× | 29.47/0.909 | 31.88/0.945 | 30.41/0.937 |
| LFSAFA-RDN | 4× | 29.62/0.911 | 32.06/0.947 | 30.80/0.941 |

4 Experiments

4.1 Implementation Details

We use images from 5 publicly available LF datasets, namely EPFL [22], HCInew [25], HCIold [26], INRIA [23], and STFgantry [24], and follow the same train-test split as given by [14]. There are a total of 144 training images in the dataset, and we consider the standard 5×5 angular resolution for benchmark analysis. For testing, we use real-world images from the EPFL [22], INRIA [23], and STFgantry [24] datasets, which consist of 10, 5, and 2 test images, respectively. EDSR [8] and RDN [9] are the base SISR models into which we insert our proposed LFSAFA module to adapt them for light-field imaging. Bicubic downsampling generates low-resolution (LR) images from their high-resolution (HR) counterparts. During training, we extract random 32×32 crops from the LR images and augment each patch using a random 90° rotation together with random horizontal and vertical flips. We train the LFSAFA module for 250 epochs, and each epoch consists of ~1000 batch updates with a batch size of 4. The Adam optimizer with a learning rate of $10^{-4}$ is used for updating the weights, and the learning rate is reduced by a factor of 0.5 after every 50 epochs. The mean absolute difference between the output reconstruction and the HR image is employed as the loss function. Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are calculated on all sub-aperture views for comparative performance analysis; larger values imply better reconstruction. Following the trend in the LFSR domain, PSNR and SSIM are calculated on the luminance channel Y of an image in YCbCr space. The code is available at https://aupendu.github.io/LFSAFA-SR.
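
A condensed sketch of this optimization schedule (PyTorch names; `loader` and `super_resolve` are placeholders for the data pipeline and the frozen-backbone forward pass described above, not the released code):

```python
optimizer = torch.optim.Adam(lfsafa.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)  # halve LR every 50 epochs
criterion = nn.L1Loss()   # mean absolute difference between reconstruction and HR target

for epoch in range(250):
    for lr_views, hr_view, i in loader:       # ~1000 batches per epoch, batch size 4
        sr = super_resolve(lr_views, i)       # frozen F_feat/F_up + trainable LFSAFA
        loss = criterion(sr, hr_view)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```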

4.2 Performance Analysis

We compare the performance of our proposed approach with state-of-the-art single image super-resolution and light field super-resolution models. VDSR [7] and RCAN [10] are SISR models, and the rest of the methods in Table 1 are LFSR models. All the SISR models are trained on light field training images for a fair comparison. We can observe from Table 1 that both SISR models with our proposed modification outperform all existing techniques in terms of PSNR on the EPFL and INRIA datasets. The proposed method does not outperform the other methods in the SSIM metric; however, its SSIM values are very close to the best. Fig. 3 shows a qualitative comparison of our proposed algorithm with existing LFSR approaches. We observe that our proposed algorithm achieves a more satisfactory reconstruction of the numbers in the first-row image, excellently reconstructs the round holes in the second-row image, and adequately preserves the line structure on the left side of the third-row image.

[Fig. 3 image panels omitted in this text version. Each row of the figure shows: (a) Whole Image, (b) Cropped HR, (c) VDSR, (d) RCAN, (e) LF-InterNet, (f) MEG-Net, (g) DPT, (h) Ours.]
Fig. 3: Qualitative comparison of our proposed LFSAFA-RDN with the existing LFSR algorithms for 4× SR.
Table 2: Model ablation studies of our proposed LFSAFA module and the effect of angular resolution on the reconstruction performance. All the experiments are performed on the LFSAFA-RDN variant for 2× SR.
| | Residual Connection | Difference Feature | Adaptation Module | Angular Resolution | PSNR/SSIM |
| Ablation-1 | – | – | – | 3×3 | 34.10/0.9662 |
| Ablation-2 | – | – | ✓ | 3×3 | 34.17/0.9664 |
| Ablation-3 | – | ✓ | ✓ | 3×3 | 34.81/0.9711 |
| Ablation-4 | ✓ | ✓ | ✓ | 3×3 | 34.86/0.9715 |
| Ablation-5 | ✓ | ✓ | ✓ | 5×5 | 35.19/0.9737 |

4.3 Ablation Studies

Table 2 shows the model ablation studies of our proposed LFSAFA module, along with a study on the effect of angular resolution. All the components are the same for Ablation-4 and Ablation-5 except the angular resolution of the input LF image. We can observe from the table that the network's performance in terms of PSNR and SSIM increases as the angular resolution increases: the LFSAFA module receives more angular information, which leads to better performance. This also supports our claim that the module exploits sub-aperture angular information. The other ablations show the effectiveness of three different parts of the LFSAFA module: the residual connection between input and output, the inclusion of difference features in each SAS module, and the contribution of the whole proposed adaptation module. The only difference between Ablation-3 and Ablation-4 is the residual connection; performance improves slightly with that connection, and we also observe that the network converges faster. Ablation-1 is essentially the SISR model without the LFSAFA module. The performance does not improve much even if we add the proposed adaptation module without the difference feature, as shown in Ablation-2. Comparing Ablation-2 and Ablation-4, there is a significant jump in the performance metrics; therefore, the difference feature plays a crucial role in utilizing the sub-aperture information in a controlled manner. Table 3 summarizes the main contribution of this paper: the LFSAFA module helps both SISR models adapt to the extra angular information specific to the LF domain. We observe a significant improvement of ~1 dB in PSNR across all datasets and models for 2× SR, and ~0.6–1 dB for 4× SR.

Table 3: Comparative analysis of our proposed LFSAFA module-based LFSR models with their SISR counterparts.
| Scale | Dataset | RDN | LFSAFA-RDN | EDSR | LFSAFA-EDSR |
| 2× | EPFL | 34.14/0.966 | 35.19/0.974 | 34.06/0.966 | 35.08/0.973 |
| 2× | INRIA | 36.42/0.978 | 37.64/0.983 | 36.28/0.978 | 37.51/0.983 |
| 2× | STFgantry | 37.91/0.987 | 39.02/0.991 | 37.48/0.986 | 38.69/0.990 |
| 4× | EPFL | 28.84/0.898 | 29.62/0.911 | 28.73/0.895 | 29.47/0.909 |
| 4× | INRIA | 31.06/0.935 | 32.06/0.947 | 30.92/0.933 | 31.88/0.945 |
| 4× | STFgantry | 30.18/0.932 | 30.80/0.941 | 29.75/0.926 | 30.41/0.937 |

5 Conclusion

In this work, we propose a module that turns a SISR model into an LFSR model. The proposed module can be used with all recently developed SISR models without architectural modifications. This paper presents a new research direction in the LFSR domain, which we hope will drive the community to develop better sub-aperture feature adaptation modules; a more powerful adaptation module can improve the performance further in the future.

References

  • [1] Jiayong Peng, Zhiwei Xiong, Yicheng Wang, Yueyi Zhang, and Dong Liu, “Zero-shot depth estimation from light field using a convolutional neural network,” IEEE TCI, vol. 6, pp. 682–696, 2020.
  • [2] Numair Khan, Qian Zhang, Lucas Kasser, Henry Stone, Min H Kim, and James Tompkin, “View-consistent 4d light field superpixel segmentation,” in ICCV, 2019, pp. 7811–7819.
  • [3] Ki Won Shon, In Kyu Park, et al., “Spatio-angular consistent editing framework for 4d light field images,” Multimedia Tools and Applications, vol. 75, no. 23, pp. 16615–16631, 2016.
  • [4] Sven Wanner and Bastian Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE TPAMI, vol. 36, no. 3, pp. 606–619, 2013.
  • [5] Kaushik Mitra and Ashok Veeraraghavan, “Light field denoising, light field superresolution and stereo camera based refocussing using a gmm light field patch prior,” in CVPRW. IEEE, 2012, pp. 22–28.
  • [6] Reuben A Farrugia, Christian Galea, and Christine Guillemot, “Super resolution of light field images using linear subspace projection of patch-volumes,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 1058–1071, 2017.
  • [7] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016, pp. 1646–1654.
  • [8] Bee Lim, Sanghyun Son, et al., “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017, pp. 136–144.
  • [9] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu, “Residual dense network for image super-resolution,” in CVPR, 2018, pp. 2472–2481.
  • [10] Yulun Zhang, Kunpeng Li, et al., “Image super-resolution using very deep residual channel attention networks,” in ECCV, 2018, pp. 286–301.
  • [11] Shuo Zhang, Youfang Lin, and Hao Sheng, “Residual networks for light field image super-resolution,” in CVPR, 2019, pp. 11046–11055.
  • [12] Henry Wing Fung Yeung, Junhui Hou, et al., “Light field spatial super-resolution using deep efficient spatial-angular separable convolution,” IEEE TIP, vol. 28, no. 5, pp. 2319–2330, 2018.
  • [13] Yingqian Wang, Longguang Wang, et al., “Spatial-angular interaction for light field image super-resolution,” in ECCV, 2020.
  • [14] Yingqian Wang, Jungang Yang, et al., “Light field image super-resolution using deformable convolution,” IEEE TIP, vol. 30, pp. 1057–1071, 2020.
  • [15] Youngjin Yoon, Hae-Gon Jeon, et al., “Light-field image super-resolution using convolutional neural network,” IEEE SPL, vol. 24, no. 6, pp. 848–852, 2017.
  • [16] Yan Yuan, Ziqi Cao, and Lijuan Su, “Light-field image superresolution using a combined deep cnn based on epi,” IEEE SPL, vol. 25, no. 9, pp. 1359–1363, 2018.
  • [17] Yunlong Wang, Fei Liu, et al., “Lfnet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution,” IEEE TIP, vol. 27, no. 9, pp. 4274–4286, 2018.
  • [18] Yan Huang, Wei Wang, and Liang Wang, “Bidirectional recurrent convolutional networks for multi-frame super-resolution,” in NeurIPS, 2015, pp. 235–243.
  • [19] Jing Jin, Junhui Hou, Hui Yuan, and Sam Kwong, “Learning light field angular super-resolution via a geometry-aware network,” in AAAI, 2020.
  • [20] Shuo Zhang, Song Chang, and Youfang Lin, “End-to-end light field spatial super-resolution network using multiple epipolar geometry,” IEEE TIP, vol. 30, pp. 5956–5968, 2021.
  • [21] Shunzhou Wang, Tianfei Zhou, Yao Lu, and Huijun Di, “Detail-preserving transformer for light field image super-resolution,” in AAAI, 2022.
  • [22] Martin Rerabek and Touradj Ebrahimi, “New light field image dataset,” in International Conference on Quality of Multimedia Experience (QoMEX), 2016.
  • [23] Mikael Le Pendu, Xiaoran Jiang, and Christine Guillemot, “Light field inpainting propagation via low rank matrix completion,” IEEE TIP, vol. 27, no. 4, pp. 1981–1993, 2018.
  • [24] Vaibhav Vaish and Andrew Adams, “The (new) stanford light field archive,” Computer Graphics Laboratory, Stanford University, vol. 6, no. 7, 2008.
  • [25] Katrin Honauer, Ole Johannsen, Daniel Kondermann, and Bastian Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in ACCV. Springer, 2016, pp. 19–34.
  • [26] Sven Wanner, Stephan Meister, and Bastian Goldluecke, “Datasets and benchmarks for densely sampled 4d light fields.,” in Vision, Modelling and Visualization (VMV). Citeseer, 2013, vol. 13, pp. 225–226.