
Light Field Image Quality Assessment with Auxiliary Learning based on Depthwise and Anglewise Separable Convolutions

Qiang Qu, Xiaoming Chen, Vera Chung, and Zhibo Chen
Date of publication: 21 December 2021
Abstract

In multimedia broadcasting, no-reference image quality assessment (NR-IQA) is used to indicate the user-perceived quality of experience (QoE) and to support intelligent data transmission while optimizing the user experience. This paper proposes an improved no-reference light field image quality assessment (NR-LFIQA) metric for future immersive media broadcasting services. First, we extend the concept of depthwise separable convolution (DSC) to the spatial domain of light field images (LFIs) and introduce “light field depthwise separable convolution (LF-DSC)”, which can extract an LFI’s spatial features efficiently. Second, we further theoretically extend LF-DSC to the angular space of LFI and introduce the novel concept of “light field anglewise separable convolution (LF-ASC)”, which is capable of extracting both spatial and angular features for comprehensive quality assessment at low complexity. Third, we define the spatial and angular feature estimations as auxiliary tasks that aid the primary NR-LFIQA task by providing spatial and angular quality features as hints. To the best of our knowledge, this work is the first exploration of deep auxiliary learning with spatial-angular hints for NR-LFIQA. Experiments were conducted on mainstream LFI datasets, namely Win5-LID and SMART, with comparisons to mainstream full-reference IQA metrics as well as state-of-the-art NR-LFIQA methods. The experimental results show that the proposed metric yields overall 42.86% and 45.95% smaller prediction errors than the second-best benchmarking metric on Win5-LID and SMART, respectively. In some challenging cases with particular distortion types, the proposed metric can reduce the errors significantly by more than 60%.

Index Terms:
No-reference Image Quality Assessment, Quality of Experience, Light Field Image, Immersive Media, Auxiliary Learning, Deep Learning.

I Introduction

The immersive media experience is presently undergoing a transition from three degrees of freedom (3 DoF) to six degrees of freedom (6 DoF), which provides a greater sense of immersion and enhanced interactivity [1]. Apart from the ability to change viewing directions through rotation (i.e., pitch, yaw, and roll in 3 DoF), 6 DoF offers three additional degrees of freedom (i.e., surge, heave, and sway). Thus, users may navigate the virtual world by changing not only their viewing directions but also their viewing positions. With recent advancements in VR and 5G networking technologies, immersive media with 6 DoF is anticipated to be dominant in future broadcasting services [2]. Indeed, leading technology companies have already been engaged in this race for some time. For instance, Google developed a system in 2020 for recording and producing 6 DoF immersive light field videos [3]. Apple has filed a series of patents on 6 DoF light field imaging for next-generation mobile devices [4]. Additionally, in late 2020, Sony announced an eye-sensing light field display capable of providing a 6 DoF viewing experience [5].

Light field photography is a cornerstone of 6 DoF immersive media. In contrast to conventional two-dimensional imaging, it makes use of additional angular information by collecting light beams from multiple directions. In practice, light field images (LFIs) are often captured using an array of cameras rather than a single camera. As a result, an LFI adds a new angular domain in addition to the spatial domain inherited from 2D photography. Due to this property, LFI is capable of providing a 6 DoF experience in future immersive media applications. However, the advent of LFI imposes new research challenges related to LFI processing. In particular, no-reference light field image quality assessment (NR-LFIQA) is of great significance since it can quantify the user’s quality of experience (QoE) and therefore assist intelligent LFI transmission. Due to the unique structural features of LFI, however, current 2D IQA techniques are unable to accurately predict the user-perceived quality. Although there are some recently proposed NR-LFIQA methods such as BELIF [6] and NR-LFQA [7], their performance is limited as they are fundamentally designed based on 2D IQA methodologies such as naturalness statistics and structural similarity (SSIM) [8]. As a result, they achieve sub-optimal performance and predict inaccurately for certain distortion types such as EPICNN [9] and USCD [10] (see Section IV for details).

To overcome the drawbacks of the aforementioned methods, we propose in this paper a more precise NR-LFIQA metric using a new framework termed “auxiliary learning with angular and spatial features via deep anglewise and depthwise separable convolutions (ALAS-DADS)”. We believe that the superiority of the proposed metric is due to the following features. First, since LFI processing usually suffers from the curse of dimensionality, we propose the light field depthwise separable convolution (LF-DSC) [11] as a low-complexity feature extractor for evaluating the spatial quality of an LFI. Second, to supplement the angular quality information missed by LF-DSC, we further propose the novel light field anglewise separable convolution (LF-ASC) to extract both spatial and angular features for comprehensive quality assessment. Third, we develop an auxiliary learning scheme with two sub-tasks: spatial quality feature estimation and angular quality feature estimation. This scheme leverages both angular and spatial quality features as hints to assist the primary NR-LFIQA task.

The main contributions of this paper are summarised as follows.

  • We theoretically extend the traditional depthwise separable convolution (DSC) to LFI space as LF-DSC. Moreover, we propose the novel LF-ASC for comprehensive LFI feature extraction.

  • Based on LF-DSC and LF-ASC, we propose the first deep auxiliary learning-based model with spatial-angular hints on no-reference light field image quality assessment (NR-LFIQA), which has been verified to outperform the mainstream full reference IQA metrics as well as the state-of-the-art NR-LFIQA methods significantly.

  • We not only show the effectiveness of the proposed LF-DSC, LF-ASC, and the auxiliary learning model in NR-LFIQA, but also shed light on their potential use in other LFI processing tasks such as LFI super-resolution, depth estimation, and so on, paving the way for future LFI research.

The remainder of this paper is structured as follows. In Section II, we review related works covering light field imaging, IQA, NR-LFIQA, separable convolution, and auxiliary learning. In Section III, we illustrate and describe the overall structure, the LF-DSC, the LF-ASC, and the auxiliary learning components of the proposed metric. In Section IV, we present extensive experiments and analyze the experimental results. In Section V, we conclude our work and discuss future work.

II Related Works

II-A Light Field Imaging

Unlike traditional photography, which records the 2D projection of light rays, an LFI describes light rays from multiple directions. This technology allows capturing richer visual information due to the higher-dimensional representation of the data [12]. The theoretical root of LFI is interpreted by the plenoptic function, which is usually denoted as $l_{\lambda}(x,y,z,\theta,\phi,\lambda,t)$, where $l_{\lambda}\,[W/m^{2}/sr/nm/s]$ represents the spectral radiance per unit time, $(x,y,z)$ is the spatial position, $(\theta,\phi)$ gives the light ray directions, $\lambda$ is the wavelength, and $t$ is a temporal instance [12]. With a series of background assumptions, each point in a light field can be denoted by a 4D coordinate $(u,v,x,y)$, where $(x,y)$ represents the spatial position and $(u,v)$ indicates the angular position [12]. From the digital imaging perspective, each pixel in an LFI can be located with two 2D coordinates: an angular coordinate identifying the subview in the LFI collection and a spatial coordinate indicating the exact pixel position in that subview, similar to a traditional 2D image.
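To make the two-coordinate indexing concrete, the following minimal sketch (Python with NumPy) assumes an LFI stored as a 5D array of shape (U, V, X, Y, C); the layout and sizes are illustrative only, not a prescribed data format.

    import numpy as np

    # Assumed layout: (angular rows U, angular cols V, height X, width Y, channels C).
    U, V, X, Y, C = 7, 7, 434, 434, 3
    lfi = np.zeros((U, V, X, Y, C), dtype=np.float32)

    # The angular coordinate (u, v) selects a subview; the spatial coordinate (x, y)
    # then selects a pixel inside that subview, exactly as in a regular 2D image.
    u, v, x, y = 3, 3, 100, 200
    subview = lfi[u, v]        # one 434x434x3 subview
    pixel = lfi[u, v, x, y]    # a single RGB light field sample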

II-B Image Quality Assessment

II-B1 Overview on IQA

According to the availability of original reference data, objective IQA algorithms can be categorised as full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), or no-reference IQA (NR-IQA). The FR-IQA metrics assess image quality categorically by comparing the distorted images to the reference images. The RR-IQA metrics only use partial information of the reference image in the IQA process. The NR-IQA metrics, in contrast to the previous two, evaluate the image quality without using original reference images, which is more practical and relevant in real-world situations.

2D IQA is a well-studied problem with numerous 2D FR-IQA metrics proposed, such as SSIM [8], multi-scale SSIM [13], FSIM [14], IWSSIM [15], the visual saliency-induced index (VSI) [16], gradient magnitude similarity deviation (GMSD) [17], an efficient Minkowski distance-based metric [18], the multiple pseudo reference images (MPRIs) similarity index [19], and the multiscale edge attention (MSEA) similarity index [20]. Furthermore, some 2D NR-IQA indicators have also been introduced. For example, the DIIVINE model measures the local magnitude and phase statistics of natural scenes to evaluate image quality [21]. BLIINDS-II adopts a generalized Gaussian density function with block discrete cosine transform (DCT) coefficients [21] to extract quality features. As one of the most popular NR-IQA metrics, BRISQUE uses the scene statistics of locally normalized luminance coefficients to quantify possible losses of “naturalness” owing to the presence of distortions [22].

In comparison to the FR-IQA and RR-IQA techniques, the NR-IQA approaches pose more technical challenges since they must make a prediction based entirely on the distorted image being assessed. NR-IQA methods usually detect and extract distortion-specific features that lead to visual quality degradation [23, 24, 25]. However, these methods are incapable of recognising the inherent high-dimensional characteristics of LFI for the purpose of evaluating its visual quality.

II-B2 State-of-the-art Models for NR-LFIQA

To date, there are only a few LFI-specific algorithms in the context of NR-LFIQA. Ideally, a comprehensive LFI-specific algorithm should consider both spatial quality and angular consistency. Recent work on NR-LFIQA includes BELIF [6], Tensor-NLFQ [26], LGF-LFC [27], and NR-LFQA [7]. In particular, NR-LFQA proposed the gradient direction distribution (GDD) to measure the deterioration of angular consistency, based on the statistical analysis of horizontal and vertical gradient direction maps. According to the reported experimental results, NR-LFQA outperforms the others. For this reason, we have chosen NR-LFQA as one of the benchmarking models (see Section IV).

Figure 1: The Architecture of the ALAS-DADS Framework

II-C Separable Convolution and Auxiliary Learning

II-C1 Separable Convolution

Separable convolution is a type of convolutional filter in convolutional neural networks (CNNs). CNNs are known to perform better with greater network depth [28]. However, increasing the number of stacked layers (depth) is challenging because the vanishing/exploding gradient problem appears during the learning process [28]. Deep residual networks circumvent this problem by using identity skip-connections, which facilitate the training of much deeper networks [29].

Moreover, high computational intensity is always a problem for deep learning. Howard et al. [11] addressed this issue by introducing the 2D DSC in MobileNet. Unlike traditional 2D convolution, the DSC introduces an intermediate depthwise layer whose channel filters extract each channel’s features separately, rather than filtering all channels together as a regular 2D filter does. The output of the channel filters then goes through a $1\times 1$ (pointwise) convolution that combines all channels’ features. Separating feature extraction from feature combination reinforces the feature extraction process and reduces the computing cost by a polynomial factor. Therefore, we believe it is worth exploring how to extend DSC to LFI processing to reduce the computational complexity.
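As a reference point for the light field extensions introduced later, the following minimal sketch shows a plain 2D depthwise separable convolution in TensorFlow/Keras (the framework named in Section IV-B); the layer sizes are illustrative and are not taken from [11].

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(224, 224, 3))
    # Depthwise stage: one spatial filter per input channel.
    x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
    # Pointwise stage: a 1x1 convolution that mixes the per-channel outputs.
    x = tf.keras.layers.Conv2D(filters=64, kernel_size=1)(x)
    model = tf.keras.Model(inputs, x)
    # Roughly 3*3*3 + 3*64 weights per position, versus 3*3*3*64 for a regular convolution.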

II-C2 Auxiliary Learning

According to Ruder [30], introducing auxiliary learning can be effective for achieving better performance if the associated features are used as hints for a primary task. Several examples of this approach have proved its feasibility in the field of natural language processing [30]. Similar to the idea of “hints”, auxiliary learning can also be used to force the learning process to focus attention on areas of interest that might be ignored in regular training [30]. There have been several attempts to apply multi-task learning in NR-IQA. For example, Xu et al. [31] proposed a multi-task learning framework to train multiple distortion type-specialised IQA models together. As an enhancement of Xu’s model, instead of manually configuring models for each distortion type, Ma et al. [32] developed a multi-task deep learning model, MEON, which consists of a distortion identification network and a quality prediction network. Yan et al. [33] introduced a natural statistics feature estimation sub-task to provide the learning model with an intuition for image naturalness.

II-D Summary of Previous Works

In summary, although many IQA metrics have been proposed in the last few decades, the majority of them are not intended for LFI and thus fail to take advantage of the LFI’s angular dimension. Several new NR-LFIQA metrics, including BELIF and NR-LFQA, have yielded generally acceptable results. They are, however, inept at dealing with some forms of severe distortion such as EPICNN, resulting in unstable IQA predictions with large variations. As stated in Section II-C1, DSC has been shown to be a high-performance and computationally efficient feature extractor even when the input is limited to three dimensions (i.e., the pixel coordinates and a color channel). Thus, it is interesting to investigate the adaptation of DSC from conventional 2D images to LFI in order to substantially decrease the computational cost associated with the high dimensionality. Additionally, the idea of separable convolution may potentially be extended to the angular dimension in the context of LFI for further improvement in feature extraction. Moreover, as mentioned in Section II-C2, the inclusion of auxiliary tasks may assist the model in mastering the primary task, which is NR-LFIQA in our case. Based on the above observations and analysis, we present an auxiliary learning based model with effective LF-ASC and LF-DSC feature extractors, which is introduced in the next section.

III Methodology

The general structure of the proposed ALAS-DADS framework is explained in Section III-A and shown in Fig. 1. It comprises the LF-DSC module, the LF-ASC module, and the auxiliary learning based NR-LFIQA module. First, because the original DSC was developed for 2D images and cannot be directly employed in LFI processing, we expand the concept of DSC to the spatial domain of LFI to form the LF-DSC module (see Section III-B). LF-DSC is a robust spatial feature extractor that is optimised for assessing spatial quality with a small number of parameters. Second, we theoretically expand the concept of LF-DSC to the angular space and introduce the novel concept of LF-ASC (see Section III-C). With significantly reduced computational cost, the LF-ASC can extract both spatial and angular features for complete NR-LFIQA. Third, we propose a novel auxiliary learning based module that exploits both the available spatial and angular clues (formulated as sub-tasks) to aid the primary NR-LFIQA task (see Section III-D). Finally, in Section III-E, we discuss how the proposed framework and techniques can be generalized to other LFI processing tasks.

III-A Model Structure

The ALAS-DADS framework is visualised in Fig. 1 (a). The framework consists of two stride-1/stride-2 LF-DSC blocks and three LF-ASC blocks that are trained via auxiliary learning. The walkthrough of the proposed model can be summarised into the following process:

  1. Data augmentation is conducted to increase the dataset’s size eightfold (see Section IV-A2).

  2. Normalization, i.e., for an LFI $n$ (a minimal code sketch of this step follows the list):

     \hat{p_{n}}(D)=\frac{p_{n}(D)-\mu(D)}{\sigma(D)+1},\ \forall D=(u,v,x,y,c) (1)

     where $(u,v)$ identifies a subview and $(x,y,c)$ represents a pixel and its color channel. $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding pixel channel.

  3. A subview-wise stride-2 2D convolution is performed to reduce the spatial size.

  4. Three LF-ASCs are performed to extract angular as well as spatial features efficiently (see Section III-C).

  5. A stride-4 max pooling is applied to decrease the spatial size.

  6. Two stride-1/stride-2 LF-DSCs are performed for efficient spatial and angular feature extraction (see Section III-B).

  7. A pointwise convolution followed by a stride-2 max pooling is applied to combine the channel features.

  8. The model outputs the NR-LFIQA score with auxiliary learning after fully connected neural networks (FC-NN) (see Section III-D).
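The per-location normalization of step 2 can be sketched as follows; this assumes that $\mu(D)$ and $\sigma(D)$ are computed across the training LFIs for every position $D=(u,v,x,y,c)$, which is one reading of “the corresponding pixel channel”.

    import numpy as np

    def normalize_lfi(lfi, mu, sigma):
        """Eq. (1): lfi, mu and sigma all have shape (U, V, X, Y, C)."""
        return (lfi - mu) / (sigma + 1.0)

    # Toy usage: statistics from a stack of training LFIs of shape (N, U, V, X, Y, C).
    train_lfis = np.random.rand(4, 7, 7, 32, 32, 3).astype(np.float32)
    mu = train_lfis.mean(axis=0)
    sigma = train_lfis.std(axis=0)
    normalized = normalize_lfi(train_lfis[0], mu, sigma)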

To give more model details, TABLE I lists the filter shapes as well as the input size of each model layer. The input shape is 7×7×434×434×3, and the model body extracts 7×7×3×3×1024 features to feed into the prediction module. The prediction module includes an FC-NN to predict the quality scores as the primary task, and two double 256-node layers to estimate the spatial features and the angular features, respectively.

TABLE I: Model Details of ALAS-DADS
Type / Stride       | Filter Shape   | Input Size
Model Body
2D Conv / s2        | 3×3×3×3        | 7×7×434×434×3
AW Conv / s1        | 7×7×4×4×3      | 7×7×217×217×3
AW ConvBloc / s1    | 7×7×4×4×3      | 7×7×217×217×3
AW ConvBloc / s1    | 7×7×4×4×3      | 7×7×217×217×3
Max Pooling / s4    | N/A            | 7×7×217×217×3
DW ConvBloc / s1    | 7×7×4×4×3      | 7×7×54×54×3
DW ConvBloc / s2    | 7×7×4×4×12     | 7×7×54×54×3
DW ConvBloc / s1    | 7×7×4×4×12     | 7×7×27×27×12
DW ConvBloc / s2    | 7×7×4×4×48     | 7×7×27×27×12
DW ConvBloc / s1    | 7×7×4×4×48     | 7×7×14×14×48
DW ConvBloc / s2    | 7×7×4×4×192    | 7×7×14×14×48
Pointwise Conv / s1 | 1×1×1×1024     | 7×7×7×7×192
Max Pooling / s2    | N/A            | 7×7×7×7×1024
Primary Task: IQA Score Prediction
FC-NN               | 451584         | 7×7×3×3×1024
Output              | N/A            | 1
Auxiliary Task 1: Spatial Feature Estimation
FC-NN               | 256            | 1024
FC-NN               | 256            | 1024
Output              | N/A            | 36
Auxiliary Task 2: Angular Feature Estimation
FC-NN               | 256            | 1024
FC-NN               | 256            | 1024
Output              | N/A            | 8

III-B Depthwise Separable Convolutions for LFI

As stated in Section II-C1, the DSC was initially developed for 2D images, which involve not only the spatial dimensions but also a depth dimension (i.e., the number of channels). It converts a normal convolution into a depthwise convolution followed by a $1\times 1$ convolution (i.e., a pointwise convolution) that combines the outputs from the preceding layer. The primary difference between DSC and regular convolution is that DSC separates feature filtering and output combination into two layers instead of a single step. Because this factorization lowers the dimensionality of the computation, it significantly reduces the computational complexity and the model size. To achieve these benefits, we expand the DSC into the spatial domain of the LFI to resolve the curse of dimensionality in LFI. Specifically, each subview in an LFI is treated as a 2D image that is fed into a depthwise convolution followed by a pointwise convolution. Fig. 1 (c) illustrates the internal structure of the LF-DSC block, which includes a pointwise convolution, a subview-wise depthwise convolution, and a second pointwise convolution for combining the channel features. Suppose a subview-wise 2D convolution takes a $u_{i}\times v_{i}\times x_{i}\times y_{i}\times c_{i}$ input tensor $T_{i}$ with kernel $K_{j}\in R^{k_{j}\times k_{j}\times c_{i}\times c_{j}}$; we can calculate its computational cost as:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}\cdot c_{j}\cdot k_{j}\cdot k_{j} (2)

The proposed LF-DSC can achieve similar performance but only takes:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}(c_{j}+k_{j}^{2}) (3)

Therefore, by adopting the LF-DSC, we can save the computational cost of:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}[(c_{j}-1)\cdot(k_{j}^{2}-1)-1] (4)

Lastly, we have two types of LF-DSC: the stride-2 LF-DSC, which reduces the spatial dimensions and enriches the channel dimensions, and the stride-1 LF-DSC, which retains the input dimensions but applies residual shortcuts for more efficient training.
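A minimal sketch of an LF-DSC block in TensorFlow/Keras follows, built only from the description above: every subview is filtered by a shared depthwise convolution and a shared pointwise convolution. The layer sizes and the omission of the residual shortcut are simplifications, not the exact configuration of TABLE I.

    import tensorflow as tf

    class LFDepthwiseSeparableConv(tf.keras.layers.Layer):
        """Subview-wise depthwise separable convolution for a light field tensor."""

        def __init__(self, out_channels, stride=1, kernel_size=3):
            super().__init__()
            self.depthwise = tf.keras.layers.DepthwiseConv2D(
                kernel_size, strides=stride, padding="same")
            self.pointwise = tf.keras.layers.Conv2D(out_channels, kernel_size=1)

        def call(self, lfi):                              # lfi: (batch, U, V, X, Y, C)
            _, U, V, X, Y, C = lfi.shape
            x = tf.reshape(lfi, (-1, X, Y, C))            # fold batch and angular dims together
            x = self.pointwise(self.depthwise(x))         # spatial filtering per subview
            Xo, Yo, Co = x.shape[1:]
            return tf.reshape(x, (-1, U, V, Xo, Yo, Co))  # restore the light field layout

    # Toy usage on a light field of shape (1, 7, 7, 32, 32, 3).
    out = LFDepthwiseSeparableConv(out_channels=12, stride=2)(tf.zeros((1, 7, 7, 32, 32, 3)))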

III-C Anglewise Separable Convolutions for LFI

LF-DSC is mainly designed to concentrate on spatial feature extraction. To supplement the angular quality information missed by LF-DSC, we theoretically extend LF-DSC to the angular domain of LFIs and introduce the novel concept of LF-ASC. Since LF-ASC is designed for the distinctive structure of LFIs, it makes extensive use of the extra angular information to achieve high-performance feature extraction at a low computational cost. Fig. 1 (b) illustrates the structure of an LF-ASC module. LF-ASC divides a one-step 4D convolution with kernel $K_{j}\in R^{a_{j}\times a_{j}\times k_{j}\times k_{j}\times c_{i}\times c_{j}}$ into two lower-dimensional convolutions: a horizontal convolution $H_{j}\in R^{a_{j}\times k_{j}\times k_{j}\times c_{i}\times c_{j}}$ and a vertical convolution $V_{j}\in R^{a_{j}\times k_{j}\times k_{j}\times c_{i}\times c_{j}}$. Hence, the angular multiplications are reduced from $a_{j}^{2}$ for one 4D convolution to $2a_{j}$ in total for two 3D convolutions, while comparable feature extraction performance is achieved.

Suppose a standard 4D convolution takes a $u_{i}\times v_{i}\times x_{i}\times y_{i}\times c_{i}$ input tensor $T_{i}$ with kernel $K_{j}\in R^{a_{j}\times a_{j}\times k_{j}\times k_{j}\times c_{i}\times c_{j}}$; we can calculate the computational cost as:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}\cdot c_{j}\cdot k_{j}^{2}\cdot a_{j}^{2} (5)

The proposed LF-ASC can obtain similar performance but reduce the computational cost significantly with:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}\cdot c_{j}\cdot k_{j}^{2}\cdot 2a_{j} (6)

The proposed LF-DSC and LF-ASC together can perform efficient feature extraction only at a cost of:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}(c_{j}+k_{j}^{2}+2c_{j}a_{j}k_{j}^{2}) (7)

Therefore, by adopting the combination of the LF-DSC and LF-ASC instead of a 4D convolution, we can significantly reduce the computational cost by:

u_{i}\cdot v_{i}\cdot x_{i}\cdot y_{i}\cdot c_{i}(c_{j}a_{j}^{2}k_{j}^{2}-2c_{j}a_{j}k_{j}^{2}-k_{j}^{2}-c_{j}) (8)
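The factorization behind LF-ASC can be sketched as two Conv3D passes in TensorFlow/Keras, one over (v, x, y) and one over (u, x, y), as assumed below; the kernel sizes and channel handling are illustrative and do not reproduce the exact configuration of TABLE I.

    import tensorflow as tf

    class LFAnglewiseSeparableConv(tf.keras.layers.Layer):
        """4D (angular x spatial) convolution factorized into horizontal and vertical 3D convolutions."""

        def __init__(self, out_channels, angular_kernel=4, spatial_kernel=3):
            super().__init__()
            k = (angular_kernel, spatial_kernel, spatial_kernel)
            self.horizontal = tf.keras.layers.Conv3D(out_channels, k, padding="same")  # over (v, x, y)
            self.vertical = tf.keras.layers.Conv3D(out_channels, k, padding="same")    # over (u, x, y)

        def call(self, lfi):                                   # lfi: (batch, U, V, X, Y, C)
            _, U, V, X, Y, C = lfi.shape
            h = tf.reshape(lfi, (-1, V, X, Y, C))              # fold U into the batch
            h = self.horizontal(h)
            h = tf.reshape(h, (-1, U, V, X, Y, h.shape[-1]))
            h = tf.transpose(h, (0, 2, 1, 3, 4, 5))            # swap U and V
            v = tf.reshape(h, (-1, U, X, Y, h.shape[-1]))      # fold V into the batch
            v = self.vertical(v)
            v = tf.reshape(v, (-1, V, U, X, Y, v.shape[-1]))
            return tf.transpose(v, (0, 2, 1, 3, 4, 5))         # back to (batch, U, V, X, Y, C')

    # Toy usage on a light field of shape (1, 7, 7, 32, 32, 3).
    out = LFAnglewiseSeparableConv(out_channels=3)(tf.zeros((1, 7, 7, 32, 32, 3)))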

III-D Auxiliary Learning for NR-LFIQA

To further improve the performance, we configure spatial feature estimation and angular feature estimation as two auxiliary tasks that support the primary NR-LFIQA task. The primary task shares the same hidden layers and weights with the two auxiliary tasks to take both spatial and angular quality into consideration. Nonetheless, each task has its own FC-NN to be optimised before the output.

For the labels of the auxiliary tasks, natural scene statistics (NSS) features from BRISQUE are adopted as the spatial quality features [22] due to their reliability (as discussed in Section II-B1), while the gradient direction distribution (GDD) features are used as the angular quality features following the state-of-the-art NR-LFQA (as discussed in Section II-B2). Since BRISQUE is designed for 2D images, we take the average NSS features over all subviews as the spatial labels for an LFI. The primary task is output with a single FC-NN, while each auxiliary task uses a global average pooling to shrink the dimensions, followed by a double 256-node FC-NN to optimize its loss function.
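One plausible way to build the spatial labels is sketched below; compute_brisque_nss is a hypothetical placeholder for a 36-dimensional BRISQUE NSS extractor (not a real library call), and the 8-dimensional GDD labels would be produced analogously from the whole LFI.

    import numpy as np

    def compute_brisque_nss(subview):
        """Hypothetical 36-dim BRISQUE NSS extractor; a real implementation is assumed elsewhere."""
        return np.zeros(36, dtype=np.float32)

    def spatial_label(lfi):
        """Average the per-subview NSS features over all U*V subviews of an LFI (U, V, X, Y, C)."""
        U, V = lfi.shape[:2]
        feats = [compute_brisque_nss(lfi[u, v]) for u in range(U) for v in range(V)]
        return np.mean(feats, axis=0)   # 36-dim spatial quality label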

We leverage the mean squared error (MSE) as the loss function for each task. The loss function of the primary task is formulated as:

l_{p}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y_{i}})^{2} (9)

where $N$ is the total number of input LFIs, $y_{i}$ denotes the true quality score, and $\hat{y_{i}}$ represents the predicted score.

The loss function for the auxiliary task of spatial feature estimation can be formulated as:

l_{s}=\frac{1}{D_{s}N}\sum_{i=1}^{N}\sum_{j=1}^{D_{s}}(s_{ij}-\hat{s_{ij}})^{2} (10)

where $s_{ij}$ stands for each element of a spatial feature vector $S_{i}$ while $\hat{s_{ij}}$ is the corresponding predicted value. Since we adopted a $36\times 1$ NSS feature vector from BRISQUE, the dimension of the spatial feature $D_{s}$ is 36.

The loss function for the auxiliary task of angular feature estimation can be formulated as:

l_{a}=\frac{1}{D_{a}N}\sum_{i=1}^{N}\sum_{j=1}^{D_{a}}(a_{ij}-\hat{a_{ij}})^{2} (11)

where $a_{ij}$ captures each element of an angular feature vector $A_{i}$ while $\hat{a_{ij}}$ is the corresponding predicted value. Since we adopted an $8\times 1$ GDD feature vector from the state-of-the-art NR-LFQA, the dimension of the angular feature $D_{a}$ is 8.

The total training loss of the ALAS-DADS is defined as:

L=l_{p}+\lambda(l_{s}+l_{a}) (12)

where $\lambda$ is the balancing factor between the main and the auxiliary tasks. In related work such as [33, 34, 35, 36], typical values for $\lambda$ are 1, 0.1, and 0.01. In practice, we trained our framework with 1, 0.1, 0.01, and 0.001, and chose 0.01 according to the experimental results. Additionally, we normalised the labels of the quality features for the auxiliary tasks in order to scale the loss from the three sources (i.e., quality score, spatial quality features, and angular quality features) to a comparable level and guarantee the effectiveness of the factor $\lambda$.
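Putting Eqs. (9)-(12) together, a minimal sketch of the total training loss with $\lambda=0.01$ is given below; the tensor names are illustrative.

    import tensorflow as tf

    mse = tf.keras.losses.MeanSquaredError()

    def alas_dads_loss(y_true, y_pred, s_true, s_pred, a_true, a_pred, lam=0.01):
        l_p = mse(y_true, y_pred)   # primary task: quality score, Eq. (9)
        l_s = mse(s_true, s_pred)   # auxiliary task: 36-dim spatial (NSS) features, Eq. (10)
        l_a = mse(a_true, a_pred)   # auxiliary task: 8-dim angular (GDD) features, Eq. (11)
        return l_p + lam * (l_s + l_a)   # Eq. (12)

In a Keras multi-output model, the same weighting could alternatively be expressed through the loss_weights argument of model.compile.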

Therefore, in response to an LFI, the framework outputs a predicted quality score, the estimated spatial quality features, and the estimated angular quality features. The determination of the quality scores depends on the labelling of the dataset on which the model is trained (see Section IV-A1). Moreover, we use natural scene statistics (NSS) (i.e., a $36\times 1$ vector) from BRISQUE as the spatial quality features and the gradient direction distribution (GDD) (i.e., an $8\times 1$ vector) from the state-of-the-art NR-LFQA as the angular quality features.

Figure 2: Data Augmentation for a Sample LFI from Win5-LID [37]

III-E Discussion on Adaptation

We believe that our proposed techniques, such as LF-DSC and LF-ASC, and the related auxiliary learning scheme, can be easily adapted to other LFI processing tasks, such as LFI super-resolution, LFI image classification, and LFI depth estimation. First, similar to how DSC is used in a variety of 2D computer vision applications [38, 39, 40], the LF-DSC and LF-ASC can be used as highly efficient generative LFI feature extractors for these tasks. Second, the ALAS-DADS architecture as a whole can be easily modified to perform various LFI processing tasks. For example, Lu et al. [41] developed an LFI classification model that contains a VGGNet with an interleaved CNN refiner. One potential adaptation is to substitute the interleaved CNN in this classification model with LF-DSC and LF-ASC for more efficient feature extraction. Besides, another potential adaptation is to adopt ALAS-DADS as the framework replacing the VGGNet since VGGNet is designed for 2D images. In this instance, the primary task would be LFI classification, while the auxiliary tasks could include estimating the saliency map, which may provide useful hints to the primary task.

IV Experiments

In this section, we first conduct an ablation study to show the effectiveness of LF-DSC and LF-ASC, and then conduct a series of experiments on the mainstream LFI datasets, namely Win5-LID and SMART, to compare the proposed metric with representative state-of-the-art metrics. Specifically, Section IV-A describes the experimental setup, including the datasets (see Section IV-A1), the data augmentation process (see Section IV-A2), and the performance measures (see Section IV-A3). A training scheme dedicatedly designed for LFI is introduced in Section IV-B; this design enables learning from the high-dimensional LFI data with limited computational resources. We also discuss the generalization of the proposed ALAS-DADS model at the end of Section IV-B. In Section IV-C, we interpret the results of the ablation study to show the effectiveness and learning efficiency of LF-DSC and LF-ASC in comparison to 4D convolution. Finally, we discuss the results of the benchmarking experiments in Section IV-D, including the performance benchmarking of the proposed metric against other metrics (see Section IV-D1), performance analysis of the proposed metric under various distortion types (see Section IV-D2), and some sample predictions from the proposed metric (see Section IV-D3).

IV-A Experimental Setup

IV-A1 Datasets

We use two mainstream publicly available labelled datasets, namely Win5-LID [37] and SMART [42], for evaluating the NR-LFIQA metrics. Since the LFIs from these two datasets have different image sizes, we trim and reshape all images to the same size of 7×7×434×434×3, while retaining the most informative image parts according to [43]. The Win5-LID dataset contains 220 LFIs covering 6 distortion types, including HEVC, JPEG 2000, linear interpolation (LN), nearest-neighbour (NN) interpolation, EPICNN, and USCD, with five distortion levels. The quality of the 220 LFIs in Win5-LID was rated by participants on a 5-point discrete scale under a double-stimulus continuous quality scale (DSCQS), and the overall mean opinion score (MOS) was calculated for each LFI. The SMART dataset is based on 16 original LFIs with 256 distorted LFIs obtained from 4 compression distortions, i.e., HEVC Intra, JPEG, JPEG 2000, and SSDC. The subjective ratings were gathered on the scale of Bradley-Terry (BT) scores.

There are several points to note about the model output. In Win5-LID, the quality score is the mean opinion score (MOS), which ranges from 1 to 5 with higher values indicating higher quality, whereas in SMART it is the BT score, which typically ranges from -10 to 1 with higher values indicating higher quality. The majority of spatial quality feature values lie in the range (-2, 2), while the majority of angular quality feature values lie in the range (-1, 1).

IV-A2 Data Augmentation

We significantly enlarge the current datasets by data augmentation to generate more data for training, validation, and testing. To this end, following [44], geometric transformations are applied to the LFIs, including rotations at different angles and vertical flipping. Although brightness or contrast adjustment and Gaussian noise are also widely used for data augmentation [44], these techniques are not employed because they may affect the user-perceived quality of the LFIs. Cropping is not employed either, as it may impair the structural quality of the LFI. Thus, we rotate the LFIs in the datasets by 90, 180, and 270 degrees and perform vertical flipping, which increases the datasets by a factor of eight, bringing Win5-LID to 1760 LFIs and SMART to 2048 LFIs. An example of such data augmentation is shown in Fig. 2.
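A minimal sketch of this eightfold augmentation for an LFI of shape (U, V, X, Y, C) follows; it assumes one plausible convention in which the angular grid and every subview are rotated and flipped together so that the light field geometry remains consistent.

    import numpy as np

    def rotate_lfi(lfi, k):
        """Rotate by k*90 degrees in both the angular (U, V) and spatial (X, Y) planes."""
        return np.rot90(np.rot90(lfi, k, axes=(0, 1)), k, axes=(2, 3))

    def vflip_lfi(lfi):
        """Vertical flip along both the angular and the spatial vertical axes."""
        return np.flip(np.flip(lfi, axis=0), axis=2)

    def augment_eightfold(lfi):
        base = [lfi, vflip_lfi(lfi)]
        return [rotate_lfi(x, k) for x in base for k in range(4)]   # 8 variants, original included

    samples = augment_eightfold(np.zeros((7, 7, 64, 64, 3), dtype=np.float32))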

IV-A3 Performance Measures

We employ three measures to evaluate the IQA metrics’ performance: the root mean square error (RMSE) [45], the Spearman rank-order correlation coefficient (SROCC) [46], and the Pearson linear correlation coefficient (PLCC) [45]. The smaller the RMSE or the greater the SROCC or PLCC, the more consistent the IQA model’s output is with the human perception of the LFI.
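For reference, the three measures can be computed as in the following sketch (predicted versus subjective scores; the numbers are toy values only).

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def evaluate(pred, true):
        pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
        rmse = np.sqrt(np.mean((pred - true) ** 2))     # root mean square error
        srocc = spearmanr(pred, true).correlation       # rank-order correlation
        plcc = pearsonr(pred, true)[0]                  # linear correlation
        return rmse, srocc, plcc

    print(evaluate([1.2, 2.8, 4.1], [1.0, 3.0, 4.0]))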

IV-B Training and Generalization

TABLE II: Statistics of Training
Dataset   | Training Loss | Validation Loss | Testing Loss | Time (s)
Win5-LID  | 0.3081        | 0.3232          | 0.3669       | 6343
SMART     | 0.4560        | 0.6606          | 0.7826       | 14881

We implemented the proposed models in Python with TensorFlow and trained them on a hardware configuration of an AMD Ryzen 7 3700X CPU, two NVIDIA RTX2070S GPUs, and 64 GB RAM. Since LFI data is high-dimensional, it is infeasible to load many LFIs simultaneously for training due to limited memory resources. Instead, we design a dedicated training scheme for efficiently processing the high-dimensional LFI data. The datasets are first split into training and testing segments at the ratio of 0.8 and 0.2. We then use stochastic learning for training and validation. For each tiny training batch:

  • $m$ training images are randomly drawn from the training dataset without replacement.

  • $n$ validation images are randomly drawn from the training dataset with replacement.

  • For each epoch, the model monitors the validation performance and will stop early and switch to the next batch if:

    • the performance has not improved for $p$ consecutive epochs (i.e., the patience); or

    • it reaches the epoch limit $l$ of a tiny training batch.

Under this training scheme, it is possible to tune the training parameters to control the training process. For example, the epoch limit $l$ can be configured as a relatively small number (e.g., 5) to avoid overfitting. For optimization, Adam with the AMSGrad variant [47] is applied, which stores “long-term memory” of past gradients to resolve Adam’s convergence issues and improve performance. Since we have two GPUs, we implemented a parallel training scheme to improve training efficiency. Due to communication overhead, TensorFlow’s default distributed framework does not take full advantage of multi-GPU computing. Therefore, we train a model independently on each of the two GPUs and update both with the model that has the lowest validation loss every 100 batches. Finally, as shown in TABLE II, training took 6343 seconds on Win5-LID and 14881 seconds on SMART.
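A minimal sketch of this tiny-batch scheme is given below; model, train_lfis and train_labels are assumed to exist, m, n, p, l follow the notation above (training/validation draws, patience, and epoch limit), and the two-GPU synchronisation is omitted.

    import numpy as np

    def train_tiny_batches(model, train_lfis, train_labels, batches=100, m=2, n=2, p=3, l=5):
        idx = np.arange(len(train_lfis))
        for _ in range(batches):
            tr = np.random.choice(idx, size=m, replace=False)   # training draw, without replacement
            va = np.random.choice(idx, size=n, replace=True)    # validation draw, with replacement
            best, wait = np.inf, 0
            for _epoch in range(l):                              # epoch limit per tiny batch
                model.fit(train_lfis[tr], train_labels[tr], epochs=1, verbose=0)
                # assumed to return a single scalar validation loss
                val_loss = model.evaluate(train_lfis[va], train_labels[va], verbose=0)
                if val_loss < best:
                    best, wait = val_loss, 0
                else:
                    wait += 1
                    if wait >= p:                                # patience exhausted: early stop
                        break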

Deep learning-based methods are often questioned over overfitting. To address this issue, we use residual shortcuts and dropout to regularise the CNN, in addition to the data augmentation for better generalization. More importantly, we split the dataset into training and testing segments at the very beginning: only the training segment is involved during the whole training and validation process, while all the experimental results are based on the independent testing segment without retraining. The training statistics in TABLE II show that the ALAS-DADS model attains a good trade-off between overfitting and underfitting: the training loss is slightly lower than the validation loss, and both are lower than the testing loss.

TABLE III: Ablation Study: 10-4D-Conv vs 10-LF-DSC vs 10-LF-ASC vs 10-LF-DSC-ASC
Model          | Tr. Param. | Time (s) | RMSE   | SROCC  | PLCC
10-4D-Conv     | 374,072    | 1587     | 2.0874 | 0.0964 | 0.1140
10-LF-DSC      | 25,232     | 851      | 0.9854 | 0.1697 | 0.1981
10-LF-ASC      | 20,222     | 690      | 1.0220 | 0.1852 | 0.2674
10-LF-DSC-ASC  | 45,452     | 1095     | 0.9356 | 0.2602 | 0.3045

IV-C Effectiveness of LF-DSC and LF-ASC

We performed an ablation study prior to the benchmarking experiments to demonstrate the efficacy and learning efficiency of LF-DSC and LF-ASC in comparison to 4D convolution. We developed four testing models: 10-4D-Conv, 10-LF-DSC, 10-LF-ASC, and 10-LF-DSC-ASC. Except for the backbone, the construction of the four models is identical. For example, ten 4D convolution layers form the backbone of 10-4D-Conv; similarly, 10-LF-DSC, 10-LF-ASC, and 10-LF-DSC-ASC contain ten LF-DSCs, ten LF-ASCs, and ten LF-DSC/LF-ASC combinations, respectively. The models are then evenly trained on Win5-LID using 100 batches of 5 epochs each, where each batch contains two LFIs for training and two LFIs for validation. TABLE III summarises the results of the ablation study. Two things should be noted: 1) the training times in the table include data loading but not the evaluation time; and 2) the performance measures (i.e., RMSE, SROCC, and PLCC) are computed on the testing data. The number of trainable parameters in 10-4D-Conv is much larger than in the other three models. While 10-LF-DSC-ASC includes a total of twenty layers (ten LF-DSCs and ten LF-ASCs), it nevertheless has a much smaller number of trainable parameters and a shorter training duration than 10-4D-Conv. Additionally, despite the shorter learning period, 10-LF-DSC and 10-LF-ASC outperform 10-4D-Conv, with 10-LF-DSC-ASC achieving the best performance. Thus, the results of this ablation study confirm the theoretical conclusion drawn in Section III that LF-DSC and LF-ASC can attain higher NR-LFIQA performance at a reduced computational cost.

IV-D Benchmarking Experiments

To validate the proposed metric’s performance, we conduct an experiment comparing it to traditional metrics (i.e., PSNR, SSIM, and BRISQUE) and the state-of-the-art LFI metric (i.e., NR-LFQA). It is worth mentioning that we acquired the source code of NR-LFQA from the original authors and retrained the model under identical experimental conditions. Because PSNR and SSIM are both FR-IQA metrics, we provide them with the original undistorted images of both datasets. In addition, the BRISQUE prediction for an LFI is taken as the average quality score over its subviews.

TABLE IV: Overall Results of Tested Metrics
           | Win5-LID              | SMART
Metrics    | RMSE   SROCC  PLCC    | RMSE   SROCC  PLCC
PSNR       | 0.8469 0.6579 0.6577  | 3.0367 0.6996 0.6530
SSIM       | 1.3253 0.5538 0.5523  | 2.3946 0.6043 0.6195
BRISQUE    | 0.7901 0.5647 0.6111  | 1.4480 0.6331 0.7398
NR-LFQA    | 0.6421 0.7346 0.7451  | 1.7500 0.6944 0.7784
ALAS-DADS  | 0.3669 0.9260 0.9257  | 0.7826 0.8540 0.9344
TABLE V: RMSE and PLCC of Tested Metrics in Different Types of Distortions
Distortion   | PSNR          | SSIM          | BRISQUE       | NR-LFQA       | ALAS-DADS
Win5-LID     | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC
EPICNN       | 1.7168 0.6834 | 1.0114 0.4374 | 0.8895 0.7949 | 0.8475 0.8131 | 0.2002 0.9895
HEVC         | 1.0146 0.5770 | 1.1218 0.6354 | 0.8517 0.6660 | 0.6617 0.8307 | 0.4402 0.9281
JPEG 2000    | 0.8007 0.6951 | 1.0953 0.5447 | 0.6729 0.7178 | 0.6306 0.7451 | 0.3604 0.9230
LN           | 0.5671 0.8088 | 1.6386 0.7039 | 0.7967 0.4784 | 0.7727 0.5609 | 0.3263 0.9298
NN           | 0.6070 0.8044 | 1.3490 0.7932 | 0.6804 0.4459 | 0.3797 0.8565 | 0.3560 0.8758
USCD         | 1.0142 0.5357 | 1.6746 0.1630 | 1.1819 0.4592 | 0.8802 0.4798 | 0.3300 0.9536
Overall      | 0.8469 0.6577 | 1.3253 0.5523 | 0.7901 0.6111 | 0.6421 0.7451 | 0.3669 0.9257
SMART        | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC   | RMSE   PLCC
HEVC         | 2.7776 0.7667 | 2.5174 0.6644 | 1.4305 0.8975 | 1.7701 0.8397 | 0.7733 0.9607
JPEG         | 2.2259 0.5868 | 2.8726 0.4600 | 1.3584 0.4173 | 1.3060 0.6445 | 0.8153 0.8240
JPEG 2000    | 3.5570 0.7744 | 2.2063 0.7486 | 1.5690 0.7476 | 2.1327 0.8021 | 0.7313 0.9555
SSDC         | 3.4005 0.6131 | 1.8653 0.6554 | 1.3878 0.5346 | 1.6885 0.6946 | 0.8079 0.8899
Overall      | 3.0367 0.6530 | 2.3946 0.6195 | 1.4480 0.7398 | 1.7500 0.7784 | 0.7826 0.9344

IV-D1 Benchmarking Results

TABLE IV shows the overall testing results of all the tested metrics. In general, ALAS-DADS significantly outperforms the others on both Win5-LID and SMART. On Win5-LID specifically, ALAS-DADS achieves a 0.2752 RMSE reduction as well as 0.1914 and 0.1806 improvements in SROCC and PLCC, respectively, compared with the state-of-the-art NR-LFQA model. Moreover, ALAS-DADS gains an even more significant performance boost on SMART, where it achieves 0.6654 lower RMSE than BRISQUE (the second-best) and increases the SROCC and PLCC by 0.1596 and 0.1560, respectively, compared to the state-of-the-art NR-LFQA.

The performance of the tested metrics under different distortion types is presented in TABLE V. One can see that ALAS-DADS attains the lowest RMSE for all distortion types in both datasets. For the most challenging distortion types of Win5-LID, e.g., EPICNN and USCD, ALAS-DADS achieves RMSEs of 0.2002 and 0.3300, respectively, while the other metrics exceed 0.8400 for both distortion types. In addition, ALAS-DADS also substantially shrinks the RMSE for each distortion type of SMART compared to the other metrics.

Figure 3: Comparison of Regression and Distortion Scatter Plots Between ALAS-DADS and Other FR-IQA Metrics
Figure 4: Comparison of Regression and Distortion Scatter Plots Between ALAS-DADS and Other NR-IQA Metrics

Fig. 3 and 4 visualise the benchmarking results against the FR-IQA and NR-IQA metrics on each dataset. Both figures can be read from left to right to observe the significant improvement achieved by the proposed model. In the regression scatter plots, ALAS-DADS’s predicted quality scores are more concentrated and closer to the red line (i.e., the perfect prediction) than those of the other metrics on both datasets. The blue line (i.e., the linear regression) almost overlaps the red line, which implies that ALAS-DADS’s predictions are not only less biased but also have lower variance than those of the other metrics. From the scatter plots of each distortion type, we can draw the same conclusion: for each type of distortion, ALAS-DADS consistently achieves more accurate quality scores than the others.

To sum up, ALAS-DADS achieves a 0.2752 (i.e., 42.86%) RMSE reduction and 0.1914 and 0.1806 improvements in SROCC and PLCC, respectively, compared with the second-best state-of-the-art NR-LFQA model on Win5-LID. Moreover, ALAS-DADS gains an even more prominent performance boost on SMART, where it achieves 0.6654 (i.e., 45.95%) lower RMSE than the second-best BRISQUE, and increases the SROCC and PLCC by 0.1596 (i.e., 22.98%) and 0.1560 (i.e., 20.04%), respectively, compared to the state-of-the-art NR-LFQA. For the most challenging distortion types (i.e., EPICNN and USCD in Win5-LID), ALAS-DADS achieves RMSEs of 0.2002 (i.e., 76.38% better than the state of the art) and 0.3300 (i.e., 62.51% better than the state of the art), respectively, while the other metrics exceed 0.8400 for both distortion types.

IV-D2 Performance Analysis

To further visualize the performance of the proposed model, Fig. 5 shows the scatter plots of the predicted quality scores versus the ground truth. In Fig. 5 (a) and (d), the plots show the distributions of the quality score predictions and the ground truth, where colored hexagons indicate the density of points in that area. Fig. 5 (a) is based on the testing results on Win5-LID, where the predicted values follow a roughly normal distribution while the ground truth values have a flatter shape. Fig. 5 (d) is based on SMART, where both the predicted values and the ground truth approximately follow a normal distribution.

Figure 5: Scatter Plots Showing Predicted Values of ALAS-DADS vs Ground Truth Values

The blue lines in Fig. 5 (b) and (e) show the linear regression of all prediction points, and the red dashed line is the perfect prediction line with a slope of 1, which serves as the anchoring line. In both Fig. 5 (b) and (e), it can be seen that the linear regression of the values predicted by ALAS-DADS is very close to the perfect prediction line. The predictions in Fig. 5 (e) are more concentrated on the ideal prediction line than those in Fig. 5 (b). As shown in Fig. 5 (a) and (d), the distribution of the ground truth of SMART is more predictable (i.e., closer to a normal distribution) than that of Win5-LID, which may explain the lower variance in ALAS-DADS’s predictions for SMART.

TABLE VI: Performance Details of ALAS-DADS in Different Distortion Types
Dataset   | Distortion | RMSE   | SROCC  | PLCC
Win5-LID  | EPICNN     | 0.2002 | 0.9536 | 0.9895
          | HEVC       | 0.4402 | 0.9369 | 0.9281
          | JPEG 2000  | 0.3604 | 0.9143 | 0.9230
          | LN         | 0.3263 | 0.9363 | 0.9298
          | NN         | 0.3560 | 0.8785 | 0.8758
          | USCD       | 0.3300 | 0.8937 | 0.9536
          | Overall    | 0.3669 | 0.9260 | 0.9257
SMART     | HEVC       | 0.7733 | 0.9194 | 0.9607
          | JPEG       | 0.8153 | 0.7404 | 0.8240
          | JPEG 2000  | 0.7313 | 0.9288 | 0.9555
          | SSDC       | 0.8079 | 0.6842 | 0.8899
          | Overall    | 0.7826 | 0.8540 | 0.9344

Fig. 5 (c) and (f) show the scatter plots of the predicted values against the ground truth for each type of distortion in each dataset, showing that ALAS-DADS predicts very close to the ground truth for every distortion type (performance details are given in TABLE VI). On Win5-LID, ALAS-DADS performs particularly well at predicting the quality scores of EPICNN-distorted LFIs, achieving the smallest RMSE of 0.2002 and the best PLCC of 0.9895 (i.e., extremely close to the perfect 1), while it obtains the lowest SROCC and PLCC for LFIs distorted by nearest-neighbour interpolation. On the SMART dataset, ALAS-DADS tends to predict more accurately for the HEVC- and JPEG 2000-distorted LFIs than for the JPEG- and SSDC-distorted LFIs. From the above results, we notice performance variations in the predictions across distortion types. We believe this might be affected by ALAS-DADS’s adoption of the spatial and angular quality features (i.e., NSS from BRISQUE as the spatial quality features and GDD from NR-LFQA as the angular quality features), which act as hints for auxiliary learning. For example, both BRISQUE (with NSS) and NR-LFQA (with GDD) perform well on JPEG 2000, which could help ALAS-DADS obtain higher accuracy for JPEG 2000, too.

IV-D3 Sample Predictions

Figure 6: Sample LFIs from Win5-LID
Figure 7: Sample LFIs from SMART
TABLE VII: Example Image Results for Fig. 6 and Fig. 7
Image                                    | True    | Method    | Pred    | |Pred-True|
Win5-LID
HEVC-Sphynx-44                           | 1.2174  | BRISQUE   | 1.2835  | 0.0661
                                         |         | NR-LFQA   | 1.5041  | 0.2867
                                         |         | ALAS-DADS | 1.2125  | 0.0049
JPEG2000-Palais-du-Luxembourg-100+rot90  | 2.9130  | BRISQUE   | 3.0901  | 0.1770
                                         |         | NR-LFQA   | 3.1556  | 0.2425
                                         |         | ALAS-DADS | 2.9191  | 0.0060
USCD-Vespa+rot180+flip                   | 3.1304  | BRISQUE   | 3.9864  | 0.8559
                                         |         | NR-LFQA   | 3.7500  | 0.6196
                                         |         | ALAS-DADS | 3.1367  | 0.0062
LN-museum-10+rot270+flip                 | 4.3913  | BRISQUE   | 3.4670  | 0.9243
                                         |         | NR-LFQA   | 2.6758  | 0.1503
                                         |         | ALAS-DADS | 2.8190  | 0.0070
JPEG2000-museum-150+rot90+flip           | 2.8261  | BRISQUE   | 3.4670  | 0.9243
                                         |         | NR-LFQA   | 3.4795  | 0.9118
                                         |         | ALAS-DADS | 4.3850  | 0.0063
SMART
JPEG-485-70+rot90+flip                   | -2.9399 | BRISQUE   | -4.1971 | 0.7319
                                         |         | NR-LFQA   | -2.4821 | 0.4579
                                         |         | ALAS-DADS | -3.4729 | 0.0078
JPEG-481-70+rot180+flip                  | -3.4652 | BRISQUE   | -4.1971 | 0.7319
                                         |         | NR-LFQA   | -4.5941 | 1.1290
                                         |         | ALAS-DADS | -2.9453 | 0.0054
JPEG2000-465-100+rot90                   | 3.1304  | BRISQUE   | -6.1960 | 0.8239
                                         |         | NR-LFQA   | -5.5500 | 0.1779
                                         |         | ALAS-DADS | -6.1495 | 0.0118
JPEG2000-405-100+rot180+flip             | 4.3913  | BRISQUE   | -6.1064 | 0.0549
                                         |         | NR-LFQA   | -6.6884 | 0.5271
                                         |         | ALAS-DADS | -5.3621 | 0.0100
SSDC-417-42                              | -2.3157 | BRISQUE   | -1.9397 | 0.3759
                                         |         | NR-LFQA   | -1.7745 | 0.5411
                                         |         | ALAS-DADS | -2.3038 | 0.0118

From each dataset, we choose five sample LFIs to demonstrate the prediction results. The results are shown in TABLE VII, and the associated images are displayed in Fig. 6 and 7. For the Win5-LID samples, ALAS-DADS’s differences between the predicted values and the ground truth are all smaller than 0.01, which is significantly less than those of the other NR-LFIQA metrics. For SMART, ALAS-DADS likewise achieves remarkable performance, with errors averaging about 0.01.

V Conclusion

In this paper, we propose an improved NR-LFIQA metric called ALAS-DADS. To the best of our knowledge, this work is the first exploration of deep auxiliary learning with spatial-angular hints for NR-LFIQA. In particular, the proposed ALAS-DADS is equipped with LF-DSC and LF-ASC, which are both highly effective, low-complexity feature extractors. To utilize the spatial and angular hints obtained from these feature extractors, we design an auxiliary tasking scheme consisting of spatial and angular feature estimation sub-tasks that assist the primary quality assessment task for more accurate IQA results. The experimental results show that ALAS-DADS significantly outperforms both mainstream FR-IQA metrics and the state-of-the-art NR-LFIQA methods. Overall, it achieves more than a 40% performance boost and shows robustness across various distortion types. In some categories, it gains significant improvements of more than 60%.

However, we are aware that our work is limited in the following aspects. First, our exploration of feature labels in the auxiliary tasks may not be exhaustive. In future work, we will attempt to update the labels of the auxiliary tasks with more advanced features, e.g., updating either the BRISQUE features or the GDD features, for further improvement of the model. Second, our investigation of the auxiliary tasks may not be thorough. In the future, we will attempt to introduce more auxiliary tasks, e.g., a new learning branch (sub-task) estimating the saliency map [48] to assist the primary IQA task. Third, due to the limitation of current public LFI datasets, we have not evaluated ALAS-DADS’s performance on some distortion types, such as JPEG Pleno compression [49]. In future work, we plan to establish a new LFI dataset or augment the current LFI datasets to cover more advanced compression techniques including JPEG Pleno, conduct subjective experiments, and evaluate our improved model accordingly.

Despite these limitations, we believe that our work contributes to the advancement of NR-LFIQA and offers theoretical insights for a wider spectrum of LFI research. We believe that the novel concepts introduced in this study, such as LF-DSC and LF-ASC, as well as the associated auxiliary learning scheme, can be readily extended to other LFI processing tasks, such as LFI super-resolution, LFI image classification, and LFI depth estimation. For example, the LF-DSC and LF-ASC can be employed as highly efficient generative LFI feature extractors for these tasks. Besides, the ALAS-DADS architecture as a whole can also be readily adapted to better perform the above LFI processing tasks.

References

  • [1] Y. Song, X. Chen, X. Wang, Y. Zhang, and J. Li, “6-DoF image localization from massive geo-tagged reference images,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1542–1554, 2016.
  • [2] P. Paudyal, F. Battisti, and M. Carli, “Reduced reference quality assessment of light field images,” IEEE Transactions on Broadcasting, vol. 65, no. 1, pp. 152–165, 2019.
  • [3] M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hedman, M. DuVall, J. Dourgarian, J. Busch, M. Whalen, and P. Debevec, “Immersive light field video with a layered mesh representation,” ACM Transactions on Graphics, vol. 39, no. 4, pp. 86:1–86:15, 2020.
  • [4] “Apple invents a light field panorama camera system for idevices and hmd that will create immersive scenes with 6 degrees of freedom,” https://www.patentlyapple.com/patently-apple/2020/04/apple-invents-a-light-field-panorama-camera-system-for-idevices-hmd-that-will-create-immersive-scenes-with-6-degrees-of-fre.html, accessed: 2021-03-21.
  • [5] “Eye-sensing light field display: Delivering 3D creators’ visions to customers the way they intended,” https://www.sony.net/SonyInfo/technology/stories/LFD/, accessed: 2021-03-21.
  • [6] L. Shi, S. Zhao, and Z. Chen, “BELIF: Blind quality evaluator of light field image with tensor structure variation index,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 3781–3785.
  • [7] L. Shi, W. Zhou, Z. Chen, and J. Zhang, “No-reference light field image quality assessment based on spatial-angular measurement,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4114–4128, 2019.
  • [8] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [9] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu, “Light field image processing: An overview,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926–954, 2017.
  • [10] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based view synthesis for light field cameras,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, pp. 1–10, 2016.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [12] I. Ihrke, J. Restrepo, and L. Mignard-Debise, “Principles of light field imaging: Briefly revisiting 25 years of research,” IEEE Signal Processing Magazine, vol. 33, no. 5, pp. 59–69, 2016.
  • [13] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [14] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
  • [15] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on image processing, vol. 20, no. 5, pp. 1185–1198, 2010.
  • [16] L. Zhang, Y. Shen, and H. Li, “VSI: A visual saliency-induced index for perceptual image quality assessment,” IEEE Transactions on Image processing, vol. 23, no. 10, pp. 4270–4281, 2014.
  • [17] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684–695, 2013.
  • [18] H. Ziaei Nafchi and M. Cheriet, “Efficient no-reference quality assessment and classification model for contrast distorted images,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 518–523, 2018.
  • [19] X. Min, G. Zhai, K. Gu, Y. Liu, and X. Yang, “Blind image quality estimation via distortion aggravation,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 508–517, 2018.
  • [20] Q. Yang, Z. Ma, Y. Xu, L. Yang, W. Zhang, and J. Sun, “Modeling the screen content image quality via multiscale edge attention similarity,” IEEE Transactions on Broadcasting, vol. 66, no. 2, pp. 310–321, 2020.
  • [21] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.
  • [22] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on image processing, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [23] R. Ferzli and L. J. Karam, “A no-reference objective image sharpness metric based on the notion of just noticeable blur (jnb),” IEEE transactions on image processing, vol. 18, no. 4, pp. 717–728, 2009.
  • [24] S. Suthaharan, “No-reference visually significant blocking artifact metric for natural scene images,” Signal Processing, vol. 89, no. 8, pp. 1647–1652, 2009.
  • [25] S. Varadarajan and L. J. Karam, “An improved perception-based no-reference objective image sharpness metric using iterative edge refinement,” in 2008 15th IEEE international conference on image processing.   IEEE, 2008, pp. 401–404.
  • [26] W. Zhou, L. Shi, Z. Chen, and J. Zhang, “Tensor oriented no-reference light field image quality assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 4070–4084, 2020.
  • [27] Y. Tian, H. Zeng, J. Hou, J. Chen, and K.-K. Ma, “Light field image quality assessment via the light field coherence,” IEEE Transactions on Image Processing, vol. 29, pp. 7945–7956, 2020.
  • [28] M. Abdi and S. Nahavandi, “Multi-residual networks: Improving the speed and accuracy of residual networks,” arXiv preprint arXiv:1609.05672, 2016.
  • [29] S. Targ, D. Almeida, and K. Lyman, “Resnet in resnet: Generalizing residual architectures,” arXiv preprint arXiv:1603.08029, 2016.
  • [30] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098, 2017.
  • [31] L. Xu, J. Li, W. Lin, Y. Zhang, L. Ma, Y. Fang, and Y. Yan, “Multi-task rank learning for image quality assessment,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 9, pp. 1833–1843, 2016.
  • [32] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo, “End-to-end blind image quality assessment using deep neural networks,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1202–1213, 2017.
  • [33] B. Yan, B. Bare, and W. Tan, “Naturalness-aware deep no-reference image quality assessment,” IEEE Transactions on Multimedia, vol. 21, no. 10, pp. 2603–2615, 2019.
  • [34] L. Li, H. Zhu, S. Zhao, G. Ding, and W. Lin, “Personality-assisted multi-task learning for generic and personalized image aesthetics assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 3898–3910, 2020.
  • [35] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” arXiv preprint arXiv:1611.05397, 2016.
  • [36] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7482–7491.
  • [37] L. Shi, S. Zhao, W. Zhou, and Z. Chen, “Perceptual evaluation of light field image,” in 2018 25th IEEE International Conference on Image Processing (ICIP).   IEEE, 2018, pp. 41–45.
  • [38] R. Zhang, F. Zhu, J. Liu, and G. Liu, “Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1138–1150, 2019.
  • [39] S. Ma, W. Liu, W. Cai, Z. Shang, and G. Liu, “Lightweight deep residual CNN for fault diagnosis of rotating machinery based on depthwise separable convolutions,” IEEE Access, vol. 7, pp. 57 023–57 036, 2019.
  • [40] E. Rahimian, S. Zabihi, S. F. Atashzar, A. Asif, and A. Mohammadi, “Xceptiontime: A novel deep architecture based on depthwise separable convolutions for hand gesture classification,” arXiv preprint arXiv:1911.03803, 2019.
  • [41] Z. Lu, H. W. Yeung, Q. Qu, Y. Y. Chung, X. Chen, and Z. Chen, “Improved image classification with 4D light-field and interleaved convolutional neural network,” Multimedia Tools and Applications, vol. 78, no. 20, pp. 29 211–29 227, 2019.
  • [42] P. Paudyal, F. Battisti, M. Sjöström, R. Olsson, and M. Carli, “Towards the perceptual quality evaluation of compressed light field images,” IEEE Transactions on Broadcasting, vol. 63, no. 3, pp. 507–522, 2017.
  • [43] H. W. F. Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen, “Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 137–152.
  • [44] J. Zhang, Y. Liu, S. Zhang, R. Poppe, and M. Wang, “Light field saliency detection with deep convolutional networks,” IEEE Transactions on Image Processing, vol. 29, pp. 4421–4434, 2020.
  • [45] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester, A Modern Introduction to Probability and Statistics: Understanding why and how.   Springer Science & Business Media, 2005.
  • [46] D. Zwillinger and S. Kokoska, CRC standard probability and statistics tables and formulae.   CRC Press, 1999.
  • [47] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
  • [48] J. Yang, C. Ji, B. Jiang, W. Lu, and Q. Meng, “No reference quality assessment of stereo video based on saliency and sparsity,” IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 341–353, 2018.
  • [49] P. Astola, L. A. da Silva Cruz, E. A. da Silva, T. Ebrahimi, P. G. Freitas, A. Gilles, K.-J. Oh, C. Pagliari, F. Pereira, C. Perra et al., “JPEG Pleno: Standardizing a coding framework and tools for plenoptic imaging modalities,” ITU Journal: ICT Discoveries, 2020.