
Multi-task deep CNN model for no-reference image quality assessment on smartphone camera photos

Abstract

The smartphone is the most successful consumer electronics product of today's mobile social network era. Camera quality and image post-processing capability are dominant factors in consumers' buying decisions. However, evaluating the quality of photos taken with smartphones remains labor-intensive work that relies on professional photographers and experts. As an extension of prior CNN-based NR-IQA approaches, we propose a multi-task deep CNN model with scene type detection as an auxiliary task. With shared model parameters in the convolution layers, the learned feature maps become more scene-relevant and enhance performance. The evaluation results show improved SROCC performance compared to traditional NR-IQA methods and single-task CNN-based models.

Index Terms—  Image quality assessment, No-reference IQA, Convolutional neural networks, Smartphone camera photo.

1 Introduction

Image quality assessment (IQA) methods are developed to predict image quality automatically, without human subjective judgment, which is known to be costly and time-consuming. It is evident that image distortions such as blur, noise, and JPEG compression artifacts strongly impact perceived visual quality. Thus, many distortion-oriented image quality databases have been constructed to support IQA research, such as LIVE [1], CSIQ [2], and TID2013 [3]. However, in today's mobile and social network era, most consumer photos are taken by smartphones with high resolution and compelling quality, stored on the device or sent to the cloud. Smartphone camera photos are usually original images without added distortion; any degradation is introduced by the smartphone camera's built-in Image Signal Processor (ISP). For the smartphone, the most successful consumer electronics product, camera quality and ISP post-processing capability are dominant factors in consumers' purchase decisions. Currently, the evaluation of smartphone camera photo quality still relies on professional photographers and experts.

IQA methods can be categorized into three types depending on whether the original reference image is involved: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA). Although NR-IQA is more challenging due to the lack of the original image, it remains an important research area because in some application scenarios the reference image is unavailable or does not even exist, as with smartphone camera photos. Therefore, applying existing NR-IQA research to smartphone camera photos has become an urgent industrial need. To bridge the gap, a new smartphone camera photo database [4], denoted SCPQD2020, was established to address image quality assessment on smartphone camera photos. Although there is no added distortion in the SCPQD2020 collection, the camera lens, the image sensor, and the processor determine the perceived quality.

Fig. 1: The proposed multi-task CNN model with scene type as an auxiliary task.

In this work, a multi-task deep convolutional neural network (CNN) model is proposed for NR-IQA on smartphone camera photos, denoted DCNNS. We add scene type detection as an auxiliary task to the primary quality prediction task and train the network with shared convolutional layer parameters simultaneously. Through evaluation, we demonstrate that our multi-task CNN model leads the optimization process to update the network weights to fit the authentic distortions of smartphone photos and achieves better quality prediction.

This paper is organized as follows. We first review relevant NR-IQA works in Section 2. Section 3 describes and discusses our proposed multi-task CNN model in detail. Then, we evaluate our DCNNS model and compare it to related NR-IQA methods in Section 4. Finally, we conclude the paper in the last section.

2 Related Works

NR-IQA research can be divided into two categories. One category focuses on designing better handcrafted features, i.e., Natural Scene Statistics (NSS) features characterizing the distribution of specific filter responses in the wavelet [5] or DCT [6] transform domain, but this approach is too slow to be used in real-world applications. BRISQUE [7] and NIQE [8] were later developed to extract features from the spatial domain with reduced computation time. The other category focuses on feature learning, which attempts to learn discriminant visual features automatically without handcrafting. CORNIA [9] demonstrates that it is possible to learn discriminant image features from raw image pixels instead of using handcrafted features.

To learn discriminant features, the proven success of Convolutional Neural Network (CNN) models in computer vision and image recognition inspired NR-IQA researchers. Kang et al. [10] applied a CNN model to NR-IQA and achieved state-of-the-art performance on the LIVE dataset with good cross-database generalization ability. Later, the same CNN architecture was revised into a compact multi-task CNN for simultaneously estimating image quality and identifying distortion types [11]. By recasting quality prediction as a multi-task problem with two different high-level tasks, the shared convolutional features help achieve performance similar to or better than the state of the art.

Deeper CNN models [12] and fine-tuning of CNN models pre-trained on large image datasets [13, 14] were employed to develop more accurate NR quality prediction methods. However, as Kim et al. pointed out in their survey [14], deeper or pre-trained CNN models raise NR-IQA performance to a level competitive with FR-IQA and handcrafted-feature-based NR-IQA methods mainly on legacy synthetic distortion databases [1, 2, 3]. Quality prediction results on authentic distortion databases, such as the LIVE "In the Wild" Challenge Database [15] and SCPQD2020 [4], still fall far behind the accuracy achieved on the legacy databases.

As traditional NR-IQA methods show unsatisfying results [4] on smartphone photos, Yao et al. [16] proposed a CNN model with residual blocks that integrates feature extraction and regression into one optimization process to predict image quality. They select salient regions using saliency maps generated by SalGAN, carefully extract feature maps from different aspects such as HSV color space conversion and Gabor wavelets, and then feed them to the CNN model as input. Their experiments show better performance on smartphone images than traditional NR-IQA methods.

3 Proposed Method

The proposed multi-task CNN model with scene type detection as an auxiliary task is presented in Figure 1. Following the observations of Kim et al. [14] in their survey, we adopt a deep CNN architecture for no-reference image quality assessment on smartphone camera photos to leverage the strong representation and generalization capability of CNNs. Smartphone camera photos are evaluated along four quality aspects: texture, color, noise, and exposure [4]. It is intuitive to exploit different image color spaces or to extract low-level features with handcrafted filters as in [16]. However, our attempts showed no significant difference with a deeper CNN, so we use the grayscale-converted image directly. Like traditional NR-IQA methods [7], we found that pre-processing can increase performance and employ local contrast normalization as follows:

\hat{I}(i,j) = \frac{I(i,j) - \mu(i,j)}{\sigma(i,j) + C}

The normalized pixel Î(i,j) at position (i,j) is obtained by subtracting μ(i,j), the mean of pixel values within a local window centered at (i,j), from the original pixel value I(i,j). The result is then divided by σ(i,j), the standard deviation of pixel values within the same window. The constant C=1 prevents division by zero. We calculate μ(i,j) and σ(i,j) with P=Q=3 as follows:

\mu(i,j) = \sum_{p=-P}^{P}\sum_{q=-Q}^{Q} I(i+p,\,j+q)
\sigma(i,j) = \sqrt{\sum_{p=-P}^{P}\sum_{q=-Q}^{Q}\left(I(i+p,\,j+q)-\mu(i,j)\right)^{2}}
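For concreteness, a minimal NumPy sketch of this normalization is given below. It computes μ and σ as the local mean and standard deviation over a uniform window, following the description in the text; the uniform weighting and the border handling are assumptions not fixed by the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_contrast_normalize(img, P=3, Q=3, C=1.0):
    """Local contrast normalization: (I - mu) / (sigma + C).

    mu and sigma are the local mean and standard deviation over a
    (2P+1) x (2Q+1) window centered at each pixel (uniform weighting
    and reflective borders are assumptions).
    """
    img = img.astype(np.float64)
    size = (2 * P + 1, 2 * Q + 1)
    mu = uniform_filter(img, size=size, mode="reflect")
    # Var = E[I^2] - E[I]^2; clip tiny negatives caused by rounding.
    var = uniform_filter(img ** 2, size=size, mode="reflect") - mu ** 2
    sigma = np.sqrt(np.clip(var, 0.0, None))
    return (img - mu) / (sigma + C)
```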

We crop the input images into non-overlapping 64×64 patches with a stride of 160. Different stride sizes could be used, but since the original smartphone camera images are high resolution and we choose relatively small patches, the resulting segmentation provides enough training data.
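A simple sketch of this cropping step follows; the grid origin and the handling of leftover border pixels are assumptions.

```python
import numpy as np

def extract_patches(img, patch_size=64, stride=160):
    """Crop patch_size x patch_size patches on a regular grid.

    With stride > patch_size the patches are non-overlapping, sparse
    samples of the (normalized) high-resolution image.
    """
    h, w = img.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(img[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)
```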

Through various experiments, we found that a CNN can effectively iterate and update the learned model parameters in the loss minimization process, but it is also easily trapped without further improvement once a certain level of performance is reached. Since multi-task learning in neural networks has demonstrated that learning multiple correlated tasks at the same time may improve overall performance [17], we reformulate quality prediction as a multi-task problem with two different high-level tasks. We add scene type detection as a secondary, auxiliary task that is trained simultaneously with the quality score prediction task in one CNN. We denote our proposed multi-task CNN model as DCNNS and describe the details of our method in the following sections.

3.1 Deep Multi-task CNN Model

The convolution layers of the proposed DCNNS adopt a similar but smaller deep CNN architecture in the style of VGG-16 [18]. The six convolution layers all use 3×3 filters with stride 1 and padding 1, which leads to two 64×64×32 layers, two 32×32×64 layers, and two 16×16×128 layers. Max pooling of 2×2 is performed twice to reduce the feature map resolution from 64×64 to 16×16. After the final convolution layer, a 16×16 global average pooling flattens the feature maps into a 128-dimensional fully connected output layer.

In the quality score prediction task, the convolution output layer connects to two 256-dimensional fully connected (FC) layers and then regresses to an overall quality score. The scene detection task has the same structure as the quality task, except that its last layer consists of four neurons representing the probabilities of the different scene types. Dropout with probability 0.5 is applied to the two 256-dimensional FC layers to prevent overfitting. All layers use ReLU activations.

Because the two tasks share the same model parameters in the convolution layer, the learned feature maps could become more scene-relevant and better model the image quality.
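The following PyTorch sketch reflects our reading of the architecture described above: a shared six-layer convolutional backbone, global average pooling to a 128-dimensional representation, and two task-specific heads. The exact placement of activations and pooling is an assumption where the text is not explicit.

```python
import torch
import torch.nn as nn

class DCNNS(nn.Module):
    """Sketch of the described multi-task CNN (grayscale 64x64 patch input)."""

    def __init__(self, num_scenes=4):
        super().__init__()

        def conv(cin, cout):
            return [nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU(inplace=True)]

        # Shared backbone: two 64x64x32, two 32x32x64, two 16x16x128 layers.
        self.features = nn.Sequential(
            *conv(1, 32), *conv(32, 32), nn.MaxPool2d(2),    # 64x64 -> 32x32
            *conv(32, 64), *conv(64, 64), nn.MaxPool2d(2),   # 32x32 -> 16x16
            *conv(64, 128), *conv(128, 128),
            nn.AdaptiveAvgPool2d(1),                         # 16x16 global average pooling
            nn.Flatten(),                                    # 128-d shared representation
        )

        def head(out_dim):
            return nn.Sequential(
                nn.Linear(128, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(256, 256), nn.ReLU(inplace=True), nn.Dropout(0.5),
                nn.Linear(256, out_dim),
            )

        self.quality_head = head(1)           # regresses the overall quality score
        self.scene_head = head(num_scenes)    # logits over the four scene types

    def forward(self, x):
        feat = self.features(x)
        return self.quality_head(feat).squeeze(-1), self.scene_head(feat)
```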

3.2 Scene Categorization

We observed that photos taken by smartphones fall into typical scene types, such as outdoor nature scenes, daylight buildings, indoor facilities, and night scenes. Different scenes have different characteristics and affect human perceptual quality judgment, so it is essential to provide scene information as a clue for quality prediction during training. To categorize scene types, each image is divided into 8×8 = 64 sub-blocks; the mean and standard deviation of each sub-block are calculated and accumulated into 16-bin histograms over all sub-blocks. To detect different kinds of edges, we apply the edge filters of the MPEG-7 image descriptors [19], shown in Figure 2, to each sub-block, calculate the percentage of edge pixels, and accumulate across all sub-blocks. As a result, we extract a 37-dimensional feature vector for each image and use the unsupervised K-means algorithm to cluster the images into four scene types.

\begin{bmatrix}1&-1\\ 1&-1\end{bmatrix}\quad\begin{bmatrix}1&1\\ -1&-1\end{bmatrix}\quad\begin{bmatrix}\sqrt{2}&0\\ 0&-\sqrt{2}\end{bmatrix}\quad\begin{bmatrix}0&\sqrt{2}\\ -\sqrt{2}&0\end{bmatrix}\quad\begin{bmatrix}2&-2\\ -2&2\end{bmatrix}

Fig. 2: Edge filters for finding scene clusters.
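A sketch of this categorization step is given below, under our reading that the 37 dimensions comprise a 16-bin histogram of sub-block means, a 16-bin histogram of sub-block standard deviations, and five edge-pixel ratios (one per filter in Figure 2). The histogram ranges and the edge-magnitude threshold are assumptions; the paper does not specify them.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.cluster import KMeans

# The five 2x2 MPEG-7 edge filters of Figure 2.
EDGE_FILTERS = [
    np.array([[1.0, -1.0], [1.0, -1.0]]),           # vertical
    np.array([[1.0, 1.0], [-1.0, -1.0]]),           # horizontal
    np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),   # 45-degree diagonal
    np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),   # 135-degree diagonal
    np.array([[2.0, -2.0], [-2.0, 2.0]]),           # non-directional
]

def scene_feature(img, edge_thresh=30.0):
    """37-d scene descriptor for an 8-bit grayscale image:
    16-bin histogram of sub-block means + 16-bin histogram of
    sub-block standard deviations + five edge-pixel ratios."""
    h, w = img.shape
    bh, bw = h // 8, w // 8
    means, stds = [], []
    edge_ratios = np.zeros(len(EDGE_FILTERS))
    for r in range(8):
        for c in range(8):
            block = img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].astype(float)
            means.append(block.mean())
            stds.append(block.std())
            for k, f in enumerate(EDGE_FILTERS):
                resp = np.abs(convolve2d(block, f, mode="valid"))
                edge_ratios[k] += (resp > edge_thresh).mean()
    mean_hist, _ = np.histogram(means, bins=16, range=(0, 255))
    std_hist, _ = np.histogram(stds, bins=16, range=(0, 128))
    return np.concatenate([mean_hist, std_hist, edge_ratios / 64.0])  # 16 + 16 + 5 = 37

def cluster_scenes(feature_matrix, k=4, seed=0):
    """Cluster the per-image feature vectors into k scene types with K-means."""
    return KMeans(n_clusters=k, random_state=seed).fit_predict(feature_matrix)
```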

The clustering result provides the scene type label for the scene detection task. Figure 3 shows some sample images of the categorization result. We can see that essential photography characteristics such as contrast, brightness, and texture are distinguishable in each scene type.

(a) Scene 0: night scenes
(b) Scene 1: scenes with complex texture and balanced exposure
(c) Scene 2: scenes with balanced lighting and contain both smooth region and complex texture
(d) Scene 3: scenes with less contrast and under exposure
Fig. 3: Sample images of scene categorization result

3.3 Loss Function and Learning

Let x_n and y_n denote the input patch and its quality label, and let f(x_n; w) be the network prediction of the quality score, where w denotes the network weights. The loss function of the quality task, L_q, is defined as follows:

L_{q} = \frac{1}{N}\sum_{n=1}^{N}\left\lVert f(x_{n};w)-y_{n}\right\rVert_{\ell_{1}} (1)

We apply the softmax function to obtain the probabilities of the scene categories and adopt cross-entropy as the loss function L_s, where f'(x_n; w') is the network prediction of the scene type with weights w'. The weights w and w' share the same convolution layer parameters and differ only in the FC layers. We define L_s as follows, where y_n^(s) denotes the scene label of x_n obtained from clustering:

L_{s} = \frac{1}{N}\sum_{n=1}^{N}H\left(f^{\prime}(x_{n};w^{\prime}),\,y_{n}^{(s)}\right) (2)

The overall loss function is defined as L = L_q + α L_s. We weight the scene detection loss L_s with α = 1.0 to guide the training process toward learning more scene-relevant features. When we train the two tasks simultaneously with the combined loss function L, the shared convolution layer parameters are updated in a way that improves both quality prediction and scene detection accuracy, as shown in Figure 4. The detection accuracy of each scene type varies across the four quality aspects, ranging from 0.28 to 0.58 depending on the scene's characteristics. Although the average scene type detection accuracy is around 0.41, the auxiliary task can be thought of as a weak classifier that boosts performance. Table 1 shows the scene type detection accuracy for each quality aspect at the corresponding best epoch.
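A compact sketch of the combined objective follows. It assumes the quality head outputs a scalar per patch and the scene head outputs unnormalized logits; PyTorch's cross_entropy applies the softmax internally.

```python
import torch.nn.functional as F

def multitask_loss(q_pred, q_label, s_logits, s_label, alpha=1.0):
    """L = Lq + alpha * Ls, following Equations (1) and (2)."""
    l_q = F.l1_loss(q_pred, q_label)           # Lq: mean absolute (l1) error
    l_s = F.cross_entropy(s_logits, s_label)   # Ls: softmax cross-entropy over scene types
    return l_q + alpha * l_s
```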

Fig. 4: Validation trends of SROCC and scene type detection accuracy over the training epochs.
Table 1: Scene type detection task accuracy
Scene Texture Color Noise Exposure
0 0.2867 0.3977 0.4103 0.2836
1 0.4228 0.3428 0.2852 0.4371
2 0.3629 0.4131 0.3730 0.4095
3 0.5639 0.5740 0.5869 0.5285
Average 0.4091 0.4319 0.4138 0.4147

For training, we split the SCPQD2020 database into 80% and 20% parts as training and validation sets. We employ the Adam optimizer with default settings in PyTorch, a learning rate of 0.001, and a batch size of 128. We train one CNN model per quality aspect for 50 epochs. Each model takes about 1.5 hours to complete the 50 epochs on an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of RAM.
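A training-loop sketch under these settings is shown below. It assumes the DCNNS module and multitask_loss from the earlier sketches and a hypothetical train_loader yielding (patch, quality label, scene label) batches of size 128.

```python
import torch

model = DCNNS().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam with PyTorch defaults

for epoch in range(50):
    model.train()
    for patches, q_labels, s_labels in train_loader:  # hypothetical data loader, batch size 128
        patches, q_labels, s_labels = patches.cuda(), q_labels.cuda(), s_labels.cuda()
        q_pred, s_logits = model(patches)
        loss = multitask_loss(q_pred, q_labels, s_logits, s_labels, alpha=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```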

4 Experimental Results

We compare our proposed multi-task CNN-based NR-IQA method with the traditional NR-IQA methods BRISQUE [7] and NIQE [8]. As expected, the traditional NR-IQA methods do not perform well on smartphone camera photos. We also compare our work with two CNN-based IQA methods: a shallow CNN network [10], denoted CNNIQA, and the residual-block CNN approach proposed by Yao et al. [16].

4.1 Dataset

The SCPQD2020 dataset [4] is composed of 1,500 photos of 100 scenes taken with 15 smartphones covering a wide range of prices and manufacturers. The dataset includes various challenging scenes such as outdoor nature scenes, indoor low-light scenes, backlight scenes, and night scenes. For the images of each scene, three professional photographers were recruited to rate subjective quality based on four aspects: texture, color, noise, and exposure. The exact quality scores are not provided in the dataset; instead, the quality ranking of the 15 images within each scene is disclosed, ranging from 1 to 15, where a lower rank means better quality.

4.2 Performance Metric

The quality prediction results for each quality aspect are sorted and compared with the ground-truth subjective ranking using the Spearman Rank Order Correlation Coefficient (SROCC). SROCC mainly reflects the consistency between two ranking distributions and is defined as:

SROCC = 1-\frac{6\sum_{i}d_{i}^{2}}{n(n^{2}-1)}

where n is the number of images and d_i is the rank difference between the subjective rank and the objective prediction of the i-th image.
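For reference, SROCC can be computed directly with SciPy; spearmanr handles ties with average ranks and otherwise matches the closed form above.

```python
from scipy.stats import spearmanr

def srocc(objective_scores, subjective_ranks):
    """Spearman rank-order correlation between predicted scores and
    ground-truth ranks; equals 1 - 6*sum(d_i^2)/(n*(n^2-1)) when there
    are no ties."""
    return spearmanr(objective_scores, subjective_ranks).correlation
```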

4.3 Performance Comparison

We randomly split the dataset into 80% and 20% parts as the training and testing sets. Each quality aspect has a separate prediction model; we train each model for 50 epochs, select the model with the highest SROCC, and repeat the process three times to report the average score. We evaluate the competing NR-IQA methods BRISQUE, NIQE, and CNNIQA on the same training sets and report their averaged SROCC values. Since the work of Yao et al. is not publicly available, we quote the performance numbers from their paper [16] as a reference. Table 2 shows the SROCC comparison.

Table 2: SROCC comparison of the proposed method and other IQA methods. The superior result is marked in bold.
Model Texture Color Noise Exposure
BRISQUE [7] 0.2065 0.1973 0.2139 0.2175
NIQE [8] 0.2518 0.2446 0.1896 0.2459
CNNIQA [10] 0.4405 0.4657 0.4485 0.3793
Yao et al. [16] 0.4414 0.4827 0.4525 0.4368
DCNNS 0.4899 0.4979 0.4723 0.4349

As expected, the traditional NR-IQA methods do not perform well since they are designed for synthetic distortions rather than for smartphone camera photos, which are original images without added distortion. Compared with the CNN-based NR-IQA model CNNIQA, our proposed DCNNS model benefits from both a deeper network architecture and the scene auxiliary task that helps learn more scene-relevant features. Although it may not be a head-to-head comparison, we quote the performance results from Yao et al. to indicate that a deeper network with multi-task learning can achieve similar or superior performance, even though their approach combines saliency detection, feature extraction, and color space conversion.

4.4 Discussion

Although we use scene type detection as an auxiliary task to assist more accurate image quality prediction, the scene type concept comes globally from all parts of an image and sometimes from image context. As shown in Figure 5, image 55 is a night scene by its context, but it is clustered as scene 2 image-wise, since it has balanced lighting and contains both smooth regions and complex texture.

(a) Image 55 is a night scene (scene 0) but is clustered as scene 2 (balanced lighting with smooth and complex regions). This image has the lowest detection accuracy, 0.03.
(b) Patches from a specific region are misclassified as scene 3 (less contrast and under exposure).
(c) Patches from a specific region are misclassified as scene 0 (night scene).
Fig. 5: Night scene images with low scene accuracy.

Since we label the scene type and predict the image quality on image patches of size 64×64, the scene type identified globally may not fit each individual patch from different image regions. During the training process, some patches from a specific region (Figure 5b) are classified as scene 3, the scene with less contrast and under exposure, while patches from another region (Figure 5c) are classified as scene 0, the night scene. The labeled ground truth, scene 2, conflicts with what the neural network learned from other training samples and keeps penalizing the optimization process through the loss function L_s in Equation (2). As a result, image 55 has the lowest average scene detection accuracy for every quality aspect and lower SROCC scores. We present the SROCC scores and scene detection accuracies of some representative images in Table 3.

Table 3: SROCC vs. scene accuracy for selected images
Image Texture Color Noise Exposure
55 SROCC 0.4286 0.2950 -0.1827 0.2631
Scene Acc. 0.0289 0.0323 0.0326 0.0324
85 SROCC 0.5869 0.3874 0.3805 0.6970
Scene Acc. 0.7785 0.7028 0.6022 0.7750
47 SROCC 0.6200 0.7296 0.7621 0.3806
Scene Acc. 0.6714 0.6746 0.6750 0.6777

On the other hand, we investigated some images with overall high scene accuracy. In Figure 6, image 85, labeled as scene 1 (complex texture and balanced exposure), achieves an average scene accuracy of 0.7146 and an above-average SROCC of 0.5130. Another image, 47, of scene 3 (less contrast and under exposure), also has a high scene accuracy of 0.6747, which leads to an SROCC of 0.6231, significantly higher than average. Detailed performance numbers are listed in Table 3.

(a) Image 85: scene 1
(b) Image 47: scene 3
Fig. 6: Images with higher scene accuracy have higher than average SROCC

From the analysis of SROCC versus scene accuracy, we demonstrate that scene accuracy has a high impact on image quality prediction, which also echoes the performance boost from the auxiliary task. As the night scene detection challenge points out, how to effectively fuse the global image-wise features with the local patch-wise features to correctly identify the scene type will be key to further improving smartphone photo quality assessment.

5 Conclusion

This paper proposes a multi-task deep CNN model with scene type detection as an auxiliary task. A simple image clustering method is used to label the scene types of images for the scene detection task, thereby guiding the optimization process to better fit smartphone camera photos. The evaluation results show improved SROCC performance compared to traditional NR-IQA methods and single-task CNN-based models.

ACKNOWLEDGEMENT

This work is partially supported by the Ministry of Science and Technology, Taiwan and CITI SINICA, ROC, under the grant numbers of MOST 108-2218-e-002-055, MOST 108-2221-E-002-103-my3, and Sinica 3012-C3447.

References

  • [1] Hamid R Sheikh, Muhammad F Sabir, and Alan C Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440–3451, 2006.
  • [2] Eric Cooper Larson and Damon Michael Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of electronic imaging, vol. 19, no. 1, pp. 011006, 2010.
  • [3] Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, et al., “Image database tid2013: Peculiarities, results and perspectives,” Signal processing: Image communication, vol. 30, pp. 57–77, 2015.
  • [4] Wenhan Zhu, Guangtao Zhai, Zongxi Han, Xiongkuo Min, Tao Wang, Zicheng Zhang, and Xiaokang Yang, “A multiple attributes image quality database for smartphone camera photo quality assessment,” arXiv preprint arXiv:2003.01299, 2020.
  • [5] Anush Krishna Moorthy and Alan Conrad Bovik, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.
  • [6] Michele A Saad, Alan C Bovik, and Christophe Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” IEEE transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
  • [7] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on image processing, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [8] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, pp. 209–212, 2012.
  • [9] Peng Ye, Jayant Kumar, Le Kang, and David Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” 2012 IEEE conference on computer vision and pattern recognition, pp. 1098–1105, 2012.
  • [10] Le Kang, Peng Ye, Yi Li, and David Doermann, “Convolutional neural networks for no-reference image quality assessment,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1733–1740, 2014.
  • [11] Le Kang, Peng Ye, Yi Li, and David Doermann, “Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks,” 2015 IEEE international conference on image processing (ICIP), pp. 2791–2795, 2015.
  • [12] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek, “Neural network-based full-reference image quality assessment,” 2016 Picture Coding Symposium (PCS), pp. 1–5, 2016.
  • [13] Yuming Li, Lai-Man Po, Litong Feng, and Fang Yuan, “No-reference image quality assessment with deep convolutional neural networks,” 2016 IEEE International Conference on Digital Signal Processing (DSP), pp. 685–689, 2016.
  • [14] Jongyoo Kim, Hui Zeng, Deepti Ghadiyaram, Sanghoon Lee, Lei Zhang, and Alan C Bovik, “Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment,” IEEE Signal processing magazine, vol. 34, no. 6, pp. 130–141, 2017.
  • [15] Deepti Ghadiyaram and Alan C Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2015.
  • [16] Chang Yao, Yuri Lu, Hang Liu, Menghan Hu, and Qingli Li, “Convolutional neural networks based on residual block for no-reference image quality assessment of smartphone camera images,” 2020 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–6, 2020.
  • [17] Rich Caruana, “Multitask learning,” Machine learning, vol. 28, no. 1, pp. 41–75, 1997.
  • [18] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [19] Bangalore S Manjunath, J-R Ohm, Vinod V Vasudevan, and Akio Yamada, “Color and texture descriptors,” IEEE Transactions on circuits and systems for video technology, vol. 11, no. 6, pp. 703–715, 2001.