Deep Shape-Texture Statistics for Completely Blind Image Quality Evaluation

Yixuan Li, Peilin Chen, Hanwei Zhu, Keyan Ding, Leida Li, , and Shiqi Wang Y. Li, P. Chen, H. Zhu, and S. Wang are with the Department of Computer and Science, City University of Hong Kong, Hong Kong (e-mail: yixuanli423@gmail.com; plchen3@cityu.edu.hk; hanwei.zhu@my.cityu.edu.hk; shiqwang@cityu.edu.hk). K. Ding is with the Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Zhejiang, China (e-mail: dingkeyan@zju.edu.cn). L. Li is with the School of Artificial Intelligence, Xidian University, Xi’an, China (e-mail: reader1104@hotmail.com). Corresponding author: Shiqi Wang.

Abstract

Opinion-Unaware Blind Image Quality Assessment (OU-BIQA) models aim to predict image quality without training on reference images and subjective quality scores. Thereinto, image statistical comparison is a classic paradigm, while the performance is limited by the representation ability of visual descriptors. Deep features as visual descriptors have advanced IQA in recent research, but they are discovered to be highly texture-biased and lack of shape-bias. On this basis, we find out that image shape and texture cues respond differently towards distortions, and the absence of either one results in an incomplete image representation. Therefore, to formulate a well-round statistical description for images, we utilize the shape-biased and texture-biased deep features produced by Deep Neural Networks (DNNs) simultaneously. More specifically, we design a Shape-Texture Adaptive Fusion (STAF) module to merge shape and texture information, based on which we formulate quality-relevant image statistics. The perceptual quality is quantified by the variant Mahalanobis Distance between the inner and outer Shape-Texture Statistics (DSTS), wherein the inner and outer statistics respectively describe the quality fingerprints of the distorted image and natural images. The proposed DSTS delicately utilizes shape-texture statistical relations between different data scales in the deep domain, and achieves state-of-the-art (SOTA) quality prediction performance on images with artificial and authentic distortions.

Index Terms:

Blind image quality assessment, image statistics, shape-texture bias

I Introduction

In the digital content boosting era, high-quality images are highly desired. However, images are inevitably corrupted by ubiquitous distortions during image formulation or after-processing. Quality perception is the natural instinct for the human visual system (HVS), but it is nontrivial to be modeled mathematically. Therefore, it is essential to design effective image quality measures for monitoring and evaluating the visual quality of digital images.

Blind image quality assessment (BIQA) targets at mimicking human quality perception and automatically quantifies quality disturbance without information from corresponding pristine images. It is necessarily adaptable to wide application scenarios but is challenging due to the absence of reference information. Essentially, BIQA methods [1, 2, 3, 4, 5, 6, 7] follow the pipeline of “Feature extraction-Supervised feature fusion-Quality prediction”, where the mapping to the objective quality is learned under the supervision of mean opinion scores (MOSs), e.g., learning a regression model. In other words, these models are exposed to MOSs while training. However, MOSs are the subjective image quality labels that are singly annotated by crowd-sourcing, becoming costly and inefficient in today’s data-boosting context. Such dependence on the supervision of MOSs may pose challenges in scalability and efficiency of quality prediction. Recognizing these challenges, the need for opinion-unaware (OU) BIQA methods is becoming increasingly pressing. Such OU-BIQA models offer a promising direction by eliminating the reliance on subjective opinions and reference images in model learning, paving the way for more robust and versatile image quality assessment in diverse applications.

Refer to caption — Figure 1: Visualization of image shape-texture activation and statistical distribution. The top row contains a distorted image and its Class Activation Mapping (CAM) maps [8] of shape and texture models, which indicate the regions attended by models. Cooler color indicates more attentions paid by models. The second row contains the t-SNE [9] visualized image statistical distributions in terms of shape, texture, and shape-texture deep features.

In contrast to opinion-aware (OA) BIQA methods, OU-BIQA models can assess the quality of an arbitrary image without training on MOSs, aiming at completely blind. Current OU-BIQA methods can be classified into three types: natural scene statistics (NSS) based [10, 11, 12, 13, 14, 15], pseudo-label assisted [16, 17, 18], and neural network based [19]. Thereinto, the NSS-based statistical comparison methods takes a preponderant role due to the sound NSS theories. NSS models demonstrate that natural images possess high statistical regularities [20, 21, 22, 23], whose quality representation capability has been thoroughly investigated in different domains, such as spatial domain [11], wavelet transform domain [2], and discrete cosine transform domain [3]. However, the performance of NSS-based methods still remains somewhat unsatisfactory in OU-BIQA, implying the projected coefficients may not sufficiently capture meaningful information denoting the intrinsic image quality. Meanwhile, IQA methods utilizing pretrained deep neural networks (DNNs) have gained prominence in recent years [24, 25, 26, 27, 28, 29], attributed to the powerful quality representation capability of deep learning features. Therefore, we tend to cultivate the statistical property of images within a more quality-relevant domain, i.e., the deep learning feature domain. However, the IQA-preferred off-the-shelf deep features produced by current pretrained DNNs are verified to be highly texture-biased [30, 31, 32, 33], referring to the over-reliance on object texture compared with the shape. On the contrary, it is universally revealed that human cognition highly relies on shapes (e.g., silhouettes of cats) in images instead of textures (e.g., markings of fur) [34, 35]. Accordingly, utilizing singly texture-biased deep features may not be able to represent image quality completely. The shape-biased and texture-biased features turn out to attend on complementary image areas (as shown in the first row of Fig. 1), where they accordingly exhibit different quality statistics that reflect different patterns of manifestation (as shown in the second row of Fig. 1), which can be employed to formulate a well-rounded shape-texture statistical description for image quality. Besides, the effectiveness of incorporating shape-relevant cues in IQA has been validated by previous works [36, 37, 38]. Therefore, in this paper, we incorporate shape-biased with texture-biased deep features which can compromise a more complete quality description of images.
Herein, we make one of the first attempts to investigate the shape-texture oriented image statistics, based on which we establish an NSS-based model for completely blind IQA. In specific, we utilize two DNN branches to respectively extract shape-biased and texture-biased deep features, and we design a Shape-Texture Adaptive Fusion (STAF) module to merge them together. Based on the shape-texture oriented deep representation, we formulate the inner statistics that denote the intrinsic quality patterns of each distorted image, and the outer statistics that represent the quality fingerprints of the natural image domain. The model predicts image quality by measuring the statistical Distance between the inner and outer Shape-Texture Statistics (DSTS). The proposed deep shape-texture statistical notion is quality-aware and free of task-specific training, enabling DSTS to be able to measure the perceptual quality effectively without reference images, MOSs, or any other additional information. We perform extensive experiments to demonstrate the superiority of the proposed DSTS. It turns out that DSTS has conspicuous performance in terms of quality prediction accuracy, generalization ability, and personalized quality assessment.

II Related Works

II-A OU-BIQA Methods

The pioneer work of OU-BIQA is proposed by Mittal et al. [10], where the quality-aware visual-word distributions are formulated to evaluate visual quality without subject opinions, but the distortion type as auxiliary information is still required. Later, in [11], a natural image quality evaluator (NIQE) is proposed based on the comparison of NSS features between a bunch of pristine images and the distorted image. Furthermore, Zhang et al. [12] and Liu et al. [13] improved the NIQE to a patch-level evaluation in terms of refined quality features. Xue et al. [39] addressed the opinion-unaware problem by learning quality-aware centroids as the codebook to predict image patch quality, while it can only deal with four distortion types. In [40], local patch features are extracted and projected into quality scores according to unsupervised patch distortion classification. In [18], the authors proposed to train a model using pseudo opinion scores derived from full-reference IQA models, while in [41] the pseudo quality rank is employed for supervision. Wu et al. [42] proposed an unsupervised effective local pattern statistics index (LPSI) by reforming local binary patterns into quality-aware features. Ma et al. [17] chose best-trusted IQA measures to form quality-discriminable image pairs and adopted the pair-wise learning-to-rank algorithm to learn the quality projection without subjective opinions. Zhu et al. [19] proposed to employ the discriminator of trained Wasserstein generative adversarial networks (WGAN) to measure the distance between pristine images and distorted image. Thereafter, Wang et al. [16] established a new image dataset with pseudo quality labels, based on which they trained a DNN-based IQA model considering the domain shift caused by distortion and content between the distorted and natural images. Though numerous OU-BIQA models are proposed, the one that has better correlations to human visual quality is still in need.

II-B Shape and Texture Bias in CNNs

ImageNet [43]-pretrained CNNs exhibit excellent performance in many computer vision tasks [44]. However, it has been found that they extensively rely on superficial textural cues rather than shapes, which runs counter to human vision [34, 35]. Geirhos et al. [30] initially propounded CNNs’ over-preference towards texture as texture-bias. The authors consolidated the opposite bias between the recognition of human and CNNs using cue-conflict images (e.g., a zebra-textured cat). They proposed a CNN training strategy for better robustness by removing texture information in images with style transfer models. Islam et al. [31] provided further evidence that texture bias appears not only in the outputs of CNNs, but also in the latent representations of intermediate layers. Thereafter, Mummadi et al. [32] pointed out that solely increasing the shape-bias of CNNs can result in increased sensitivity towards distortions, which can conversely benefit the representation capability of image quality corruptions. Recently, Qiu et al. [45] demonstrated that integrating shape and texture information yields superior out-of-distribution robustness compared to the adoption of either attribute in isolation. Therefore, incorporating shape and texture information simultaneously can formulate a unified shape-texture deep representation for images.

II-C Image Statistics

Image statistics exhibit distinct properties and connotations across diverse scales of image space. According to data range, it can be classified into scene scale statistics, domain scale statistics, and single-image scale statistics.

Scene Scale Statistics Image scene statistics refer to the general property possessed by mass natural images. For example, the NSS models reveal that the responses of natural images to band-pass filters obey the generalized Gaussian distribution (GGD) or Gaussian scale mixture model (GSM)[22, 20, 21, 23].

Domain Scale Statistics Domain-scale statistics reflect the shared characters in terms of content, attribute, or style of an image set, which are mostly gained by training DNNs on image databases [46]. To this extent, the domain-scale image statistics derived from a set of homogeneous and representative images can be regarded as a sufficient estimation of scene-scale image statistics de facto.

Single Image Scale Statistics Single-image statistics represent the image internal statistical characteristics, which can be captured by analyzing the data distribution of its diverse local patches. In [47], Zhang et al. stated that the 3D transform coefficients of all image blocks within a single natural image together with those of their non-local similar blocks satisfy the GGD. In [48] and [49], GANs were adopted to learn the statistical patch distribution of a single image.

III The proposed DSTS for OU-BIQA

III-A Problem Formulation

To predict the quality of distorted images, we first formulate image outer and inner statistical distributions and quantify visual quality with a quality-aware distance measure. In principle, the statistical distributions are based upon the shape-texture statistics. Herein, we denote the density $p(\mathbf{x})$ of the discrete image embedding $\varrho$ as the sum of Dirac delta distributions [50]:

p(\mathbf{x})=\frac{1}{N}\sum_{n=1}^{N}\delta_{\varrho_{n}}(\mathbf{x}),

(1)

where $x$ denotes the observation of the random variable, $N$ denotes the component number of $\varrho$ , and $\varrho_{n}$ is the $n\mbox{-}th$ component. We aim to define a quality-aware statistical distance measure between the inner and outer statistical distributions on the basis of the classic Mahalanobis Distance (MD) [51]. As such, i.e., the larger the distance, the lower the visual quality. Suppose that the outer statistics obey the probability distribution with the mean value vector $\bm{\mu}_{G}$ and covariance matrix $\bm{\Sigma}_{G}$ , meanwhile the inner statistics are with $\bm{\mu}_{M}$ and $\bm{\Sigma}_{M}$ . Then their MD, denoted as $D_{m}(p_{G}(\mathbf{x}),p_{M}(\mathbf{x}))$ , is given by,

	$\displaystyle D_{m}(p_{G}(\mathbf{x}),p_{M}($	$\displaystyle\mathbf{x}))=$		(2)
		$\displaystyle\sqrt{(\bm{\mu}_{G}-\bm{\mu}_{M})^{\intercal}(\frac{\bm{\Sigma}_{G}+\bm{\Sigma}_{M}}{2})^{-1}(\bm{\mu}_{G}-\bm{\mu}_{M})}\ ,$		(2)

However, MD is an effective statistical distance metric mathematically yet remains limited in quantifying visual quality, because MD is position-agnostic in terms of the location relations between image patches [50], which are crucial for visual perception and content comprehension.

Therefore, we approximate the inner distribution with the patch samples, and the MDs between inner samples and the outer distribution are finally merged, leading to the final quality-aware distance,

\small D_{q}(p_{G}(\mathbf{x}),p_{M}(\mathbf{x}))=\mathcal{F}(\{D_{m}(p_{G}(\mathbf{x}),\bm{M}_{i})\}),\ i\in[1,W].

(3)

Specifically, for the sample space $\bm{S}_{M}=\{\bm{M}_{1},\bm{M}_{2},\ldots,\bm{M}_{W}\}$ of $p_{M}(\mathbf{x})$ , the MD between each sample $\bm{M}_{i}$ and $p_{G}(\mathbf{x})$ is

\displaystyle D_{m}(p_{G}(\mathbf{x}),\bm{M}_{i})=\sqrt{(\bm{\mu}_{G}-\bm{M}_{i})^{\intercal}(\frac{\bm{\Sigma}_{G}+\bm{\Sigma}_{M}}{2})^{-1}(\bm{\mu}_{G}-\bm{M}_{i})}.

(4)

It is trivial to see that $D_{m}(p_{G}(\mathbf{x}),p_{M}(\mathbf{x}))$ can be represented as the linear combination of $\{D_{m}(p_{G}(\mathbf{x}),M_{i})\}$ , $i\in[1,W]$ . As such, by adjusting the weights according to the perceptual importance of each sample [12], we can formulate a new quality-aware statistical distance measure:

\displaystyle D_{q}(p_{G}(\mathbf{x}),p_{M}(\mathbf{x}))=\sum_{i=1}^{W}\omega_{i}\cdot D_{m}(p_{G}(\mathbf{x}),M_{i}),

(5)

where $\omega_{i}$ is the carefully designed content weighting with the constraint of $\sum_{i=1}^{W}\omega_{i}=1$ , and $W$ is the number of sample partitions. In this case, the $\mathcal{F}$ in Eqn. (3) is a linear combination. According to the problem formulations above, the proposed framework is crystallized. Overall, the pipeline is illustrated in Fig. 2.

III-B Shape-Texture Statistics Modeling

In this subsection, we mathematically extract the inner and outer shape-texture statistics. We first present how the image is transformed into the shape-texture oriented perceptual domain represented with deep features, and then we establish the statistical modeling of both inner and outer statistics.

Shape-Texture Oriented Perceptual Transformation Deep features of pretrained classification networks have been repeatedly shown to efficiently represent the rich visual information from the raw pixel domain, which is crucial for characterizing perceptual distortions [28]. Along this vein, we adopt the shape-biased and texture-biased EfficientNet-b7 [52] simultaneously as transformation backbones, denoted as $\vartheta_{s}(\cdot)$ and $\vartheta_{t}(\cdot)$ respectively. The overall perceptual transformation process is illustrated in Fig. 3. Given an arbitrary image $\bm{I}$ , the transformation can be obtained by $\tilde{\bm{I}}_{s}=\vartheta_{s}(\bm{I})$ and $\tilde{\bm{I}}_{t}=\vartheta_{t}(\bm{I})$ . We employ multiple-layer latent features simultaneously to capture size-dependent visual patterns contained in image patches [53]. In total, five layers of outputs are included respectively for the two substrates, which correspond to the $2^{nd}$ to $6^{th}$ convolutional stages. To fuse them, we iteratively conduct spatial down-sampling on the first to forth layers of outputs for four to one times respectively, using the 2-D low-pass filter $\bm{\psi}(\cdot)$ with the stride of 2, where the $\bm{\psi}(\cdot)$ specified by $\{r_{k,l}|k=-K,...,K;l=-L,...,L\}$ is a unit-volume circularly symmetric Gaussian window. Finally, the five components are sequentially concatenated along channels. As such, we can obtain the shape-biased and texture-biased deep features $\bm{\tilde{I}}_{s}$ and $\bm{\tilde{I}}_{t}$ respectively.

To merge the shape and texture biased features, we propose a Shape-Texture Adaptive Fusion (STAF) module, to obtain the shape-texture deep embedding, denoted as $\bm{\tilde{I}}_{m}=STAF(\bm{\tilde{I}}_{s},\bm{\tilde{I}}_{t})$ . In specific, we first compute the variance along the channel dimension at each spatial locations,

\bm{V}_{s}(p,q)=var(\bm{\tilde{I}}_{s}(p,q)),\ \bm{V}_{t}(p,q)=var(\bm{\tilde{I}}_{t}(p,q)),

(6)

where $(p,q)$ denotes the spatial indices, $p\in\{1,2,...,P\}$ , $q\in\{1,2,...,Q\}$ , and $P$ and $Q$ denote feature’s spatial dimensions. The $var(\cdot)$ denotes the computation of variance. We then calculate the shape-biased and texture-biased attentions respectively:

		$\displaystyle\bm{A}_{s}(p,q)=\frac{\bm{V}_{s}(p,q)}{\bm{V}_{s}(p,q)+\bm{V}_{t}(p,q)},$		(7)
		$\displaystyle\bm{A}_{t}(p,q)=\frac{\bm{V}_{t}(p,q)}{\bm{V}_{s}(p,q)+\bm{V}_{t}(p,q)}.$		(7)

The final shape-texture embedding is obtained by adaptively aggregating $\bm{\tilde{I}}_{s}$ and $\bm{\tilde{I}}_{t}$ together:

\bm{\tilde{I}}_{m}=\bm{\tilde{I}}_{s}\odot\bm{A}_{s}+\bm{\tilde{I}}_{t}\odot\bm{A}_{t},

(8)

where $\odot$ denotes the Hadamard product. In this case, an image is transformed into the shape-texture oriented perceptual domain.

Statistical Modeling Generally speaking, image local statistical descriptions could work as efficient visual pattern descriptors [54]. Therefore, we first compute the intra-channel local average maps $\bm{I}_{\mu}$ and standard deviation maps $\bm{I}_{\sigma}$ of $\bm{\tilde{I}}_{m}$ respectively:

\bm{I}_{\mu}(p,q)=\sum^{K}_{k=-K}\sum^{L}_{l=-L}r_{k,l}\cdot\bm{\tilde{I}}_{m}(p+k,q+l),

(9)

\bm{I}_{\sigma}(p,q)=\sqrt{\sum^{K}_{k=-K}\sum^{L}_{l=-L}r_{k,l}\cdot(\tilde{\bm{I}}_{m}(p+k,q+l)-\bm{I}_{\mu}(p,q))^{2}}.

(10)

To make the statistical analysis across different convolutional layers with uncertain magnitude ranges consistent, we conduct the layer-wise $\ell_{2}$ -normalization to project $\bm{I}_{\mu}$ of convolutional layer $i$ onto the unit hypersphere using

\bm{\tilde{I}}_{\mu i}=\frac{\bm{I}_{\mu i}}{{||\bm{I}_{\mu i}||}_{2}},

(11)

and then all the $\tilde{\bm{I}}_{\mu i}$ components are concatenated into $\bm{\tilde{I}}_{\mu}$ sequentially. We utilize $\gamma$ to denote $(p,q)$ for simplification, where $\gamma\in\{1,W\}$ and $W$ is the number of spatial patches.

In DSTS, the component $\bm{I}_{\mu}^{(\gamma)}$ at each spatial location $\gamma$ is treated as a random observation sampled from the inner and outer statistical distributions. These distributions are multivariate with respect to the dimension of feature channel number $c$ . For the image outer statistics, it requires an adequate amount of candidate images as the knowledge source to describe the statistical characteristics of images. However, not all observations are entirely equal in representing the quality of the natural image domain. We observe that heterogeneous image areas react more sensibly to distortion disturbance compared with homogeneous areas, implying that they can reflect more quality-relevant cues. Therefore, we only consider the samples containing vital image structures (e.g., edges, corners). To distinguish these patches apart, we compute the cross-channel average of the deviation $\bm{I}_{\sigma}$ to form a structure indicator:

\bm{\tilde{I}}_{\sigma}^{(\gamma)}=\frac{1}{c}\sum_{o=1}^{c}\bm{I}_{\sigma o}^{(\gamma)},

(12)

where $o$ is the channel index. We only consider the patches with $\bm{\tilde{I}}_{\sigma}^{(\gamma)}\geq\tau$ , where $\tau$ is the filtering threshold. Then we stack all qualified observations derived from all the good-quality images into the sample set $\bm{S}_{G}\in\mathbb{R}^{s\times c}$ , where $s$ denotes the number of qualified observations from all pristine images. The outer statistics are characterized by the mean and covariance of $\bm{S}_{G}$ , denoted as $\bm{\mu}_{G}$ and $\bm{\Sigma}_{G}$ :

\bm{\mu}_{G}(1,j)=\frac{1}{s}\sum^{s}_{i=1}\bm{S}_{G}(i,j),

(13)

{\bm{\Sigma}_{G}}(i,j)=\frac{1}{s-1}(\bm{S}_{G}-{\bm{\mu}_{G}})^{\intercal}(\bm{S}_{G}-{\bm{\mu}_{G}}),

(14)

where $i,\ j$ denote element indice in $\bm{S}_{G}$ , $\bm{\mu}_{G}\in\mathbb{R}^{1\times c}$ , and ${\bm{\Sigma}_{G}}\in\mathbb{R}^{c\times c}$ .

Meanwhile, the image inner statistics aim at capturing the intrinsic quality pattern contained in a single image. Analogously, we count $\bm{\tilde{I}}_{\mu}^{(\gamma)}$ of the distorted image as a random observation of the inner distribution, and stack into $\bm{S}_{M}\in\mathbb{R}^{{s}^{\prime}\times c}$ , where ${s}^{\prime}$ denotes the number of observations in the distorted image. In this case, all the samples are from the same distorted image without filtering. Similarly, the inner statistics are specified with the mean and covariance of $\bm{S}_{M}$ , represented by $\bm{\mu}_{M}$ and $\bm{\Sigma}_{M}$ .

III-C DSTS: Distance Based on Shape-Texture Statistics

In Section III-A, we have specified the quality-aware distance measure between the inner and outer statistical distributions $p_{M}(\mathbf{x})$ and $p_{G}(\mathbf{x})$ on the basis of Mahalanobis distance (as shown in Eqn. (4) and Eqn. (5)). It is worth mentioning that the outer statistics are obtained before the quality prediction stage. As such, DSTS is a completely blind image quality indicator de facto, where each distorted image is insulated from its pristine counterpart and quality label. To ensure the covariance matrices $\bm{\Sigma}_{G}$ and $\bm{\Sigma}_{M}$ positive definite, a regularization term $\lambda$ is added to the diagonal elements:

\bm{\Sigma}_{G}=\bm{\Sigma}_{G}+\lambda\cdot I_{0},\ \bm{\Sigma}_{M}=\bm{\Sigma}_{M}+\lambda\cdot I_{0},

(15)

where $I_{0}$ is the identity matrix, and $\lambda=1\times e^{-6}$ . The DSTS index of each distorted image is computed with

	$\displaystyle DS$	$\displaystyle TS=D_{q}(p_{G}(\mathbf{x}),p_{M}(\mathbf{x}))=$		(16)
		$\displaystyle\sum^{W}_{\gamma=1}\bm{\tilde{I}}_{\sigma}^{(\gamma)}\cdot\sqrt{(\bm{\mu}_{G}-\bm{\tilde{I}}_{\mu}^{(\gamma)})^{\intercal}(\frac{\bm{\Sigma}_{G}+\bm{\Sigma}_{M}}{2})^{-1}(\bm{\mu}_{G}-\bm{\tilde{I}}_{\mu}^{(\gamma)})},$		(16)

where $\bm{\tilde{I}}_{\sigma}^{(\gamma)}$ serves as the content weighting parameter in Eqn. (5). Therefore, a larger DSTS value indicates a lower quality level.

III-D Connections to Relevant IQA Methods

The proposed DSTS scheme is closely related to several IQA models. Herein, we elaborate on the connections and differences.

NIQE and its variants: The NIQE [11] and its variants [12, 13, 55] have been proposed to build NSS models based on low-level visual descriptors of images, where the NSS fitting parameters are extracted and utilized as the to-be-compared quality features. Our model partially inherits this paradigm. Specifically, instead of extracting handcrafted NSS features from the low-level visual descriptors, we directly employ the shape-texture statistics based on deep features to achieve statistical comparisons, which exploits more visually relevant quality cues.

Deep Statistical Comparisons based FR-IQA: The statistical comparison FR-IQA models aim to model the HVS by comparing feature statistics [56]. In [57, 58, 59], feature distribution comparisons have been verified to be effective in image quality optimization. Deep Wasserstein distance [29] calculates the visual quality based on the 1-D Wasserstein distance between the statistical distributions of deep features derived from the distorted and reference image patches. Deep distance correlation [60] measures both linear and non-linear correlations in the deep feature domain. Although they both make comparisons between the intermediate responses of DNN features, the effectiveness of such statistical comparisons highly relies on the pristine reference images for FR-IQA. In contrast, DSTS extracts and compares quality-relevant statistics in respect of shape and texture without the necessity to align by semantic cues or contents.

IV Experiments

In this section, we first provide the experimental setups, and then extensively verify the effectiveness of the proposed DSTS regarding image quality prediction accuracy on 13 IQA databases with a broad spectrum of distortions. Moreover, we demonstrate DSTS’s superiority on generalization ability and personalized quality assessment even compared with OA-BIQA models.

IV-A Experimental Settings

IV-A1 IQA Databases

To validate the performance of the proposed DSTS, we evaluate it on (1) five synthetic distortion IQA databases: LIVE [61], CSIQ [62], TID2013 [63], KADID-10k [64], MDIVL [65], and MDID [66]; (2) five authentic distortion IQA databases: KonIQ-10k [67], CLIVE [68], BID [69], CID2013 [70], and SPAQ [71]; and (3) three databases containing generative distortions: AGIQA-3k [72], GFIQA-20k [73], and LGIQA [74] (containing cityscapes, cat and ffhq subsets). The database details are listed in the supplementary materials (SM).

IV-A2 Evaluation Criteria

Three criteria are employed to evaluate performance, including Pearson Linear Correlation Coefficient (PLCC), Root Mean Square Error (RMSE), and Spearman Rank order Correlation Coefficient (SRCC). In particular, the PLCC and RMSE are responsible for prediction accuracy, while SRCC can reflect the monotonicity of prediction. The IQA method contains superiority when it has higher PLCC, SRCC, and lower RMSE. The PLCC and RMSE are computed after a five-parameter non-linear mapping between subjective and objective scores using the function as follows:

f(x)=a_{1}\left\{\frac{1}{2}-\frac{1}{1+exp[{a_{2}(x-a_{3})]}}\right\}+a_{4}x+a_{5},

(17)

where $x$ and $f(x)$ respectively denotes the predicted and mapped score, and $a_{1}$ to $a_{5}$ are fitting parameters for logistic regression.

TABLE I: Quality Prediction Performance Comparisons on Individual Database Containing Artificial Distortions. The First and Second Best Performance are Highlighted in Bold and Underline Respectively. Cases Where Shape Outperforms Texture are Marked in Red and Cases Where Texture Outperforms Shape are Marked in Green.

OU-IQA Methods	LIVE			CSIQ			TID2013			KADID			MDID			MDIVL
OU-IQA Methods	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE
NIQE	0.902	0.906	11.779	0.693	0.619	0.189	0.398	0.311	1.137	0.438	0.378	0.974	0.670	0.649	1.636	0.557	0.566	19.829
QAC	0.807	0.868	27.322	0.593	0.480	0.263	0.437	0.372	1.115	0.390	0.239	0.997	0.489	0.324	1.922	0.571	0.552	19.604
PIQUE	0.836	0.840	14.994	0.675	0.512	0.194	0.525	0.364	1.055	0.373	0.237	1.005	0.333	0.253	2.078	0.504	0.492	20.625
LPSI	0.786	0.818	27.322	0.695	0.522	0.263	0.489	0.395	1.081	0.367	0.148	1.007	0.354	0.031	2.061	0.559	0.574	19.808
ILNIQE	0.897	0.898	12.072	0.837	0.805	0.144	0.588	0.494	1.003	0.575	0.541	0.886	0.724	0.690	1.519	0.614	0.624	18.845
dipIQ	0.933	0.938	9.823	0.742	0.519	0.176	0.477	0.438	1.090	0.399	0.298	0.993	0.674	0.661	1.628	0.761	0.713	15.498
SNP-NIQE	0.899	0.907	11.962	0.702	0.609	0.187	0.432	0.333	1.118	0.442	0.372	0.971	0.750	0.726	1.456	0.626	0.625	18.620
NPQI	0.915	0.911	11.018	0.725	0.634	0.181	0.454	0.281	1.104	0.450	0.391	0.967	0.731	0.698	1.503	0.594	0.614	19.214
ContentSep	0.741	0.748	18.362	0.363	0.587	0.245	0.221	0.253	1.209	0.464	0.506	0.960	0.329	0.403	1.987	0.239	0.164	23.188
DSS-VGG16 (Ours)	0.828	0.840	15.325	0.747	0.717	0.175	0.574	0.443	1.016	0.576	0.561	0.885	0.742	0.712	1.476	0.577	0.567	19.567
DTS-VGG16 (Ours)	0.893	0.896	12.314	0.762	0.711	0.170	0.563	0.431	1.024	0.546	0.527	0.907	0.765	0.742	1.419	0.729	0.722	16.478
DSTS-VGG16 (Ours)	0.838	0.881	13.303	0.765	0.720	0.169	0.588	0.445	1.003	0.573	0.552	0.888	0.768	0.751	1.412	0.667	0.664	17.803
\cdashline1-19 DSS-ResNet50 (Ours)	0.859	0.869	14.006	0.774	0.743	0.169	0.569	0.473	0.993	0.588	0.567	0.875	0.716	0.697	1.534	0.632	0.628	18.504
DTS-ResNet50 (Ours)	0.872	0.879	13.358	0.770	0.732	0.167	0.542	0.456	1.042	0.559	0.542	0.878	0.744	0.728	1.472	0.783	0.781	14.849
DSTS-ResNet50 (Ours)	0.868	0.878	13.586	0.775	0.742	0.166	0.593	0.476	0.999	0.596	0.574	0.869	0.753	0.739	1.449	0.694	0.693	17.189
\cdashline1-19 DSS (Ours)	0.868	0.880	13.560	0.787	0.770	0.162	0.622	0.511	0.971	0.594	0.567	0.871	0.773	0.747	1.398	0.641	0.643	18.332
DTS (Ours)	0.932	0.936	9.890	0.791	0.777	0.161	0.624	0.536	0.969	0.594	0.598	0.871	0.807	0.758	1.300	0.792	0.787	14.578
DSTS (Ours)	0.931	0.935	9.954	0.823	0.798	0.155	0.610	0.538	0.936	0.637	0.622	0.820	0.806	0.795	1.304	0.799	0.803	14.137

TABLE II: Quality Prediction Performance Comparisons on Individual Database Containing Authentic Distortions. The First and Second Best Performance are Highlighted in Bold and Underline Respectively. Cases Where Shape Outperforms Texture are Marked in Red and Cases Where Texture Outperforms Shape are Marked in Green.

OU-IQA Methods	KonIQ			CLIVE			BID			CID2013			SPAQ
OU-IQA Methods	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE	PLCC	SRCC	RMSE
NIQE	0.538	0.530	0.466	0.494	0.450	17.648	0.461	0.458	1.111	0.670	0.659	16.798	0.712	0.703	14.675
QAC	0.291	0.340	0.552	0.211	0.046	19.839	0.290	0.300	1.252	0.138	0.030	22.424	0.023	0.092	20.901
PIQUE	0.296	0.245	0.528	0.279	0.108	19.490	0.112	0.042	1.244	0.092	0.045	22.547	0.263	0.097	20.165
LPSI	0.106	0.224	0.552	0.299	0.083	19.368	0.099	0.043	1.246	0.436	0.323	20.376	0.274	0.068	20.103
ILNIQE	0.531	0.506	0.468	0.503	0.439	17.538	0.507	0.494	1.079	0.427	0.306	20.476	0.721	0.714	14.491
dipIQ	0.435	0.238	0.497	0.304	0.177	19.338	0.218	0.019	1.222	0.355	0.103	21.170	0.497	0.388	18.134
SNP-NIQE	0.640	0.628	0.424	0.520	0.465	17.343	0.437	0.425	1.126	0.726	0.716	15.570	0.746	0.739	13.912
NPQI	0.615	0.612	0.436	0.492	0.475	17.673	0.460	0.468	1.112	0.777	0.770	14.252	0.675	0.674	15.419
ContentSep	0.628	0.388	0.640	0.523	0.506	17.131	0.426	0.411	1.165	0.632	0.611	17.567	0.711	0.708	14.699
DSS-VGG16 (Ours)	0.720	0.712	0.384	0.519	0.465	17.351	0.457	0.429	1.114	0.808	0.799	13.339	0.787	0.784	12.902
DTS-VGG16 (Ours)	0.739	0.729	0.372	0.473	0.413	17.888	0.518	0.486	1.071	0.836	0.827	12.439	0.772	0.767	13.086
DSTS-VGG16 (Ours)	0.743	0.730	0.370	0.490	0.433	17.693	0.484	0.455	1.095	0.832	0.823	12.572	0.792	0.789	12.757
\cdashline1-16 DSS-ResNet50 (Ours)	0.666	0.669	0.412	0.459	0.409	18.029	0.396	0.354	1.149	0.856	0.846	11.080	0.760	0.758	13.577
DTS-ResNet50 (Ours)	0.641	0.636	0.424	0.392	0.339	18.676	0.467	0.430	1.107	0.852	0.846	11.872	0.735	0.732	14.176
DSTS-ResNet (Ours)	0.670	0.672	0.410	0.446	0.392	18.168	0.429	0.389	1.131	0.852	0.847	11.865	0.764	0.761	13.496
\cdashline1-16 DSS (Ours)	0.673	0.676	0.408	0.506	0.457	17.509	0.479	0.485	1.099	0.793	0.779	13.792	0.789	0.786	12.839
DTS (Ours)	0.712	0.733	0.388	0.536	0.482	17.131	0.579	0.565	1.022	0.869	0.853	11.218	0.747	0.741	13.889
DSTS (Ours)	0.752	0.753	0.370	0.539	0.483	17.101	0.577	0.565	1.023	0.875	0.860	11.108	0.806	0.801	12.383

IV-B Implementation Details

Backbone training To obtain the shape-texture oriented image statistics, we first train the feature extraction backbone EfficientNet-b7 [52] for the image classification task on the stylized-ImageNet, which is generated following the settings in [30]. The stylized-ImageNet is ImageNet processed with style transfer algorithms, which is believed as a way of eliminating texture information in images. Additionally, we also train the shape-biased VGG16 [75] and ResNet50 [76] for comparison. We use an initial learning rate 0.01 for the EfficientNet, and 0.1 for VGG16 and ResNet50, using the exponential decay strategy with the decay parameter of 0.1 for every 10 epochs. By default, the training batch size is set to 256, the weight decay is set to 5e-6, and the momentum for stochastic gradient descent (SGD) optimization is 0.9. For the texture-biased backbones, we directly adopt the VGG16, ResNet50, and EfficientNet-b7 pretrained on ImageNet [77].

Implementation of DSTS The convolutional responses of the second to the sixth stages are employed to extract image statistics, which have 32, 48, 80, 160, and 224 channels, respectively. The image shape-texture outer statistics are determined based on a subcollection with 500 images from the DIV2K [78] database, which are fixed and subsequently applied to all the testing databases. The filtering threshold $\tau$ of the structure indicator is set to the average value of $\bm{\tilde{I}}_{\sigma}$ . All models are constructed with PyTorch on a machine equipped with six NVIDIA Geforce RTX 4090 GPUs. It is worth noting that there is no content overlapping between the pristine image set and testing databases unless stated otherwise.

IV-C Performance Evaluation on Quality Prediction

We compare the performance of DSTS along with its shape component DSS and texture component DTS against nine existing OU-NRIQA methods that have public-available source codes, including NIQE [11], quality-aware clustering (QAC) [39], perception-based image quality evaluator (PIQUE) [40], LPSI [42], ILNIQE [12], dipIQ [17], SNP-NIQE [13], NPQI [55], and ContentSep [14]. Besides, we also compare the generalization ability with five popular OA-BIQA models under the cross-database setting, including PaQ2PiQ [79], HyperIQA [80], MANIQA [81], VCRNet [82], and MUSIQ [83]. For performance comparisons, we follow the official codes and parameter settings described in the papers.

IV-C1 Performance on Each Individual Database

TABLE III: Quality Prediction Results in Terms Of SRCC on Each Individual Distortion Type. The DSTS Performance is Highlighted in Purple, and the First and Second Best Performance are Highlighted in Bold and Underline Respectively.

Database	Distortion Type	NIQE	QAC	PIQUE	LPSI	ILNIQE	dipIQ	SNP-NIQE	NPQI	DSS	DTS	DSTS
TID2013	Additive Gaussian Noise	0.819	0.743	0.856	0.769	0.877	0.865	0.886	0.626	0.731	0.850	0.852
	Additive Noise in Color Components	0.670	0.718	0.758	0.496	0.816	0.769	0.732	0.297	0.679	0.738	0.741
	Spatially Correlated Noise	0.666	0.169	0.293	0.697	0.923	0.809	0.651	-0.233	0.782	0.815	0.817
	Masked Noise	0.746	0.704	0.593	0.046	0.513	0.725	0.738	0.662	0.569	0.649	0.645
	High Frequency Noise	0.845	0.863	0.892	0.925	0.869	0.864	0.873	0.821	0.819	0.888	0.886
	Impulse Noise	0.744	0.792	0.800	0.432	0.756	0.788	0.801	0.575	0.652	0.770	0.776
	Quantization Noise	0.850	0.709	0.751	0.854	0.871	0.799	0.857	0.773	0.668	0.870	0.872
	Gaussian Blur	0.797	0.846	0.828	0.841	0.815	0.905	0.863	0.759	0.866	0.861	0.863
	Image Denoising	0.590	0.338	0.644	-0.249	0.749	0.069	0.612	0.641	0.851	0.875	0.878
	JPEG Compression	0.843	0.837	0.793	0.912	0.834	0.912	0.878	0.847	0.833	0.895	0.895
	JPEG2000 Compression	0.889	0.790	0.854	0.899	0.858	0.919	0.881	0.851	0.914	0.933	0.931
	JPEG Transmission Errors	0.000	0.049	0.061	0.091	0.282	0.709	0.282	-0.031	0.510	0.424	0.427
	JPEG2000 Transmission Errors	0.511	0.407	0.113	0.611	0.524	0.369	0.592	-0.298	0.606	0.526	0.536
	Non Eccentricity Pattern Noise	-0.069	0.048	-0.010	0.052	-0.081	0.371	0.017	-0.025	-0.060	-0.016	-0.014
	Local Block-wise Distortions	-0.131	0.247	0.178	0.137	-0.132	0.291	-0.037	-0.072	0.028	0.089	0.101
	Mean Shift	-0.163	0.306	0.271	0.341	0.184	0.084	-0.122	-0.090	0.082	0.182	0.185
	Contrast Change	-0.017	-0.207	-0.072	0.199	0.014	-0.145	0.154	0.463	0.351	0.284	0.278
	Change of Color Saturation	-0.246	0.368	0.268	0.302	-0.165	0.068	-0.107	-0.346	0.283	0.511	0.508
	Multiplicative Gaussian Noise	0.693	0.790	0.732	0.696	0.694	0.788	0.741	0.396	0.583	0.758	0.762
	Comfort Noise	0.154	-0.152	-0.133	0.018	0.361	0.359	0.208	-0.345	0.436	0.299	0.301
	Lossy Compression of Noisy Images	0.803	0.640	0.637	0.236	0.829	0.851	0.831	0.373	0.800	0.846	0.849
	Color Quantization with Dither	0.783	0.873	0.812	0.900	0.750	0.756	0.789	0.756	0.609	0.804	0.801
	Chromatic Aberrations	0.562	0.625	0.676	0.695	0.679	0.700	0.634	0.534	0.854	0.718	0.723
	Sparse Sampling and Reconstruction	0.834	0.786	0.823	0.862	0.864	0.761	0.828	0.825	0.880	0.920	0.923
CSIQ	Additive Gaussian Noise	0.811	0.823	0.901	0.664	0.851	0.902	0.876	0.714	0.805	0.831	0.829
	Gaussian Blur	0.875	0.819	0.843	0.880	0.830	0.900	0.897	0.864	0.876	0.873	0.875
	Global Contrast Decrements	0.239	-0.250	0.093	0.543	0.518	-0.155	0.435	0.668	0.611	0.496	0.507
	Additive Pink Gaussian Noise	0.299	-0.004	0.120	0.247	0.877	-0.165	0.260	0.005	0.729	0.785	0.785
	JPEG Compression	0.862	0.878	0.824	0.927	0.876	0.925	0.906	0.887	0.883	0.912	0.913
	JPEG2000 Compression	0.894	0.865	0.848	0.900	0.894	0.936	0.896	0.877	0.910	0.934	0.934
LIVE	JPEG Compression	0.942	0.936	0.908	0.968	0.942	0.969	0.970	0.948	0.940	0.964	0.963
	JPEG2000 Compression	0.919	0.862	0.902	0.930	0.894	0.954	0.918	0.888	0.926	0.957	0.956
	White Noise	0.972	0.951	0.985	0.956	0.981	0.974	0.978	0.960	0.969	0.978	0.978
	Gaussian Blur	0.933	0.913	0.909	0.916	0.915	0.934	0.951	0.911	0.943	0.949	0.951
	Fast-fading	0.863	0.823	0.785	0.781	0.833	0.860	0.850	0.837	0.873	0.908	0.906
MDIVL	Blur + JPEG	0.760	0.554	0.658	0.735	0.791	0.651	0.809	0.774	0.778	0.804	0.806
MDIVL	Gaussian Noise + JPEG	0.444	0.528	0.401	0.469	0.580	0.773	0.552	0.537	0.659	0.778	0.776

We compare the performance of DSTS with nine OU-BIQA models on IQA databases that contain artificial and authentic distortions respectively. The comparisons in terms of SRCC, PLCC and RMSE are included in Table I and II. It can be observed from the results that DSTS obtains SOTA performance on eight out of eleven databases and achieves second-best performance on the other two databases, which indicates a competitive performance on both synthetic and authentic distortions. Besides, one can observe that DSTS shows great superiority on all the databases with authentic distortions, indicating the promising capability of capturing authentic distortions. In addition, we provide the illustration of the linear correlation between the predicted DSTS values and objective quality scores in SM, revealing DSTS possesses promising quality prediction accuracy. Furthermore, we find out that the performance of statistical comparison incorporating shape and texture bias preponderate on different databases, and a unified shape-texture statistical representation can achieve a stable and superior performance on all the databases. These experimental results demonstrate that even without the assistance of reference images and quality opinions, properly designed OU-BIQA models can still achieve a promising performance. In Fig. 4, we provide the differential mean opinion scores (DMOS) and predicted DSTS scores of eight distorted images corrupted from two reference images. One can observe the disturbance on image inner statistics becomes severer with the decrease of image quality, and DSTS increases correspondingly. This reveals that DSTS is able to measure perceptual quality effectively even in a completely-blind condition.

IV-C2 Performance on Each Individual Distortion Type

To better investigate DSTS performance on specific distortions, we conduct comparisons on LIVE, CSIQ, TID2013 and MDIVL. The results are listed in Table III. We can observe that regarding commonly encountered distortions such as “Gaussian Noise”, “JPEG/JP2K Compression”, “Gaussian Blur”, and “Fast-fading”, DSTS delivers promising results. Especially, for the “Image Denoising”, “Sparse Sampling and Reconstruction”, and “Additive Pink Gaussian Noise” subsets, none of the compared models can achieve acceptable performance except for DSTS. This reveals the proposed DSTS is able to generalize to different distortion types because it is not dedicatedly designed for any specific distortions.

IV-C3 Performance on Generative Distortions

Distortions sourced from generative models possess unique characters that are challenging for current IQA models [84]. To testify the effectiveness of DSTS on generative distortions, we compare its performance with NIQE and IL-NIQE on three GAN image IQA databases. Moreover, we also focus on the different utility of texture-biased and shape-biased statistics respectively, to obtain insights of the effectiveness of biased statistics on generative distortions. The results listed in Table IV show that the proposed DSTS achieves top-tier performance on generative distortion. Besides, we observe that shape-biased statistics contribute more to quality assessment in most cases, indicating that generative distortions disturb more shape-relevant image cues compared with texture-relevant ones.

TABLE IV: Quality Prediction Performance Comparisons on IQA Databases with Generative Distortions. The DSTS Performance is Highlighted in Purple. The Best Performer is highlighted in Bold. Cases Where Shape Outperforms Texture are Marked in Red and Cases Where Texture Outperforms Shape are Marked in Green.

OU-BIQA	LGIQA			GFIQA-20k	AGIQA-3k
OU-BIQA	city	cat	ffhq	GFIQA-20k	AGIQA-3k
NIQE	0.579	0.493	0.371	0.501	0.534
IL-NIQE	0.827	0.425	0.557	0.714	0.594
DSS-VGG16 (Ours)	0.831	0.448	0.482	0.778	0.688
DTS-VGG16 (Ours)	0.824	0.461	0.410	0.773	0.665
\cdashline1-6 DSS-ResNet50 (Ours)	0.826	0.435	0.458	0.754	0.653
DTS-ResNet50 (Ours)	0.808	0.449	0.384	0.705	0.604
\cdashline1-6 DSS	0.825	0.483	0.465	0.865	0.707
DTS	0.760	0.540	0.432	0.833	0.675
DSTS	0.810	0.542	0.479	0.843	0.693

IV-C4 Generalization Ability

We compare the proposed DSTS with five OA-BIQA methods to demonstrate our superiority on generalization ability. The results are listed in Table V. We observe that DSTS achieves SOTA in most cases without dataset-specific adjustment, indicating it possesses a high generalization ability even compared with OA methods. Such ability is enriched by the shape-texture statistical representations, which can benefit IQA in more universal applications.

TABLE V: Quality Prediction Performance Comparisons With OA-BIQA Methods In Terms of SRCC.The DSTS Performance is Highlighted in Purple. The First and Second Best Performer is highlighted in Bold and Underline.

Method	LIVE	CSIQ	TID2013	KADID	MDIVL	KonIQ	CLIVE
PaQ2PiQ	0.479	0.564	0.401	0.383	0.536	0.721	0.718
HyperIQA	0.755	0.581	0.384	0.468	0.617	-	0.761
MANIQA	0.779	0.662	0.451	0.438	0.511	-	0.840
VCRNet	-	0.681	0.512	0.444	0.475	0.606	0.557
MUSIQ	0.734	0.588	0.474	0.464	0.592	-	0.722
DSS	0.880	0.770	0.511	0.567	0.643	0.676	0.457
DTS	0.936	0.777	0.536	0.598	0.787	0.733	0.482
DSTS	0.935	0.798	0.538	0.622	0.803	0.753	0.483

V Applications on Personalized Blind Image Quality Assessment

The proposed DSTS possesses the significant capability of capturing quality characteristics of images, by unifying a shape-texture statistical description. Upon this specialty, we further extend the DSTS to the application of personalized blind image quality prediction. As shown in Fig. 6, the uniqueness of IQA personalization is that it measures an individual’s quality preference instead of the average perception characterized by MOS or DMOS. In application, for a specific user, DSTS can conduct personalized quality prediction of distorted images based on the user-tailored outer statistics.

TABLE VI: Comparison Results In Terms Of Average SRCC Across Images Rated By Each Subject On The Kadid Database.

Method		M>10	M>100
OA-BIQA	BRISQUE	0.088 $\pm$ 0.004	0.102 $\pm$ 0.003
	CORNIA	0.094 $\pm$ 0.005	0.101 $\pm$ 0.007
	CLIP-IQA	0.171 $\pm$ 0.015	0.181 $\pm$ 0.020
OU-BIQA	NIQE	0.108 $\pm$ 0.002	0.101 $\pm$ 0.013
	IL-NIQE	0.123 $\pm$ 0.007	0.124 $\pm$ 0.012
	DSTS	0.352 $\pm$ 0.003	0.398 $\pm$ 0.010

More specifically, we conduct experiments on the KADID database [64], because it contains raw quality ratings from each subject of each distorted image aside from DMOSs. We consider the images labeled by the same user as a set. To achieve personalized quality assessment with DSTS, the personalized outer statistics are derived from the highest-rated images in each set, and 10 other images rated by the same subject are sampled as the test set. The performance is evaluated by the rank consistency SRCC between quality ratings and DSTS values for each subject, and the final SRCC for each IQA model is acquired by averaging the SRCCs of all subjects. Since DSTS requires a reasonable number of pristine images to describe the natural image domain statistics, we only consider the subjects with the number of highest-rated images $M\geq 10$ . Moreover, the results for the case $M>100$ are also provided for comparison. We compare DSTS with three OA-BIQA methods BRISQUE [4], CORNIA [7], CLIP-IQA [85], and two OU-BIQA models NIQE [11] and IL-NIQE [12]. The results are summarized in Table VI. We also show distorted images rated by the same subject together with the quality ratings and predicted DSTS values in Fig. 5. It is apparent that DSTS successfully achieves higher personalized rank correlation with individual subject’s ratings and shows superiority over other OA and OU methods, which manifests DSTS’s promising capability on the application of personalized BIQA.

VI Conclusions

In this paper, we quest the capability of deep features produced by DNNs for quality prediction by establishing a unified shape-texture statistical representation. Based on this representation, we further build an OU-BIQA model DSTS by formulating the inner and outer statistics on the basis of adaptively-merged shape and texture deep features. By measuring the statistical distance between the inner and outer shape-texture statistics, the visual quality of distorted images can be assessed effectively without any assistant from reference images and MOSs in training. Extensive experiments provide useful evidence that DSTS’s quality predictions closely align with human perception, and generalize well on both artificial and authentic distortions. It also shows superiority on personalized image quality preference prediction. The promising performance of DSTS indicates the prospective for OU-BIQA methods, and might assist to reduce the over-dependence on human-labeled data for blind image quality assessment.

References

[1] A. K. Moorthy and A. C. Bovik, “A two-step framework for constructing blind image quality indices,” IEEE Signal processing letters, vol. 17, no. 5, pp. 513–516, 2010.
[2] ——, “Blind image quality assessment: From natural scene statistics to perceptual quality,” IEEE transactions on Image Processing, vol. 20, no. 12, pp. 3350–3364, 2011.
[3] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” IEEE transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
[4] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on image processing, vol. 21, no. 12, pp. 4695–4708, 2012.
[5] H. Tang, N. Joshi, and A. Kapoor, “Learning a blind measure of perceptual image quality,” in CVPR 2011. IEEE, 2011, pp. 305–312.
[6] C. Li, A. C. Bovik, and X. Wu, “Blind image quality assessment using a general regression neural network,” IEEE Transactions on neural networks, vol. 22, no. 5, pp. 793–799, 2011.
[7] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 1098–1105.
[8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
[9] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE.” Journal of machine learning research, vol. 9, no. 11, 2008.
[10] A. Mittal, G. S. Muralidhar, J. Ghosh, and A. C. Bovik, “Blind image quality assessment without human training using latent quality factors,” IEEE Signal Processing Letters, vol. 19, no. 2, pp. 75–78, 2011.
[11] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, pp. 209–212, 2012.
[12] L. Zhang, L. Zhang, and A. C. Bovik, “A feature-enriched completely blind image quality evaluator,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, 2015.
[13] Y. Liu, K. Gu, Y. Zhang, X. Li, G. Zhai, D. Zhao, and W. Gao, “Unsupervised blind image quality evaluation via statistical measurements of structure, naturalness, and perception,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 4, pp. 929–943, 2019.
[14] N. C. Babu, V. Kannan, and R. Soundararajan, “No reference opinion unaware quality assessment of authentically distorted images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2459–2468.
[15] A. Shukla, A. Upadhyay, S. Bhugra, and M. Sharma, “Opinion unaware image quality assessment via adversarial convolutional variational autoencoder,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2153–2163.
[16] Z. Wang, Z.-R. Tang, J. Zhang, and Y. Fang, “Toward a blind image quality evaluator in the wild by learning beyond human opinion scores,” Pattern Recognition, vol. 137, p. 109296, 2023.
[17] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao, “dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3951–3964, 2017.
[18] P. Ye, J. Kumar, and D. Doermann, “Beyond human opinion scores: Blind image quality assessment based on synthetic scores,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4241–4248.
[19] Y. Zhu, H. Ma, J. Peng, D. Liu, and Z. Xiong, “Recycling discriminator: Towards opinion-unaware image quality assessment using wasserstein gan,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 116–125.
[20] D. L. Ruderman, “The statistics of natural images,” Network: computation in neural systems, vol. 5, no. 4, p. 517, 1994.
[21] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu, “On advances in statistical modeling of natural images,” Journal of mathematical imaging and vision, vol. 18, pp. 17–33, 2003.
[22] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual review of neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
[23] J.-M. Geusebroek and A. W. Smeulders, “A six-stimulus theory for stochastic texture,” International Journal of Computer Vision, vol. 62, pp. 7–16, 2005.
[24] S. A. Amirshahi, M. Pedersen, and S. X. Yu, “Image quality assessment by comparing CNN features between images,” Journal of Imaging Science and Technology, vol. 60, 2016.
[25] F. Gao, Y. Wang, P. Li, M. Tan, J. Yu, and Y. Zhu, “DeepSim: Deep similarity for image quality assessment,” Neurocomputing, vol. 257, pp. 104–114, 2017.
[26] A. Berardino, V. Laparra, J. Ballé, and E. Simoncelli, “Eigen-distortions of hierarchical representations,” Advances in neural information processing systems, vol. 30, 2017.
[27] J. Kim and S. Lee, “Deep learning of human visual sensitivity in image quality assessment framework,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1676–1684.
[28] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
[29] X. Liao, B. Chen, H. Zhu, S. Wang, M. Zhou, and S. Kwong, “DeepWSD: Projecting degradations in perceptual space to wasserstein distance in deep feature space,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 970–978.
[30] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,” arXiv preprint arXiv:1811.12231, 2018.
[31] M. A. Islam, M. Kowal, P. Esser, S. Jia, B. Ommer, K. G. Derpanis, and N. Bruce, “Shape or texture: Understanding discriminative features in CNNs,” arXiv preprint arXiv:2101.11604, 2021.
[32] C. K. Mummadi, R. Subramaniam, R. Hutmacher, J. Vitay, V. Fischer, and J. H. Metzen, “Does enhanced shape bias improve neural network robustness to common corruptions?” arXiv preprint arXiv:2104.09789, 2021.
[33] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan, and M.-H. Yang, “Intriguing properties of vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 23 296–23 308, 2021.
[34] B. Landau, L. B. Smith, and S. S. Jones, “The importance of shape in early lexical learning,” Cognitive development, vol. 3, no. 3, pp. 299–321, 1988.
[35] S. C. Kucker, L. K. Samuelson, L. K. Perry, H. Yoshida, E. Colunga, M. G. Lorenz, and L. B. Smith, “Reproducibility and a unifying explanation: Lessons from the shape bias,” Infant Behavior and Development, vol. 54, pp. 156–165, 2019.
[36] Y. Zhang, W. Lin, Q. Li, W. Cheng, and X. Zhang, “Multiple-level feature-based measure for retargeted image quality,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 451–463, 2017.
[37] L. Li, Y. Li, J. Wu, L. Ma, and Y. Fang, “Quality evaluation for image retargeting with instance semantics,” IEEE Transactions on Multimedia, vol. 23, pp. 2757–2769, 2020.
[38] K. Ding, Y. Liu, X. Zou, S. Wang, and K. Ma, “Locally adaptive structure and texture similarity for image quality assessment,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2483–2491.
[39] W. Xue, L. Zhang, and X. Mou, “Learning without human scores for blind image quality assessment,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 995–1002.
[40] N. Venkatanath, D. Praneeth, M. C. Bh, S. S. Channappayya, and S. S. Medasani, “Blind image quality evaluation using perception based features,” in 2015 twenty first national conference on communications. IEEE, 2015, pp. 1–6.
[41] X. Liu, J. Van De Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1040–1049.
[42] Q. Wu, Z. Wang, and H. Li, “A highly efficient method for blind image quality assessment,” in 2015 IEEE International Conference on Image Processing. IEEE, 2015, pp. 339–343.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
[45] X. Qiu, M. Kan, Y. Zhou, Y. Bi, and S. Shan, “Shape-biased cnns are not always superior in out-of-distribution robustness,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2326–2335.
[46] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 3730–3738.
[47] J. Zhang, D. Zhao, R. Xiong, S. Ma, and W. Gao, “Image restoration using joint statistical modeling in a space-transform domain,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 915–928, 2014.
[48] T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN: Learning a generative model from a single natural image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4570–4580.
[49] A. Shocher, S. Bagon, P. Isola, and M. Irani, “InGAN: Capturing and retargeting the" DNA" of a natural image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4492–4501.
[50] E. Heitz, K. Vanhoey, T. Chambon, and L. Belcour, “A sliced wasserstein loss for neural texture synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9412–9420.
[51] I. N. Bronstein, J. Hromkovic, B. Luderer, H.-R. Schwarz, J. Blath, A. Schied, S. Dempe, G. Wanka, and S. Gottwald, Taschenbuch der mathematik. Springer-Verlag, 2012, vol. 1.
[52] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
[53] I. Kligvasser, T. Shaham, Y. Bahat, and T. Michaeli, “Deep self-dissimilarities as powerful visual fingerprints,” Advances in Neural Information Processing Systems, vol. 34, pp. 3939–3951, 2021.
[54] B. Julesz, “Visual pattern discrimination,” IRE transactions on Information Theory, vol. 8, no. 2, pp. 84–92, 1962.
[55] Y. Liu, K. Gu, X. Li, and Y. Zhang, “Blind image quality assessment by natural scene statistics and perceptual characteristics,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 3, pp. 1–91, 2020.
[56] Z. Duanmu, W. Liu, Z. Wang, and Z. Wang, “Quantifying visual image quality: A bayesian view,” Annual Review of Vision Science, vol. 7, pp. 437–464, 2021.
[57] L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” Advances in neural information processing systems, vol. 28, 2015.
[58] J. Rabin, G. Peyré, J. Delon, and M. Bernot, “Wasserstein barycenter and its application to texture mixing,” in Scale Space and Variational Methods in Computer Vision: Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29–June 2, 2011, Revised Selected Papers 3. Springer, 2012, pp. 435–446.
[59] N. Kolkin, J. Salavon, and G. Shakhnarovich, “Style transfer by relaxed optimal transport and self-similarity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 051–10 060.
[60] H. Zhu, B. Chen, L. Zhu, S. Wang, and W. Lin, “DeepDC: Deep distance correlation as a perceptual image quality evaluator,” arXiv preprint arXiv:2211.04927, 2022.
[61] H. R. Sheikh, “Image and video quality assessment research at LIVE,” http://live. ece. utexas. edu/research/quality/, 2003.
[62] E. C. Larson and D. M. Chandler, “Most apparent distortion: full-reference image quality assessment and the role of strategy,” Journal of electronic imaging, vol. 19, no. 1, pp. 011 006–011 006, 2010.
[63] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti et al., “Image database TID2013: Peculiarities, results and perspectives,” Signal processing: Image communication, vol. 30, pp. 57–77, 2015.
[64] H. Lin, V. Hosu, and D. Saupe, “KADID-10k: A large-scale artificially distorted iqa database,” in 2019 Eleventh International Conference on Quality of Multimedia Experience. IEEE, 2019, pp. 1–3.
[65] S. Corchs, F. Gasparini, and R. Schettini, “No reference image quality classification for jpeg-distorted images,” Digital Signal Processing, vol. 30, pp. 86–100, 2014.
[66] W. Sun, F. Zhou, and Q. Liao, “MDID: A multiply distorted image database for image quality assessment,” Pattern Recognition, vol. 61, pp. 153–168, 2017.
[67] V. Hosu, H. Lin, T. Sziranyi, and D. Saupe, “KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 4041–4056, 2020.
[68] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2015.
[69] A. Ciancio, E. A. da Silva, A. Said, R. Samadani, P. Obrador et al., “No-reference blur assessment of digital pictures based on multifeature classifiers,” IEEE Transactions on image processing, vol. 20, no. 1, pp. 64–75, 2010.
[70] T. Virtanen, M. Nuutinen, M. Vaahteranoksa, P. Oittinen, and J. Häkkinen, “CID2013: A database for evaluating no-reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 390–402, 2014.
[71] Y. Fang, H. Zhu, Y. Zeng, K. Ma, and Z. Wang, “Perceptual quality assessment of smartphone photography,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3677–3686.
[72] C. Li, Z. Zhang, H. Wu, W. Sun, X. Min, X. Liu, G. Zhai, and W. Lin, “AGIQA-3K: An open database for ai-generated image quality assessment,” arXiv preprint arXiv:2306.04717, 2023.
[73] S. Su, H. Lin, V. Hosu, O. Wiedemann, J. Sun, Y. Zhu, H. Liu, Y. Zhang, and D. Saupe, “Going the extra mile in face image quality assessment: A novel database and model,” IEEE Transactions on Multimedia, 2023.
[74] S. Gu, J. Bao, D. Chen, and F. Wen, “GIQA: Generated image quality assessment,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 369–385.
[75] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[76] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[77] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015.
[78] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 126–135.
[79] Z. Ying, H. Niu, P. Gupta, D. Mahajan, D. Ghadiyaram, and A. Bovik, “From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3575–3585.
[80] S. Su, Q. Yan, Y. Zhu, C. Zhang, X. Ge, J. Sun, and Y. Zhang, “Blindly assess image quality in the wild guided by a self-adaptive hyper network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3667–3676.
[81] S. Yang, T. Wu, S. Shi, S. Lao, Y. Gong, M. Cao, J. Wang, and Y. Yang, “MANIQA: Multi-dimension attention network for no-reference image quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1191–1200.
[82] Z. Pan, F. Yuan, J. Lei, Y. Fang, X. Shao, and S. Kwong, “VCRNet: Visual compensation restoration network for no-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 31, pp. 1613–1627, 2022.
[83] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang, “MUSIQ: Multi-scale image quality transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5148–5157.
[84] G. Jinjin, C. Haoming, C. Haoyu, Y. Xiaoxing, J. S. Ren, and D. Chao, “PIPAL: a large-scale image quality assessment dataset for perceptual image restoration,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 633–651.
[85] J. Wang, K. C. Chan, and C. C. Loy, “Exploring CLIP for assessing the look and feel of images,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2555–2563.