ASD: Towards Attribute Spatial Decomposition for Prior-Free Facial Attribute Recognition

Chuanfei Hu \IEEEmembershipStudent Member, IEEE Hang Shao Bo Dong Zhe Wang and Yongxiong Wang \IEEEmembershipMember, IEEE Chuanfei Hu, Zhe Wang and Yongxiong Wang are with School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China (e-mail: chuanfei_hu@ieee.org, 201440049@st.usst.edu.cn, wyxiong@usst.edu.cn).Hang Shao is with School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China (e-mail: shaohang@njust.edu.cn).Bo Dong is with Center for Brain Imaging Science and Technology, Zhejiang University, Zhejiang, China (e-mail: bodong.cv@gmail.com).This paragraph will include the Associate Editor who handled your paper.

Abstract

Representing the spatial properties of facial attributes is a vital challenge for facial attribute recognition (FAR). Recent advances have achieved the reliable performances for FAR, benefiting from the description of spatial properties via extra prior information. However, the extra prior information might not be always available, resulting in the restricted application scenario of the prior-based methods. Meanwhile, the spatial ambiguity of facial attributes caused by inherent spatial diversities of facial parts is ignored. To address these issues, we propose a prior-free method for attribute spatial decomposition (ASD), mitigating the spatial ambiguity of facial attributes without any extra prior information. Specifically, assignment-embedding module (AEM) is proposed to enable the procedure of ASD, which consists of two operations: attribute-to-location assignment and location-to-attribute embedding. The attribute-to-location assignment first decomposes the feature map based on latent factors, assigning the magnitude of attribute components on each spatial location. Then, the assigned attribute components from all locations to represent the global-level attribute embeddings. Furthermore, correlation matrix minimization (CMM) is introduced to enlarge the discriminability of attribute embeddings. Experimental results demonstrate the superiority of ASD compared with state-of-the-art prior-based methods, while the reliable performance of ASD for the case of limited training data is further validated.

{IEEEImpStatement}

Facial attribute recognition (FAR) can label the facial attributes of a person from a facial image, which has been widely applied to numerous real-world applications, such as facial verification and identification, face generation, and face retrieval. The existing FAR methods introduce some prior information to achieve the satisfactory performances. The prior information might not be always available, resulting in the restricted application scenario of these methods. The proposed attribute spatial decomposition (ASD) can construct a prior-free FAR method, which achieves superior performance without any prior information. The experimental results verify the effect of ASD, which could fill the blank of prior-free method in FAR task.

{IEEEkeywords}

Facial attribute recognition, deep learning, attribute decomposition, prior-free method.

1 Introduction

\IEEEPARstart

Facial attribute recognition (FAR) aims to predict multiple biometrics from a given facial image, such as gender, mustache, lip thickness, and hair color, which has be widely contributed to numerous real-world applications, e.g., facial verification and identification [1, 2, 3, 4], face generation [5, 6], and face retrieval [7, 8]. However, FAR is still an open issue in the wild, since facial appearance is variable significantly caused by complicated factors of capturing, such as illumination, pose, and occlusion.

Refer to caption — Figure 1: The spatial ambiguity of facial attributes from the statistical observation. We first plot the coordinates of the facial landmarks, including nose, left eye, right eye, left mouth, and right mouth for all individuals in CelebA [9]. Meanwhile, we further construct the convex hulls of the corresponding coordinate sets. The obvious overlapping regions among the convex hulls illustrate that a spatial location might contain the various facial attributes for different individuals, revealing the spatial ambiguity of facial attributes.

Recent efforts have been made towards improving deep learning-based FAR methods [10, 11, 12, 13, 14, 15, 16] via additional annotations, which can be seen as an insight of modeling the spatial properties of facial attributes via extra prior information. These prior-based methods can be categorized as the explicit and implicit prior information-based. The explicit prior information-based methods introduce additional annotations, such as facial landmarks [13], identifications [11, 15], and face parsing masks [12, 15], to construct a multi-task learning or self-supervised learning, in which the additional annotations can be regarded as the prior information to enhance the discriminability between the features of facial attributes. On the other hand, the implicit prior information-based methods design a multi-task learning network via attribute groups [14, 10, 11, 16], which groups the attributes manually according to locations or semantics. Although these methods achieve the promising results in FAR, there are three major drawbacks. First, the explicit prior information might not be always available, resulting in the restricted application scenario of these methods. Meanwhile, the manual grouping of facial attributes is not suitable and optimal, since different individuals might give different partitions according to locations or semantics. It means that the relationships among facial attributes could not be defined sufficiently via prior experiences. Furthermore, the inherent spatial diversities of facial parts, caused by different individuals and various poses, are considered insufficiently. For instance, we illustrate the distributions of facial landmarks, including nose, eyes, and mouth, from all individuals in CelebA [9], as shown in Figure 1. The conspicuous overlapping regions of the convex hulls reveal that the spatial ambiguity of facial attributes is existed inherently. In the above mentioned FAR methods, flattening and global pooling are utilized to vectorize the convolutional features, in which the spatial properties are diluted, resulting in the spatial ambiguity of facial attributes, as shown in Figure 2. Therefore, a challenging issue remains:

“Can we design a method to alleviate the spatial ambiguity of facial attributes without any extra prior information” ?

Specifically, the spatial ambiguity of facial attributes means that multiple attribute features might be represented on the same spatial location of feature map. Intuitively, to alleviate the spatial ambiguity, conducting a classifier to recognize the attributes from each spatial location of feature map might be an available approach. However, the traverse of all positions would produce redundant predictions. Since the spatial-wise labels of attributes are unavailable, we have to build a full label-space with all attributes for each spatial location, resulting in many trivial prediction results. Furthermore, a strategy of integrating all predictions would raise the issue of increase in complexity.

In this paper, we focus on FAR method without any extra prior information, denoted as prior-free FAR method. A novel module whose capability of attribute spatial decomposition (ASD) is proposed, termed as assignment-embedding module (AEM). Motivated by hidden factor analysis (HFA) [17], a facial image can be seen as a combination of facial attribute components. Therefore, ASD can be intuitively cast as decomposing the attribute components in the spatial dimension via latent factors, incorporated into a paradigm of supervised learning-based FAR. This is achieved via two operations of attribute-to-location assignment and location-to-attribute embedding. The attribute-to-location assignment aims to decompose the feature map based on latent factors, describing the attribute components on each spatial location. Then, the location-to-attribute embedding aggregates the assigned attribute components from all locations to represent the global-level attribute embeddings. To further enlarge the discriminability of attribute embeddings, correlation matrix minimization (CMM) is introduced to constrain the correlations among latent factors, muting the impact caused by relationships among latent factors on the procedure of assignment. The advantage of ASD is to describe the attribute components formally without any extra prior information. To summarize, the main contributions are as follows:

•

We propose a prior-free FAR method which can decompose the attribute components in the spatial dimension, thereby mitigating the spatial ambiguity of facial attributes. To the best of our knowledge, we are among the first to introduce the insight of attribute spatial decomposition (ASD) for improving the performance of FAR.
•

A novel module, termed as assignment-embedding module (AEM), is proposed to enable the procedure of ASD via attribute-to-location assignment and location-to-attribute embedding.
•

Correlation matrix minimization (CMM) is introduced to decorrelate the latent factors, strengthening the discriminability of the decomposed attribute embeddings.
•

The effectiveness of ASD is demonstrated via ablation studies, and the superior performances are achieved on CelebA [9] and LFWA [9] without any extra prior information. Moreover, the superiority of ASD adopted to the case of limited training data is validated experimentally.

The remainder of the paper is organized as follows. In Section 2, we briefly describe the related works on deep learning-based FAR methods and the decomposition of facial components. Section 3 presents the details of ASD. The experimental results are illustrated in Section 4. The conclusion is finally drawn in Section 5.

2 Related Works

2.1 Facial attribute recognition

With the significant success of deep learning in computer vision community, deep learning-based methods have emerged in the field of FAR. Liu et al. [9] introduce a large scale benchmark for FAR in the wild, termed as CelebFaces Attribute (CelebA). Meanwhile, a deep learning framework is proposed, which consists of two localization networks (LNet) and an attribute recognition network (ANet) for face localization and facial attribute prediction. Rudd et al. [18] cast facial attribute classification as a multi-task problem, in which the learning of each facial attribute is treated as a different task with joint optimization. Hand et al. [19] design a multi-task deep convolutional network (MCNN) based on the relationships between facial attributes. Specifically, 40 facial attributes are first divided into 9 groups according to the facial spatial locations. Then, the shallow layers of MCNN are shared for all attribute groups, while the deep layers are independent belong to corresponding group. The attribute groups are treated as different tasks, and a multi-task learning paradigm is conducted to improve the final performances of FAR. Han et al. [20] estimate the facial attributes via representing the correlation and heterogeneity of attributes in a multi-task learning framework, in which the heterogeneity means the attribute data type, scale, and semantic meaning. Cao et al. [11] introduce identification information to boost a partially shared multi-task convolutional neural network (PS-MCNN) for FAR. The attribute group-wise representations are shared hierarchically among five CNNs. He et al. [12] conduct a dual-path FAR network to leverage features from the original face images and facial abstraction images., where the facial abstraction images are generated by a generative adversarial network (GAN). Chen et al. [16] introduce group attention learning and graph correlation learning to discover correlations among facial parts, while Multi-scale Group and Graph Network (MGG-Net) is proposed for FAR. Shu et al. [15] design three facial-related auxiliary tasks to learn spatial-semantic relationships between facial attributes, and then the spatial-semantic knowledge are transferred to FAR task.

The above mentioned methods often achieve the competitive performances relying on extra prior information. However, the extra prior information, such as identifications, landmarks, and face parsing masks, might be not available, resulting the limitation of these methods. In this paper, we focus on modeling FAR method without any extra prior information.

2.2 Decomposition of facial components

Face is composed of different characteristics physiologically, such as eyes, nose, and lips. Therefore, facial representation in facial analysis tasks can be intuitively decomposed into various task-related facial components. Here, we focus on the deep learning-based facial analysis methods with the decomposition. Inspired by hidden factor analysis (HFA) [17], Wen et al. [21] propose a latent factor guided CNN (LF-CNN) to learn the age-invariant deep facial features for age-invariant face recognition. The observed facial features are formulated as a linear combination including the latent factors in terms of identification, age, noise and mean of facial feature. Li et al. [22] propose a twin-cycle autoencoder (TCAE) for action unit (AU) detection based on a self-supervised learning paradigm. The movements are factorized into AU-related and pose-related displacements among a pair of images, in which facial action-related movements can be disentangled from head motion-related movements. Chen et al. [23] propose a semantic component decomposition for face attribute manipulation. Facial attribute is decomposed into multiple semantic components, each corresponds to a specific face region. Alharbi et al. [24] exploit a grid structure-based noise injection to disentangle the latent space of facial image generation. The several aspects of disentanglement achieve fine-grained control over the generated images. Cheng et al. [25] propose a plug-and-play self-adversarial network to decompose facial feature. The network simultaneously removes entire image feature and preserves gaze-related feature, improving the robust performance of facial gaze estimation.

The above methods of facial analysis decompose the face into global-level facial components, ignoring the ambiguity of facial components on the local locations. Here, we propose ASD which focuses on the decomposition of facial attribute components in the spatial dimension based on the spirit of HFA. The attribute components are described via latent factors on the each spatial location to alleviate the spatial ambiguity. It is worth noting that the formulation of AEM is kind of similar to part localization module (PLM) in [26], but there is an obvious difference. PLM only conducts the learnable vectors to decompose attribute-related information from the feature map, ignoring to consider attribute-irrelated information. In contrast, AEM represent the general formulation for the feature map, which is guided with latent factors including attributes, noise and mean of facial feature. Therefore, PLM can be regarded as a special case of AEM.

3 Attribute Spatial Decomposition

3.1 Overview

ASD aims to represent facial attribute components for each spatial location of a facial image via latent factors, alleviating the spatial ambiguity without any extra prior information. AEM is a key idea in the procedure of ASD, which is adapted for deep learning-based FAR framework, as shown in Figure 3. Specifically, given a facial image $X$ whose feature map $f\in\mathbb{R}^{H\times W\times C}$ is first obtained via a CNN-based feature extractor $F$ as follows:

f=F(X;\theta_{F}),

(1)

where $\theta_{F}$ denotes the learnable parameters of feature extractor, $C$ is the number of channels and $H\times W$ is the resolution. Then, AEM is modeled to decompose the various attribute components of $f$ in the spatial dimension via attribute-to-location assignment, while the attribute components are integrated via location-to-attribute embedding. The two operations of AEM can be formulated as follows:

g=\psi(\phi(f;z),f),

(2)

where $z$ represents the latent factors, $\phi$ and $\psi$ denote attribute-to-location assignment and location-to-attribute embedding, respectively. Finally, a classifier $P$ is conducted to project the integrated attribute embeddings $g$ to the label-space as follows:

p=P(g;w_{P},b_{P}),

(3)

where $w_{P}$ and $b_{P}$ denote the learnable parameters of classifier.

3.2 Assignment-Embedding Module

AEM is in charge of decomposing the attribute components via attribute-to-location assignment and location-to-attribute embedding. Attribute-to-location assignment generates an assignment matrix via similarities between feature map and latent factors. The assignment matrix explicitly describes the magnitudes of attribute components on each spatial location, which can be seen as a HFA-like formulation. It is worth noting that we generally formulate the feature map including the components of attributes, noise, and mean of facial feature. Location-to-attribute embedding integrates attribute components from all locations guided with the assignment matrix to represent the attribute embeddings of the entire image.

3.2.1 Attribute-to-location Assignment

The feature map $f$ is first flattened as $f^{\prime}$ with the size of $M\times C$ , where $M=H\times W$ . Then, the HFA-like formulation of the $i$ -th spatial location $\hat{f}_{i}$ can be defined as follows:

\hat{f}_{i}=\underbrace{\sum_{j=1}^{D}\alpha_{i,j}f_{i}+\alpha_{i,D+1}f_{i}+\bar{f},}_{\text{attributes, noise, and mean of feature}}

(4)

where $\bar{f}\in\mathbb{R}^{C}$ is the mean of feature map, $f_{i}\in\mathbb{R}^{C}$ denotes the feature on the $i$ -th spatial location of $f^{\prime}$ . The components of attributes, noise and mean of facial feature are combined linearly with assigning magnitudes. Equation 4 can explicitly describe the magnitude of each component in the spatial dimension, alleviating the spatial ambiguity of facial attributes formally. $\alpha_{i,j}$ is the assigning magnitude in the $(i,j)$ position of assignment matrix $A$ which can be generated as follows:

\bm{\mathrm{A}}=\left[\begin{array}[]{ccc}\frac{e^{s(z_{1},f_{1})}}{\sum_{j=1}^{D+1}e^{s(z_{j},f_{1})}}&\cdots&\frac{e^{s(z_{D+1},f_{1})}}{\sum_{j=1}^{D+1}e^{s(z_{j},f_{1})}}\\ \vdots&\ddots&\vdots\\ \frac{e^{s(z_{1},f_{M})}}{\sum_{j=1}^{D+1}e^{s(z_{j},f_{M})}}&\cdots&\frac{e^{s(z_{D+1},f_{M})}}{\sum_{j=1}^{D+1}e^{s(z_{j},f_{M})}}\end{array}\right],

(5)

where $\bm{\mathrm{A}}\in\mathbb{R}^{M\times(D+1)}$ denotes the assignment matrix. The assigning magnitude $\alpha$ is defined to describe the magnitude of components for each spatial location, including $D$ attributes and noise. $s$ denotes a cosine similarity between $f^{\prime}$ and $z$ , which is formulated as follows:

s(a,b)=\frac{a^{\mathsf{T}}b}{|a||b|},

(6)

where $|\cdot|$ is the modulo operation of vector. $z_{j}\in\mathbb{R}^{C}$ is the $j$ -th learnable latent factor of $z$ , representing the $j$ -th attribute in the latent space. $z_{D+1}$ denotes the latent noise factor which is an additional term in the formulation. We argue that $z_{D+1}$ can guide an explicit assignment of the attribute-irrelated components, enhancing the discriminability of the attribute-related.

3.2.2 Location-to-attribute Embedding

After attribute-to-location assignment, attribute components can be extracted from each spatial location explicitly via assigning magnitudes. Then, the attribute components from all locations are integrated to attribute embeddings via location-to-attribute embedding as follows:

g_{j}=\sum_{i=1}^{M}\alpha_{i,j}f_{i}+\bar{f},

(7)

where $g_{j}\in\mathbb{R}^{C}$ is the $j$ -th attribute embedding. Furthermore, the entire attribute embeddings $g$ can be simplified with matrix operations as follows:

g=\bm{\mathrm{A}}^{\mathsf{T}}\!f^{\prime}+[\,\bar{f}\,]_{(D+1)\times C,}

(8)

where $[\,\bar{f}\,]_{(D+1)\times C}\in\mathbb{R}^{(D+1)\times C}$ denotes a matrix whose entities are $\bar{f}$ . To formulate the attribute embeddings generally, we preserve the noise embedding as the $(D+1)$ -th embedding. Finally, the two operations of AEM are summarized as the pseudo code in Algorithm 1.

Algorithm 1 Pseudo code of assignment-embedding module (AEM) in a PyTorch-like style.

⬇

# In: Feature map f.

# Out: Attribute embeddings g.

class AEM(nn.Module):

def __init__(self, D, C):

self.z = nn.Parameter(torch.rand((D+1, C)))

def forward(self, f):

# To simplification, the dimension of batchsize is omitted.

W, H, C = f.size()

# f is flattened as [M, C].

f = f.view(W*H, C)

# Attribute-to-location assignment.

norm_z, norm_f = F.normalize(self.z), F.normalize(f)

A = torch.exp(norm_f @ norm_z.t())

A /= torch.sum(A)

# Location-to-attribute embedding.

g = A.t() @ f + torch.mean(f)

return g

3.3 Correlation Matrix Minimization

Since the high correlation among latent factors would weaken the distinctions of assignment magnitudes in $A$ , we introduce a regularization, termed as correlation matrix minimization (CMM), to reduce the correlation among $z$ . The correlation matrix of $z$ can be obtained as follows:

\bm{\mathrm{H}}=\left[\begin{array}[]{ccc}s(z_{1},z_{1})&\cdots&s(z_{1},z_{D+1})\\ \vdots&\ddots&\vdots\\ s(z_{D+1},z_{1})&\cdots&s(z_{D+1},z_{D+1})\end{array}\right],

(9)

where $s$ is presented in Equation 6. Then, CMM can be formulated as follows:

\mathcal{L}_{cmm}=\sum_{i=1}^{D+1}(1-s(z_{i},z_{i}))^{2}+\sum_{i=1}^{D+1}\sum_{j=1,i\neq j}^{D+1}\!\!s^{2}(z_{i},z_{j}),

(10)

where the first term can be omitted obviously, since $s(z_{i},z_{i})\triangleq 1$ , and the second term enforces the off-diagonal elements of correlation matrix to 0, decorrelating the latent factors explicitly.

3.4 Classification and Loss Function

For each attribute embedding, we cast the attribute predictions as multiple binary classification tasks. The $j$ -th representation of attribute embedding is projected to a logit value linearly, and then a sigmoid function $\sigma$ is utilized to convert the logit value as a probability $p_{j}$ , which can be formulated as follows:

p_{j}=\sigma(w^{\mathsf{T}}_{P_{j}}g_{j}+b_{P_{j}}),

(11)

where $w_{P_{j}}$ and $b_{P_{j}}$ refer to the weight vector and bias, respectively. The final prediction $p$ is generated via concatenating the probabilities of attribute embeddings as follows:

p=\big{[}p_{1},p_{2},\dots,p_{D},p_{D+1}\big{]},

(12)

where $p\in\mathbb{H}^{D+1}$ is the prediction of $D$ attributes and noise, where $\mathbb{H}$ represents Hamming space.

The loss function of the classification is cross entropy loss which can be formulated as follows:

\mathcal{L}_{cls}=-\frac{1}{D+1}\sum^{D+1}_{j=1}y_{j}\log p_{j},

(13)

where $y_{j}$ is the label of the $j$ -th attributes. $y_{D+1}$ is set to 0, since the label of noise is not available. The final loss function can be defined as follows:

\mathcal{L}=\mathcal{L}_{cls}+\gamma\mathcal{L}_{cmm},

(14)

where $\gamma$ is a hyper-parameter to balance the importance between the classification and the constraint of latent factors.

4 Experiment

4.1 Datasets

We conduct experiments on two public facial attribute datasets including CelebA [9] and LFWA [9], which is widely used to evaluate the method of FAR.

CelebA is a large-scale facial attribute dataset containing 202,599 facial images divided into 3 subsets in terms of training, validation, and testing. The numbers of facial images for 3 subsets are 162,770, 19,867, and 19,962, respectively. The facial images are annotated with 40 attribute labels.

LFWA is another popular facial attribute dataset composed of 13,143 facial images with 6,263 for training, 2,800 for validation, and 4,080 for testing. The facial images are also annotated with the same attribute labels as CelebA.

In the experiments, we follow the protocol of CelebA and LFWA that the default training set is used to train our method, while the performance of our method is evaluated on the default testing set.

4.2 Implementation

The experiments are conducted on a work station with NVIDIA RTX 2080Ti GPUs. The proposed method is implemented based on PyTorch deep learning framework. The backbone of feature extractor is ResNet50 [27] pre-trained on ImageNet [28], whose capability of feature extraction has been verified in many computer vision tasks [29, 30, 31, 32]. Following [17, 21], the latent factors $z$ are initialized with the distribution of $\mathcal{N}(0,I)$ . In the training phase, Adam [33] is used to optimize the learnable parameters with a weight decay of 5e-4. The total epochs are 80 and the initial learning rate is 3e-4, while the learning rate is reduced with the decay ratio of 0.1 after every 20 epochs. The value of $\gamma$ is set to 2e-2 empirically. The input images are scaled as $224\times 224$ and random flip is used as data argumentation. Particularly, the sizes of batches for training on CelebA and LFWA are 64 and 32, respectively.

4.3 Ablation studies

To evaluate the effectiveness of the proposed method comprehensively, we design the ablation studies in terms of ASD, AEM, and CMM.

4.3.1 The effectiveness of ASD for different feature extractors

Table 1: Ablation analysis of ASD for the different feature extractors. “w/ ASD” represents the method constructed with attribute spatial decomposition.

Backbone	w/ ASD	CelebA (%)	LFWA (%)
ResNet18		90.93	85.75
ResNet18	✓	91.67	86.68
ResNet50		91.39	86.47
ResNet50	✓	92.22	87.43
ResNeXt50		91.63	86.78
ResNeXt50	✓	92.21	87.44

To clarify the effectiveness of ASD, we conduct the ablation experiments based on the different feature extractors. Three popular CNNs for FAR are introduced, including ResNet18 [27], ResNet50 [27], and ResNeXt50 [34]. Specifically, two versions of FAR method for corresponding feature extractors are constructed. The first version of the method is implemented without ASD, in which the global pooled feature is directly used as the input of the fully connected layer-based classifier. In contrast, the second version of method is designed with ASD. The results of the different feature extractors-based methods in terms of average accuracy are reported in Table 1. We observe that ASD improves the performances of FAR obviously. For CelebA, the average accuracies of the methods based on various backbones are improved about 0.7%, 0.9%, and 0.6%, respectively. For LFWA, the improvements caused via ASD are 0.9%, 0.9%, and 0.7%, respectively. The experimental results demonstrate the effectiveness of ASD whose superiority is unaffected by the different feature extractors. Moreover, to interpret ASD intuitively, we visualize the assignment matrix $A$ which is the key idea of ASD, as shown in Figure 4. The heatmaps highlight the corresponding attribute-related regions evidently. It argues that the attribute components can be described via ASD in the spatial dimension. The spatial ambiguity of facial attributes is alleviated, thereby improving the performance of FAR effectively. To consider the trade-off between efficiency and performance, we implement ASD based on ResNet50 in the subsequent experiments.

Table 2: Ablation analysis of

z_{D+1}

and

\bar{f}

. “w/o

z_{D+1}

” and “w/o

\bar{f}

” denote AEM without latent noise factor and mean of feature, respectively.

w/o $z_{D+1}$	w/o $\bar{f}$	CelebA (%)	LFWA (%)
✓	✓	91.76	86.98
	✓	91.91	87.23
✓		91.93	87.22

Table 3: Ablation analysis of CMM. “w/ CMM” denotes AEM correlation matrix minimization.

w/ CMM	CelebA (%)	LFWA (%)
	92.03	87.27
✓	92.22	87.43

4.3.2 Latent noise factor and mean of feature in AEM

To investigate the impact of latent noise factor $z_{D+1}$ and mean of feature $\bar{f}$ , we design the methods based on ResNet50 without $z_{D+1}$ and $\bar{f}$ , separately. Meanwhile, the method without both $z_{D+1}$ and $\bar{f}$ is constructed, termed as vanilla AEM. As shown in Table 2, the performance of the method including $z_{D+1}$ is higher than the vanilla AEM, while the formulation of $\bar{f}$ in the AEM also improves the overall performance.

4.3.3 The effectiveness of CMM

To validate the effectiveness of CMM, we further train the method based on ResNet50 without CMM. The quantitative results are reported in Table 3. The method with CMM achieves a higher accuracy that about 0.2% increases on both CelebA and LFWA. It argues that the decorrelation among latent factors can enhance the discriminability of $g$ to improve the overall performance of ASD.

4.4 Comparison with State-of-the-art Methods

Table 4: Comparison of our method with the state-of-the-art methods on CelebA and LFWA. “w/ EP” and “w/ IP” denote the method with explicit prior information and implicit prior information, respectively. The best and the second best performances of average accuracy (%) are marked with red and blue, respectively. “-” represents the result is not provided by corresponding paper.

Method	w/ EP	w/ IP	CelebA (%)	LFWA (%)
AFFAIR [10]	✓		91.45	86.13
PS-MCNN [11]	✓	✓	92.98	87.36
HSA [12]	✓		91.81	85.20
DMM [13]	✓	✓	91.70	86.56
SlimCNN [14]		✓	91.24	76.02
SSPL [15]	✓		91.77	86.53
MGG-Net [16]		✓	92.00	87.20
HFE [35]		✓	91.24	-
ASD (Ours)			92.22	87.43
CSN [26]			91.80	-
TResNetM [36]			91.72	86.67
ASD (Ours)			92.22	87.43

We compare our method with 10 state-of-the-art methods lately reported on CelebA and LFWA, which can be categorized into as prior-based and prior-free methods. The prior-based methods include ARRAIR [10], PS-MCNN [11], HSA [12], DMM [13], SlimCNN [14], SSPL [15], MGG-Net [16], and HFE[35], where prior information, such as landmarks, face parsing masks, and attribute prior embeddings, are utilized in the training phase. CSN [26] and TResNetM [36] are general prior-free methods for multi-label classification task. Here, we implement TResNetM on CelebA and LFWA with the training scheme provided in [36], such as optimizer, learning rate, and training epochs. Besides [36], we list the experimental results of other methods reported from corresponding papers, as shown in Table 4.

For the comparison with prior-based methods, ASD achieves the competitive performances than state-of-the-art methods on both CelebA and LFWA. It can be observed that there is still a performance gap between ASD and PS-MCNN. However, we argue the superiority of ASD is obvious. First, the architecture of ASD is simpler. PS-MCNN consists of five CNN-based networks whose initializations are pre-trained on face recognition tasks, separately. Such stage-based training procedure inevitably increases the complicacy of the implementation. In contrast, ASD is composed of a single CNN-based network and AEM, which can be trained via an end-to-end paradigm. The another advantage of ASD is that the performance of ASD is not relied on extra prior information. In [11], the performance of PS-MCNN without identified information and prior attribute groups would be deteriorated, where the average accuracy of PS-MCNN on CelebA is reduced from 92.98% to 91.15%.

For the comparison with prior-free methods, the performance of ASD is superior than CSN and TResNetM. The reason for the low performance of TResNetM might be that the larger batches should be conducted, like the setting in [36]. To come up with a fair comparison, we gradually adopt the various sizes of batches for TResNetM trained on CelebA. The experimental results are listed in Table 5. It can be seen that the improvements of TResNetM caused by increasing the size of batches are slight, while there is a significant growth of GPU memory costs in the training phase. Here, ASD achieves a good trade-off between average accuracy and computational burdens, implying the architecture of ASD without bells and whistles.

Table 5: Comparison of our method with TResNetM for various sizes of batches on CelebA.

Method	batches	GPU memory	CelebA (%)
TResNetM	64	4,200 MiB	91.72
TResNetM	128	6,300 MiB	91.97
TResNetM	256	12,100 MiB	92.23
ASD (Ours)	64	5,600 MiB	92.22

Table 6: Performance with limited training data on CelebA. 0.2%, 0.5%, 1%, and 2% are the different ratios of the limited training sets including 325, 832, 1,627, and 3,255 training samples, respectively. The best performance of average accuracy (%) is marked with red.

Method	CelebA (%)
ratios	0.2%	0.5%	1%	2%
numbers of training samples	325	832	1627	3255
DeepCluster [37]	83.21	86.13	87.46	88.86
JigsawPuzzle [38]	82.88	84.71	86.25	87.77
Rot [39]	83.25	86.51	87.67	88.82
FixMatch [40]	80.22	84.19	85.77	86.14
VAT [41]	81.44	84.02	86.30	87.28
SSPL [15]	86.67	88.05	88.84	89.58
ASD (Ours)	87.82	89.19	89.90	90.23

4.5 Performance with Limited Training Data

The limited training data case is a latest challenge for FAR, where only a small ratios of training samples can be utilized in the training phase. To investigate the performance of ASD with limited training data, ASD is compared with other methods including DeepCluster [37], JigsawPuzzle [38], Rotation [39], FixMatch [40], VAT [41], and SSPL [15].

Following [15], we select the samples randomly from the training set with various small ratios (0.1%, 0.2%, 1%, and 2%), constructing the limited training set for ASD. The experimental results of each ratio are the mean of 10 times, while the limited training sets are repartitioned in each trial. The performances of ASD with the different ratios of limited training sets are reported in Table 6. The experimental results show the the superior performance of ASD comparing with other methods, revealing the reliable capability of ASD with the limited training data case for FAR.

5 Conclusion

In this paper, we propose a novel prior-free method for facial attribute recognition (FAR), termed as attribute spatial decomposition (ASD), mitigating the spatial ambiguity of facial attributes formally without any extra prior information. Assignment-embedding module (AEM) is modeled to enable ASD via latent factors, while correlation matrix minimization (CMM) is introduced to improve the discriminability of decomposed attribute embeddings. Experimental results demonstrate that ASD achieves the competitive performances without any extra prior information compared with state-of-the-art prior-based methods. Furthermore, the reliable performance of ASD for the case of limited training data is verified.

References

[1] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar, “Describable visual attributes for face verification and image search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1962–1977, 2011.
[2] X. Di, B. S. Riggan, S. Hu, N. J. Short, and V. M. Patel, “Multi-scale thermal to visible face verification via attribute guided synthesis,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, no. 2, pp. 266–280, 2021.
[3] Y. Li, T. Zhang, and C. L. P. Chen, “Enhanced broad siamese network for facial emotion recognition in human–robot interaction,” IEEE Transactions on Artificial Intelligence, vol. 2, no. 5, pp. 413–423, 2021.
[4] R. Wadhawan and T. Gandhi, “Landmark-aware and part-based ensemble transfer learning network for static facial expression recognition from images,” IEEE Transactions on Artificial Intelligence, pp. 1–1, 2022.
[5] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” in European conference on computer vision. Springer, 2016, pp. 776–791.
[6] Y. Liu, Q. Sun, X. He, A.-A. Liu, Y. Su, and T.-S. Chua, “Generating face images with attributes for free,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2733–2743, 2021.
[7] B.-C. Chen, Y.-Y. Chen, Y.-H. Kuo, and W. H. Hsu, “Scalable face image retrieval using attribute-enhanced sparse codewords,” IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1163–1173, 2013.
[8] A. Zaeemzadeh, S. Ghadar, B. Faieta, Z. Lin, N. Rahnavard, M. Shah, and R. Kalarot, “Face image retrieval with attribute manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 116–12 125.
[9] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3730–3738.
[10] J. Li, F. Zhao, J. Feng, S. Roy, S. Yan, and T. Sim, “Landmark free face attribute prediction,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4651–4662, 2018.
[11] J. Cao, Y. Li, and Z. Zhang, “Partially shared multi-task convolutional neural network with local constraint for face attribute learning,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4290–4299.
[12] K. He, Y. Fu, W. Zhang, C. Wang, Y.-G. Jiang, F. Huang, and X. Xue, “Harnessing synthesized abstraction images to improve facial attribute recognition,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, p. 733–740.
[13] L. Mao, Y. Yan, J.-H. Xue, and H. Wang, “Deep multi-task multi-label cnn for effective facial attribute classification,” IEEE Transactions on Affective Computing, 2020.
[14] A. K. Sharma and H. Foroosh, “Slim-cnn: A light-weight cnn for face attribute prediction,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, pp. 329–335.
[15] Y. Shu, Y. Yan, S. Chen, J.-H. Xue, C. Shen, and H. Wang, “Learning spatial-semantic relationship for facial attribute recognition with limited labeled data,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11 911–11 920.
[16] Z. Chen, S. Gu, F. Zhu, J. Xu, and R. Zhao, “Improving facial attribute recognition by group and graph learning,” in 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1–6.
[17] D. Gong, Z. Li, D. Lin, J. Liu, and X. Tang, “Hidden factor analysis for age invariant face recognition,” in 2013 IEEE International Conference on Computer Vision, 2013, pp. 2872–2879.
[18] E. M. Rudd, M. Günther, and T. E. Boult, “Moon: A mixed objective optimization network for the recognition of facial attributes,” in Computer Vision – ECCV 2016, 2016, pp. 19–35.
[19] E. M. Hand and R. Chellappa, “Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 4068–4074.
[20] H. Han, A. K. Jain, F. Wang, S. Shan, and X. Chen, “Heterogeneous face attribute estimation: A deep multi-task learning approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 11, pp. 2597–2609, 2018.
[21] Y. Wen, Z. Li, and Y. Qiao, “Latent factor guided convolutional neural networks for age-invariant face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4893–4901.
[22] Y. Li, J. Zeng, S. Shan, and X. Chen, “Self-supervised representation learning from videos for facial action unit detection,” in Proceedings of the IEEE/CVF Conference on Computer vision and pattern recognition, 2019, pp. 10 924–10 933.
[23] Y.-C. Chen, X. Shen, Z. Lin, X. Lu, I. Pao, J. Jia et al., “Semantic component decomposition for face attribute manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9859–9867.
[24] Y. Alharbi and P. Wonka, “Disentangled image generation through structured noise injection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5134–5142.
[25] Y. Cheng, Y. Bao, and F. Lu, “Puregaze: Purifying gaze feature for generalizable gaze estimation,” arXiv preprint arXiv:2103.13173, 2021. [Online]. Available: https://arxiv.org/abs/2103.13173
[26] X. Zhao, Y. Yang, F. Zhou, X. Tan, Y. Yuan, Y. Bao, and Y. Wu, “Recognizing part attributes with insufficient data,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 350–360.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[28] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
[29] C. Hu and Y. Wang, “An efficient convolutional neural network model based on object-level attention mechanism for casting defect detection on radiography images,” IEEE Transactions on Industrial Electronics, vol. 67, no. 12, pp. 10 922–10 930, 2020.
[30] A. Basu, S. S. Mullick, S. Das, and S. Das, “Do pre-processing and class imbalance matter to the deep image classifiers for covid-19 detection an explainable analysis,” IEEE Transactions on Artificial Intelligence, pp. 1–1, 2022.
[31] C. d. Vente, L. H. Boulogne, K. V. Venkadesh, C. Sital, N. Lessmann, C. Jacobs, C. I. Sánchez, and B. v. Ginneken, “Automated covid-19 grading with convolutional neural networks in computed tomography scans: A systematic comparison,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 2, pp. 129–138, 2022.
[32] C. Zhang, Y. Tao, K. Du, W. Ding, B. Wang, J. Liu, and W. Wang, “Character-level street view text spotting based on deep multisegmentation network for smarter autonomous driving,” IEEE Transactions on Artificial Intelligence, vol. 3, no. 2, pp. 297–308, 2022.
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[34] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
[35] J. Yang, J. Fan, Y. Wang, Y. Wang, W. Gan, L. Liu, and W. Wu, “Hierarchical feature embedding for attribute recognition,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13 052–13 061.
[36] T. Ridnik, H. Lawen, A. Noy, E. Ben, B. G. Sharir, and I. Friedman, “Tresnet: High performance gpu-dedicated architecture,” in 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1399–1408.
[37] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 132–149.
[38] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Computer Vision – ECCV 2016, 2016, pp. 69–84.
[39] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations 2018, 2018. [Online]. Available: https://openreview.net/forum?id=S1v4N2l0-
[40] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in Neural Information Processing Systems, vol. 33, pp. 596–608, 2020.
[41] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.

{IEEEbiography}

[ [Uncaptioned image] ] Chuanfei Hu received the M.S. degree from the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China. He is currently pursuing the Ph.D. degree with the School of Automation, Southeast University, Nanjing, China, in 2021. His research interests include deep learning, attribute learning, and affective computing.

{IEEEbiography}

[ [Uncaptioned image] ] Hang Shao received the M.S. degree from the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China. He is currently pursuing the Ph.D. degree with School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, in 2021. His research interests include deep learning and representation learning.

{IEEEbiography}

[ [Uncaptioned image] ] Bo Dong received the B.S. degree from the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China. He is currently pursuing the M.S. degree in biomedical engineering with Zhejiang University, Zhejiang, China, in 2021. His main research interests include computer version and medical image analysis.

{IEEEbiography}

[ [Uncaptioned image] ] Zhe Wang received the M.S. degree from the School of Electrical and Electronic Engineering, Tianjin University of Technology, Tianjin, China. He is currently pursuing the Ph.D. degree with the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China, in 2020. His research interests include affective computing, signal processing, and brain-computer interface.

{IEEEbiography}

[ [Uncaptioned image] ] Yongxiong Wang received the B.S. degree in engineering mechanics from Harbin Engineering University, Harbin, China, and the M.S. and Ph.D. degrees in control science and engineering from Shanghai Jiao Tong University, Shanghai, China, in 1991. He is currently a Professor with the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai. His research interests include computer vision, affective computing, and intelligent robot.