
Unsupervised Learning of Debiased Representations with Pseudo-Attributes

Seonguk Seo1   Joon-Young Lee3   Bohyung Han1,2
1ECE & 1ASRI & 1,2IPAI, Seoul National University    3Adobe Research
{seonguk, bhhan}@snu.ac.kr  jolee@adobe.com
Abstract

Dataset bias is a critical challenge in machine learning since it often leads to a negative impact on a model due to the unintended decision rules captured by spurious correlations. Although existing works often handle this issue with human supervision, proper annotations of bias are frequently unavailable or even unrealistic to obtain. To better tackle this limitation, we propose a simple but effective unsupervised debiasing technique. Specifically, we first identify pseudo-attributes via clustering in the feature embedding space, without any explicit supervision of bias attributes. Then, we employ a novel cluster-wise reweighting scheme to learn debiased representations; the proposed method prevents minority groups from being discounted when minimizing the overall loss, which is desirable for worst-case generalization. Extensive experiments demonstrate the outstanding performance of our approach on multiple standard benchmarks, which is even comparable to its supervised counterpart. The source code is available at our project page: https://github.com/skynbe/pseudo-attributes.

1 Introduction

Figure 1: Representative examples from the CelebA dataset illustrating the problem we focus on. Since most people with blond hair are women, the hair color attribute has a spurious correlation with the gender attribute. Thus, when trained to classify hair color, a network captures an unintended decision rule based on gender, leading to poor worst-group and unbiased accuracies despite its high overall accuracy. Our model aims to learn debiased representations, which give better worst-group and unbiased accuracies, especially when the bias information is unavailable.

Deep neural networks have achieved impressive performance by minimizing the average loss on training datasets. Although we typically adopt the empirical risk minimization framework as a training objective, it is sometimes problematic because dataset bias leads to significant degradation of worst-case generalization performance, as discussed in [2, 37, 18, 12, 38]. This is because models do not always learn what we expect but rather capture unintended decision rules from spurious correlations. For example, on the Colored MNIST dataset [21, 25, 1], where each digit is highly correlated with a certain color, a network often learns the color patterns in images rather than the digit information, while ignoring the few conflicting samples. Such an unintended rule works well on most of the training samples but incurs unexpected worst-case errors on the examples in minority groups, which makes the model unable to generalize to test environments with distribution shifts or robustness constraints. Figure 1 illustrates the problem that we mainly deal with in this paper.

To mitigate the bias issue, learning debiased representations from a biased dataset has received growing attention in the machine learning community [21, 34, 26, 28, 14, 1, 3]. However, most previous works rely on explicit supervision or prior knowledge under the assumption that the dataset bias is known. This setting is problematic because it is challenging to identify what kinds of bias exist and which attributes involve spurious correlations without a thorough analysis of the model and dataset. Note that, even with information about the dataset bias, obtaining the relevant annotations over all training examples is challenging. Contrary to the supervised approaches, [25, 30] tackle a more challenging setting, where the bias information is unavailable, via failure-based learning or subgroup-based penalizing.

This paper presents a simple but effective unsupervised debiasing technique via feature clustering and cluster reweighting. We first observe that examples sharing the same label for a certain attribute other than the target attribute tend to have similar representations in the feature space of a sufficiently trained model. Based on this observation, we estimate bias pseudo-attributes in an unsupervised manner from the clustering results within each class. To exploit the bias pseudo-attributes for learning debiased representations, we introduce a reweighting scheme for the corresponding clusters, where each cluster has an importance weight depending on its size and task-specific accuracy. This strategy encourages the minority clusters to participate actively in the optimization process, which is critical to improving worst-group generalization. Despite its simplicity, our method turns out to be effective for debiasing without any supervision of bias information; it is even comparable to a supervised debiasing method. The main contributions of our work are summarized below:

  • We propose a simple but effective unsupervised debiasing approach, which requires no explicit supervision about spurious correlations across attributes.

  • We introduce a technique to learn debiased representations by identifying bias pseudo-attributes via clustering and reweighting the corresponding clusters based on both their size and target loss.

  • We provide extensive experimental results and achieve outstanding performance in terms of unbiased and worst-group accuracies, which are even as competitive as supervised debiasing methods.

The rest of this paper is organized as follows. We review the prior research in Section 2. Section 3 presents the proposed framework for learning debiased representations, and Section 4 demonstrates its effectiveness with extensive empirical analysis. We conclude our paper in Section 5.

2 Related Work

2.1 Bias in computer vision tasks

Real-world datasets are inevitably biased due to their insufficiently controlled collection processes, and consequently, deep neural networks often capture unintended correlations between the true labels and spuriously correlated ones. Measuring and mitigating the potential risks posed by dataset or algorithmic bias has been extensively investigated in various computer vision tasks [6, 31, 21, 16, 33, 3, 36]. For example, VQA models frequently exploit statistical regularities between answer occurrences and patterns of questions while ignoring visual information [3, 7]. Semantic segmentation models typically take advantage of scene context for pixel-level predictions of semantic labels [6]. To prevent the use of undesirable correlations in biased datasets, existing approaches often rely on human supervision for bias annotations and present several technical directions such as data augmentation [13, 39], model ensembles [3, 7], and statistical regularization [1]. Such supervised debiasing techniques have been applied to various computer vision tasks by exploiting known application-specific bias information, including uni-modality of the dataset in visual question answering [3], stereotypical textures in image recognition [13], temporal invariance in action recognition [20], and demographic information in facial image recognition [26, 39, 35].

2.2 Handling distribution shifts

Distribution shift has recently emerged as a critical challenge in machine learning, where the goal of the optimization is to learn a model that remains robust in a test environment with a different distribution. Distributionally robust optimization (DRO) [2, 11, 9, 17] has been proposed to improve the worst-case generalization performance over a family of target distributions, and has provided the theoretical background of Group DRO [26] and its variation [30]. However, the objective of DRO often leads to an overly conservative model and results in performance degeneration on unseen environments [10, 15]. To relax the constraints on the uncertainty set of test distributions, some approaches pose additional assumptions, for instance, that the dataset consists of multiple groups with shared properties and that the uncertainty set is represented by a mixture of these groups. This assumption is also used in robust federated learning [24, 19], algorithmic fairness [37, 5, 8], and domain generalization [18, 4, 29]. Our framework also takes advantage of this assumption but does not rely on the supervision of group information.

2.3 Debiasing via loss-based reweighting

There exist several generic debiasing techniques via sample reweighting based on observed task-specific losses in the supervised [26] or unsupervised [25, 30] setting. Group DRO [26] exploits the group information specified by the bias attributes and aims to improve the worst-group generalization performance. On the other hand, Nam et al. [25] employ the difference between the generalized and standard cross-entropy losses to capture bias alignment for sample reweighting, while Sohoni et al. [30] estimate subclass labels via clustering and utilize the information for distributionally robust optimization to mitigate hidden stratification. Although the unsupervised approaches work well on small and artificial datasets such as MNIST, their performance improvement becomes marginal on real-world datasets including CelebA. Our framework also belongs to the unsupervised methods that do not rely on bias information to learn debiased representations.

3 Method

This section presents our debiasing technique via bias pseudo-attribute estimation and sample reweighting.

3.1 Preliminaries

Let an example $\mathbf{x}$ be associated with a set of $m$ attributes $\mathcal{A}:=\{a_{1},\dots,a_{m}\}$. The goal of our model is to predict a target attribute $a_{t}\in\mathcal{A}$ by estimating the intended causation $p(a_{t}|\mathbf{x})$, which does not involve any undesirable correlation to other latent attributes, i.e., $p(a_{t}|\mathbf{x})=p(a_{t}|\mathbf{x},a_{i})$, $\forall a_{i}\in\mathcal{A}-\{a_{t}\}$. On the other hand, spurious correlation indicates strong coincidence between two attributes $a_{i},a_{j}\in\mathcal{A}$; the conditional entropy $H(a_{i}|a_{j})$ is close to zero and there exists no causal relationship between them. A machine learning algorithm is considered biased if a certain attribute $a_{b}\in\mathcal{A}$ has a spurious correlation with the target attribute $a_{t}$ and affects the prediction, i.e., $p(a_{t}|\mathbf{x})\neq p(a_{t}|\mathbf{x},a_{b})$. Our approach performs debiasing by estimating groups in the dataset without supervision, where a group is defined by a pair of target and bias attributes, e.g., $g=(a_{t},a_{b})$.

3.2 Observation

If a bias attribute is highly correlated with a target attribute while being easy to learn, the model may ignore the few conflicting examples and learn its decision rule based on the bias attribute with spurious correlations to maximize accuracy [25, 27]. To prevent this undesirable situation, simple group upweighting or resampling strategies [27] are known to be effective, but they work poorly in realistic scenarios where the bias information is unknown during training.

To overcome this challenge, we analyze the feature semantics over the target and bias attributes. We first naïvely train a base model on the CelebA dataset to classify hair color, and visualize the representations of the examples after convergence with a sufficient number of epochs ($T=100$). We select gender as the bias attribute but do not utilize any information about it during training. It turns out that, even without using the bias information during training, the examples drawn from certain groups, which are given by combinations of hair color and gender attribute values in this case, e.g., (male, non-blond) and (female, blond), are located closely in the feature space. This observation implies that it is possible to identify bias pseudo-attributes by taking advantage of the embedding results even without attribute supervision. Our unsupervised debiasing framework is built on this capability to identify bias pseudo-attributes via clustering.
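To make the observation above concrete, the following is a minimal sketch of how pseudo-attributes could be extracted by clustering features of a trained base model; the `extract_features` hook, the scikit-learn k-means call, and the per-class clustering granularity are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def assign_pseudo_attributes(base_model, loader, clusters_per_class, device="cuda"):
    """Cluster penultimate features within each class of a trained base model;
    the resulting cluster ids serve as bias pseudo-attributes."""
    base_model.eval()
    feats, labels = [], []
    for x, y in loader:
        f = base_model.extract_features(x.to(device))  # assumed feature hook
        feats.append(f.cpu().numpy())
        labels.append(y.numpy())
    feats, labels = np.concatenate(feats), np.concatenate(labels)

    cluster_ids = np.zeros(len(labels), dtype=np.int64)
    offset = 0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=clusters_per_class, n_init=10).fit(feats[idx])
        cluster_ids[idx] = km.labels_ + offset  # globally unique cluster ids
        offset += clusters_per_class
    return cluster_ids
```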

3.3 Formulation

Suppose that training examples $(\mathbf{x},y)$ are drawn from a certain distribution $\tilde{P}$. Given a loss function $\ell(\cdot)$ and model parameters $\theta$, the objective of empirical risk minimization (ERM) is to optimize the following expected loss:

$$\min_{\theta}\ \mathbb{E}_{(\mathbf{x},y)\sim P}\big[\ell((\mathbf{x},y);\theta)\big], \tag{1}$$

where $P$ is the empirical distribution over training data that approximates the true distribution $\tilde{P}$. Although ERM generally works well, it tends to ignore examples in minority groups that conflict with bias attributes and implicitly assumes the consistency of the underlying distributions of training and test data. Consequently, the approach often leads to high unbiased and worst-group test errors [9, 26].

Several distributionally robust optimization (DRO) techniques [2, 9, 17] can be employed to tackle the dataset bias and distribution shift problems and maximize unbiased generalization accuracy. They consider a particular uncertainty set $\mathcal{Q}_{P}$, which is close to the training distribution $P$, e.g., $\mathcal{Q}_{P}=\{Q:D_{f}[Q\,\|\,P]\leq\delta\}$, where $D_{f}[\cdot\,\|\,\cdot]$ indicates an $f$-divergence (for probability distributions $P$ and $Q$ over a space $\Omega$, the $f$-divergence is $D_{f}(P\,\|\,Q)=\int_{\Omega}f\big(\tfrac{dP}{dQ}\big)\,dQ$). To minimize the worst-case loss over the uncertainty distribution set $\mathcal{Q}_{P}$, DRO optimizes

$$\min_{\theta}\ \Bigl\{\mathbb{R}_{\mathcal{Q}_{P}}(\theta):=\sup_{Q\in\mathcal{Q}_{P}}\mathbb{E}_{(\mathbf{x},y)\sim Q}\big[\ell((\mathbf{x},y);\theta)\big]\Bigr\}. \tag{2}$$

However, this objective is overly pessimistic and makes the model consider the implausible worst cases [10, 15].

The group distributionally robust optimization, referred to as group DRO [26], constructs a more realistic set of possible test distributions by leveraging prior knowledge of group information. It assumes that the training distribution $P$ is a mixture of $G$ groups, $P_{g}$, which is given by

$$P_{\mathcal{G}}=\sum_{g\in\mathcal{G}}c_{g}P_{g},\qquad c\in\Delta_{G}, \tag{3}$$

where $\mathcal{G}=\{1,\dots,G\}$ and $\Delta_{G}$ is a $(G-1)$-dimensional simplex. Then the uncertainty set $\mathcal{Q}_{P}$ is defined by the set of all possible mixtures of these groups, i.e., $\mathcal{Q}_{P_{\mathcal{G}}}=\{\sum_{g\in\mathcal{G}}c_{g}P_{g}:c\in\Delta_{G}\}$. Because $\mathcal{Q}_{P_{\mathcal{G}}}$ is a simplex, its optimum is achieved at a vertex; thus minimizing the worst-case risk over $\mathcal{Q}_{P_{\mathcal{G}}}$ is equivalent to

$$\min_{\theta}\ \Bigl\{\mathbb{R}_{\mathcal{G}}(\theta):=\max_{g\in\mathcal{G}}\ \mathbb{E}_{(\mathbf{x},y)\sim P_{g}}\big[\ell((\mathbf{x},y);\theta)\big]\Bigr\}. \tag{4}$$
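For reference, the worst-group risk of Eq. (4) can be estimated in a few lines given per-sample losses and group ids; this is only an illustrative sketch, not the Group DRO implementation of [26].

```python
import torch

def worst_group_risk(per_sample_loss, group_ids, num_groups):
    """Estimate Eq. (4): the largest average loss over the groups."""
    group_losses = [per_sample_loss[group_ids == g].mean()
                    for g in range(num_groups) if (group_ids == g).any()]
    return torch.stack(group_losses).max()
```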

Different from the group DRO setting, we do not know the group assignment of each training example. Instead, we use the bias pseudo-attribute information, obtained by any clustering algorithm in the feature embedding space, to define groups. Note that the clustering is performed on the representations given by the base model trained without debiasing, which is parameterized by $\tilde{\theta}$. Our goal is to alleviate dataset bias and maximize unbiased accuracy, and we need to consider all groups fairly during optimization. To this end, we assign a proper importance weight $\omega_{k}$ to the $k^{\text{th}}$ cluster, where $k\in\mathcal{K}=\{1,\dots,K\}$, and the final objective of our framework is given by minimizing a weighted empirical risk as follows:

$$\min_{\theta}\ \Bigl\{\mathbb{R}_{\mathcal{K}}(\theta):=\mathbb{E}_{(\mathbf{x},y)\sim P}\Big[\omega_{h((\mathbf{x},y);\tilde{\theta})}\,\ell((\mathbf{x},y);\theta)\Big]\Bigr\}, \tag{5}$$

where $h((\mathbf{x},y);\tilde{\theta})$ denotes the cluster membership of an example $(\mathbf{x},y)$. The details of the weight assignment method will be discussed next.

Algorithm 1: Debiasing with bias pseudo-attributes

1   Require: step size $\eta_{\theta}$, momentum $m$, training steps $T$, batch size $B$, the number of clusters $K$
2   Base model:
3   Initialize $\tilde{\theta}$
4   for $t=1,\dots,T$ do
5       Sample $(\mathbf{x}_{i},y_{i})\sim P$ for $i=1,\dots,B$
6       $\tilde{\theta}\leftarrow\tilde{\theta}-\eta_{\theta}\sum_{i=1}^{B}\nabla\ell((\mathbf{x}_{i},y_{i});\tilde{\theta})$
7   end for
8   for $k=1,\dots,K$ do
9       $P_{k}=\{(\mathbf{x}_{n},y_{n})\ |\ h((\mathbf{x}_{n},y_{n});\tilde{\theta})=k\ \text{for all}\ n\}$
10      $N_{k}=|P_{k}|$
11  end for
12  Target model:
13  Initialize $\theta$ and $\omega_{k}$ for $k=1,\dots,K$
14  for $t=1,\dots,T$ do
15      $\omega_{k}\leftarrow(1-m)\,\omega_{k}+\frac{m}{N_{k}}\,\mathbb{E}_{(\mathbf{x},y)\sim P_{k}}[\ell((\mathbf{x},y);\theta)]$ for $k=1,\dots,K$
16      Sample $(\mathbf{x}_{i},y_{i})\sim P$ for $i=1,\dots,B$
17      $\alpha_{i}=\omega_{h((\mathbf{x}_{i},y_{i});\tilde{\theta})}$
18      $\overline{\alpha}_{i}=\alpha_{i}/\sum_{i=1}^{B}\alpha_{i}$
19      $\theta\leftarrow\theta-\eta_{\theta}\sum_{i=1}^{B}\overline{\alpha}_{i}\nabla\ell((\mathbf{x}_{i},y_{i});\theta)$
20  end for

3.4 Sample weighting with bias pseudo-attributes

Based on our observation described in Section 3.2, we first cluster the training examples of each class in the feature embedding space defined by the sufficiently optimized base model, e.g., trained for 100 epochs using the standard classification loss. We suppose that each cluster corresponds to a bias pseudo-attribute. Among all clusters, we focus on the examples in the minority clusters, especially when they have high average losses. A common failure case in the presence of dataset bias is incurred by ignoring specific subpopulation groups to minimize the overall training loss, and minority clusters are prone to be ignored due to their small sizes. The problematic clusters are the ones that contain many bias-conflicting examples and thus have high losses, resulting in poor worst-case errors. If a minority cluster consists mostly of bias-aligned samples, it will still achieve high classification accuracy.

Therefore, to handle the dataset bias issue, we should consider both the scale and the average difficulty (loss) of each cluster, unlike group DRO [26] and George [30], which focus only on the average loss. We calculate the importance weight of each cluster with our reweighting scheme to train the target model, which is given by

$$\omega_{k}=\frac{\mathbb{E}_{(\mathbf{x},y)\sim P_{k}}[\ell((\mathbf{x},y);\theta)]}{N_{k}}=\frac{\mathbb{E}_{(\mathbf{x},y)\sim P}\big[\ell((\mathbf{x},y);\theta)\ |\ h((\mathbf{x},y);\tilde{\theta})=k\big]}{\sum_{i}\mathds{1}\big(h((\mathbf{x}_{i},y_{i});\tilde{\theta})=k\big)}, \tag{6}$$

where $\theta$ and $\tilde{\theta}$ indicate the parameters of the final and base models, respectively, $h(\cdot,\cdot)$ is the cluster membership function, and $\mathds{1}(\cdot)$ is the indicator function. Note that $P_{k}$ denotes the sample distribution of the $k^{\text{th}}$ cluster and $N_{k}$ is the number of samples in the $k^{\text{th}}$ cluster, where $k\in\mathcal{K}=\{1,\dots,K\}$.
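In code, the unnormalized weight of Eq. (6) is simply the average loss of a cluster divided by its size; the sketch below assumes per-sample losses and cluster assignments are given as tensors and is meant only as an illustration.

```python
import torch

def cluster_importance_weights(per_sample_loss, cluster_ids, num_clusters):
    """omega_k = (average loss of cluster k) / (number of samples in cluster k)."""
    omega = torch.zeros(num_clusters)
    for k in range(num_clusters):
        mask = cluster_ids == k
        n_k = mask.sum()
        if n_k > 0:
            omega[k] = per_sample_loss[mask].mean() / n_k
    return omega
```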

3.5 Algorithm procedure

Algorithm 1 presents the optimization procedure of the proposed framework. We first naïvely train a baseline network (lines 4-7), parameterized by $\tilde{\theta}$. Then, we cluster all training examples based on the features extracted from the network to obtain the membership distribution $P_{k}$ and the size $N_{k}$ of each cluster (lines 8-11). Based on the cluster assignments, we calculate the importance weight $\omega_{k}$ of each cluster using the target model, parameterized by $\theta$, where the weight is updated by an exponential moving average at each iteration (line 15). We finally use the normalized importance weight $\overline{\alpha}_{i}$ of each sample over a mini-batch to train the target model (lines 18-19).
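A minimal PyTorch-style sketch of one target-model update (lines 15-19 of Algorithm 1) is given below; the criterion is assumed to return per-sample losses (reduction='none'), and the expectation over each cluster in line 15 is approximated on the current mini-batch for brevity, which is a simplification rather than the authors' exact implementation.

```python
import torch

def target_update_step(model, criterion, optimizer, batch, batch_cluster_ids,
                       omega, cluster_sizes, momentum=0.3):
    """One reweighted update of the target model (Algorithm 1, lines 15-19)."""
    x, y = batch
    per_sample_loss = criterion(model(x), y)  # shape [B], reduction='none'

    # Line 15: EMA update of omega_k with the average cluster loss scaled by cluster size.
    for k in torch.unique(batch_cluster_ids):
        avg_loss = per_sample_loss[batch_cluster_ids == k].mean().detach()
        omega[k] = (1 - momentum) * omega[k] + momentum * avg_loss / cluster_sizes[k]

    # Lines 17-19: per-sample weights from cluster membership, normalized over the batch.
    alpha = omega[batch_cluster_ids]
    loss = (alpha / alpha.sum() * per_sample_loss).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return omega
```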

Table 1: Unbiased and worst-group results in the presence of spurious correlations between target and bias attributes on the test split of the CelebA dataset. LfF* denotes a variant of LfF [25], which fine-tunes only the classification layer of a trained baseline model, included for additional comparison to ours. Bold and underlined fonts indicate the first and second place among the compared approaches, respectively. All experimental results are the average of three runs.
              Unbiased accuracy (%)                                      Worst-group accuracy (%)
              Unsupervised                  Supervised                   Unsupervised                  Supervised
Target  Bias  Base  LfF  LfF*  BPA (ours)  Group DRO  Base  LfF  LfF*  BPA (ours)  Group DRO
Blond Hair Gender 80.42 59.46 84.89 90.18 91.39 41.02 34.23 57.96 82.54 87.86
Heavy Makeup Gender 71.19 56.34 71.85 73.78 72.70 17.35 30.81 23.87 39.84 21.36
Pale Skin Gender 71.50 78.69 75.23 90.06 90.55 36.64 57.38 43.26 88.60 87.68
Wearing Lipstick Gender 73.90 53.79 73.84 78.28 78.26 31.38 25.52 31.92 46.52 46.08
Young Gender 78.19 45.99 79.58 82.27 82.40 52.79 0.34 57.79 74.33 76.29
Double Chin Gender 64.61 65.46 68.47 82.92 83.19 21.33 28.19 28.24 67.78 72.94
Chubby Gender 67.42 60.03 71.56 83.88 81.90 24.30 7.60 34.09 72.32 72.64
Wearing Hat Gender 93.53 84.56 94.81 96.80 96.84 85.12 69.06 88.31 94.94 94.67
Oval Face Gender 62.70 57.64 62.30 67.18 65.40 29.15 7.40 36.00 55.78 56.84
Pointy Nose Gender 62.10 42.20 63.83 68.90 70.71 25.80 1.05 38.04 52.48 63.76
Straight Hair Gender 70.28 39.57 72.84 79.18 77.04 47.82 1.95 58.53 72.09 66.10
Blurry Gender 73.05 76.70 77.52 88.93 87.05 45.68 43.81 52.35 84.10 82.06
Narrow Eyes Gender 63.18 68.53 67.77 76.39 76.72 27.01 31.81 38.53 73.24 71.47
Arched Eyebrows Gender 69.72 56.17 71.87 74.77 78.30 34.76 26.21 44.97 54.36 69.44
Bags Under Eyes Gender 69.47 44.61 71.86 77.84 75.88 41.65 0.06 49.10 62.55 63.34
Bangs Gender 89.04 41.41 89.04 93.94 94.45 76.91 3.18 82.37 92.21 92.12
Big Lips Gender 60.87 46.74 62.15 66.50 63.70 30.85 31.44 38.54 56.99 47.55
No Beard Gender 73.11 60.12 73.13 79.58 77.86 13.30 11.92 20.00 40.00 36.70
Receding Hairline Gender 69.72 70.57 74.58 84.95 85.15 35.69 32.10 45.53 79.11 79.12
Wavy Hair Gender 73.10 48.00 74.53 79.89 79.65 38.01 0.06 45.24 65.74 66.79
Wearing Earrings Gender 72.17 59.35 74.17 84.57 83.50 26.26 0.10 32.95 72.81 75.24
Wearing Necklace Gender 55.04 58.64 57.21 68.96 62.89 2.72 0.22 6.67 41.93 24.34
Average Gender 72.67 58.65 74.87 81.74 80.87 39.91 21.91 47.88 69.84 69.68

4 Experiments

Table 2: Unbiased and worst-group results on the Waterbirds dataset.
              Unbiased accuracy (%)                                      Worst-group accuracy (%)
              Unsupervised                  Supervised                   Unsupervised                  Supervised
Target  Bias  Base  LfF  LfF*  BPA (ours)  Group DRO  Base  LfF  LfF*  BPA (ours)  Group DRO
Object Place 84.63 85.48 84.57 87.05 88.99 62.39 68.02 61.68 71.39 80.82
Place Object 87.99 85.77 85.05 88.44 89.20 73.34 62.37 60.00 79.16 85.27
Table 3: Unbiased accuracy (%) on the valid split of the Colored-MNIST dataset.
Unsupervised Supervised
Target Bias Baseline LfF BPA (ours) Group DRO
Digit Color 74.48 85.15 85.26 85.88
Color Digit 99.95 99.91 99.82 98.96

4.1 Dataset

CelebA [22] is a large-scale face dataset for facial image recognition, containing 40 attribute annotations for each image. Following the previous works [26, 25], we set hair color and heavy makeup as the target attributes. Note that the gender attribute is spuriously correlated with these two attributes and is employed as the bias attribute for worst-group accuracy evaluation in our experiments. For more comprehensive results, we also consider the other 32 attributes as target attributes.

Waterbirds [26] is a synthesized dataset with 4,795 training examples, created by combining bird photographs from the Caltech-UCSD Birds-200-2011 (CUB) dataset [32] with background images from the Places dataset [40]. There are two attributes in the dataset: one is the type of bird, {waterbird, landbird}, which is the target attribute, and the other is the background place, {water, land}.

The Colored-MNIST dataset [21, 25, 1] is an extension of MNIST with color attributes, where each digit is highly correlated with a certain color. There are 60K training examples and 10K test images, where the ratio of bias-aligned samples, i.e., the samples that can be correctly classified by using the bias attribute (color), is 95%. We follow the protocol employed in [25] for the experiment.

4.2 Implementation details

For CelebA and Waterbirds, we use a ResNet-18 as our backbone network, which is pretrained on ImageNet. We train both the base and target models using the Adam optimizer with a learning rate of $1\times 10^{-4}$, a batch size of 256, and a weight decay rate of 0.01. For the Colored MNIST dataset, we adopt a multi-layer perceptron with three hidden layers, each of which has 100 hidden units. We also employ the same Adam optimizer with a learning rate of $1\times 10^{-3}$. We train the models for 100 epochs for all experiments, and decay the learning rate using cosine annealing [23].
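For reproducibility, the backbone and optimizer setup described above could be assembled roughly as follows; this is a hedged sketch assuming standard torchvision and PyTorch components, not the authors' exact training script.

```python
import torch
import torchvision

def build_model_and_optimizer(num_classes, epochs=100):
    """ResNet-18 pretrained on ImageNet, Adam (lr 1e-4, weight decay 0.01),
    and cosine annealing over 100 epochs, as described above."""
    model = torchvision.models.resnet18(pretrained=True)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return model, optimizer, scheduler
```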

For clustering, we extract features from a separately trained base network with the standard classification loss and perform $k$-means clustering with $K=8$ in all experiments. The cluster weight $\omega_{k}$ of the $k^{\text{th}}$ cluster is updated by an exponential moving average at each iteration with a momentum $m$ of 0.3. All the results reported in our paper are obtained from the average of three runs.

4.3 Evaluation protocol

To evaluate the unbiased accuracy on an imbalanced evaluation set, we measure the accuracy of each group $g=(a_{t},a_{b})$, defined by a pair of target and bias attribute values. We then report the average accuracy over the groups and the worst-group accuracy among all groups, as in [25].
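The two metrics can be computed directly from group-wise accuracies; a short sketch under the assumption that predictions, target labels, and bias-attribute labels are available as NumPy arrays.

```python
import numpy as np

def unbiased_and_worst_group_accuracy(preds, targets, bias_labels):
    """Accuracy per (target, bias) group; the mean is the unbiased accuracy
    and the minimum is the worst-group accuracy."""
    group_accs = []
    for t in np.unique(targets):
        for b in np.unique(bias_labels):
            mask = (targets == t) & (bias_labels == b)
            if mask.any():
                group_accs.append((preds[mask] == targets[mask]).mean())
    return float(np.mean(group_accs)), float(np.min(group_accs))
```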

4.4 Results

We present our main results on the standard benchmarks including CelebA, WaterBirds, and Colored-MNIST. In the rest of this section, our method is denoted by BPA, which stands for bias pseudo-attribute.

CelebA

Before evaluating our framework, we first thoroughly analyze the CelebA dataset in terms of algorithmic bias among the attributes. There are a total of 40 attributes in the CelebA dataset. The bias attribute is fixed to gender, and we analyze its relation to the candidate target attributes given by the remaining 39 attributes. We accept a target attribute if the smallest group given by its combination with the bias attribute contains at least 10 examples in the test split for statistical stability (the removed target attributes are 5 o'clock shadow, bald, rosy cheeks, sideburns, goatee, mustache, and wearing necktie). We suppose that a spurious correlation exists between the target and bias attributes when the baseline model shows a large gap between its overall accuracy and unbiased accuracy (e.g., >5% points). We found that 26 out of 32 attributes have spurious correlations with gender and report the results for those attributes. See our supplementary files for a more detailed analysis.

Table 1 presents the experimental results of the proposed algorithm on the CelebA dataset in comparison to the existing methods as well as the baseline model. Our model significantly outperforms the baseline and LfF [25] for all target attributes in terms of both unbiased and worst-group accuracies. Note that our model is almost competitive with a supervised approach, Group DRO [26], without explicit bias information. On the other hand, we observe that training the model with LfF deteriorates performance even compared to the baseline. This is because it fixes the feature extractor and only trains its classification layer at the end (see the official implementation at https://github.com/alinlab/LfF). To conduct a meaningful comparison with a stable version of LfF, we first train the baseline model used in our experiments for 100 epochs and then fine-tune only the classification layer using the LfF algorithm; this revised version is referred to as LfF* in the rest of this section. Although the performance of LfF* is stable, its improvement by debiasing is still limited compared to Group DRO and our approach. Additional experimental results for other bias attributes are provided in our supplementary documents.

Table 4: Unbiased and worst-group accuracies on the CelebA dataset with the target attributes, where the algorithmic bias does not exist.
              Unbiased accuracy (%)                                      Worst-group accuracy (%)
              Unsupervised                  Supervised                   Unsupervised                  Supervised
Target  Bias  Base  LfF  LfF*  BPA (ours)  Group DRO  Base  LfF  LfF*  BPA (ours)  Group DRO
Attractive Gender 76.05 30.18 75.97 77.90 78.35 63.61 6.09 64.78 65.20 66.30
Smiling Gender 91.66 74.62 91.20 92.08 91.64 88.49 60.09 88.65 90.06 88.48
Mouth Open Gender 93.10 81.85 92.96 93.45 93.64 91.52 66.92 92.44 92.27 91.69
High Cheekbones Gender 83.44 48.40 83.70 84.93 84.52 70.49 7.92 73.56 78.56 78.37
Eyeglasses Gender 98.20 85.47 98.38 98.39 98.65 96.24 76.89 96.85 97.22 97.64
Black Hair Gender 84.92 61.00 85.19 86.57 86.76 75.47 22.04 75.69 81.28 80.67
Average Gender 87.90 63.59 87.92 88.89 88.93 80.97 39.51 82.29 84.10 83.86

Waterbirds

We also evaluate our model on the Waterbirds dataset and present the results in Table 2. As in the CelebA dataset, our model achieves the best accuracies among the unsupervised methods in terms of both the unbiased and worst-group accuracies, and presents comparable results to the supervised method [26]. Our successful results on Waterbirds imply that the proposed method is robust to small-scale datasets as well.

Colored-MNIST

Table 3 shows that our model achieves consistent accuracy in digit classification under color bias. In addition, its color classification performance, where no algorithmic bias exists, is also competitive with the baseline, while the supervised approach falls short in this setting.

Table 5: Unbiased accuracy (%) with multiple bias attributes. For each target, our approach requires only a single model, while Group DRO [26] must train separate models depending on the bias set.
Target Blond Hair Blurry
        Unsupervised               Supervised        Unsupervised               Supervised
Biases  Baseline  LfF*  BPA (ours)  Group DRO  Baseline  LfF*  BPA (ours)  Group DRO
Gender 80.42 84.89 90.18 91.39 73.05 76.70 88.93 87.05
Gender, Heavy Makeup 83.64 88.82 91.90 81.09 75.37 79.55 89.09 72.17
Gender, Wearing Lipstick 80.34 84.13 91.63 85.93 79.88 83.21 89.79 79.88
Gender, Young 78.39 81.21 89.05 87.96 72.97 77.77 88.66 85.39
Gender, No Beard 79.50 82.51 89.92 85.01 78.91 79.84 84.06 81.07
Gender, Wearing Necklace 79.25 81.03 92.62 92.26 71.80 78.07 89.60 85.33
Gender, Big Nose 81.18 84.10 90.58 90.83 71.89 77.11 88.57 87.11
Gender, Smiling 79.75 82.91 89.85 91.73 73.31 78.04 89.32 87.87
Average 80.29 ± 1.71 83.53 ± 2.64 90.79 ± 1.29 87.83 ± 4.10 74.88 ± 3.32 79.08 ± 2.06 88.44 ± 1.98 82.69 ± 5.50

4.5 Analysis

Results with no algorithmic bias

We test our algorithm on unbiased datasets to make sure that it is dependable even in cases without algorithmic bias. The unbiased setting is defined as the configuration where a baseline model shows only a marginal difference between its overall accuracy and unbiased accuracy (e.g., <5% points). Similar to Table 1, we identify a subset of target attributes in CelebA that are not spuriously correlated with gender; 6 out of the 32 attributes qualify. Table 4 illustrates the results for these 6 target attributes, where the accuracy of our approach is the most consistent among the four methods. This implies that our framework can be incorporated into existing recognition models directly, without knowing whether dataset bias is present. Note that color classification with digit bias on Colored-MNIST and background place classification with object bias on Waterbirds also qualify as unbiased settings, where our model gives consistent results.

Multiple bias attributes

Thanks to the unsupervised nature of our method, we can evaluate our model on multi-bias scenarios, where multiple bias attributes exist in the dataset, without modification. Table 5 presents the unbiased results with multiple bias attributes using our method and Group DRO [26], where we additionally report the average and standard deviation over all bias sets to compare the overall effectiveness and robustness. We also present the results of LfF*, the variant of LfF [25] introduced in Table 1. When trained with multiple bias attributes, the accuracy of Group DRO is sensitive to the bias set, while our method achieves stable and superior results for a variety of sets. Note also that our model is applicable to any bias set without additional fine-tuning, while a supervised method must train a separate model for each set.

Ablative results on importance weighting

Table 6: Ablation results on our importance weighting scheme on CelebA with blond hair and gender for target and bias attributes, respectively, in terms of unbiased and worst-group accuracies (%).
Scale  Loss  Unbiased Acc.  Worst-group Acc.
  -      -       80.42          41.02
  -      ✓       83.86          57.44
  ✓      -       89.08          76.55
  ✓      ✓       90.18          82.54

We perform ablative experiments on the CelebA dataset to analyze the effectiveness of our cluster weighting strategy. Table 6 presents the results when only one of the scale $N_{k}$ and the average loss $\mathbb{E}_{(\mathbf{x},y)\sim P_{k}}[\ell((\mathbf{x},y);\theta)]$ is taken into account to compute $\omega_{k}$ for the $k^{\text{th}}$ cluster in Eq. (6). Note that our ablative model with the loss factor only is closely related to [30]. Table 6 also shows that combining both terms plays a crucial role in learning debiased representations, while the scale factor is clearly more important.

Table 7: Unbiased accuracy (%) with bias-unspecified settings. The results are the average of a set of unbiased accuracies, each of which adopts one of the 25 unspecified attributes jointly with the specified bias attribute, gender, as the bias attributes to define groups.
Target Baseline BPA (ours) Group DRO
Blond Hair 79.13 ± 2.72 90.16 ± 3.19 90.82 ± 2.76
Heavy Makeup 70.26 ± 3.84 73.52 ± 3.86 71.57 ± 4.33
Young 77.56 ± 1.80 81.31 ± 2.31 80.56 ± 2.09
Double Chin 62.56 ± 2.22 81.71 ± 3.61 78.46 ± 3.37
Chubby 67.80 ± 2.77 82.36 ± 3.55 79.91 ± 3.43
Wearing Hat 90.80 ± 4.01 95.11 ± 3.10 94.71 ± 3.76
Oval Face 61.77 ± 1.80 66.63 ± 1.63 65.35 ± 1.45
Pointy Nose 63.96 ± 1.42 70.53 ± 1.52 70.16 ± 1.45

Unspecified group shifts

To verify the robustness in another realistic scenario, we test unspecified group shifts, where the group information at test time is not fully provided during training. The bias attribute specified during training, which is exploited by group DRO, is fixed to gender. To evaluate the performance in this setting, we measure a set of unbiased accuracies corresponding to the combinations of the specified bias attribute, gender, and each of the 25 unspecified bias attributes other than the target attribute (there exist 26 out of 39 valid attributes, as described in Section 4.4). Note that the unspecified bias attributes are not introduced during training but used to define groups at test time. Table 7 clearly shows that our model outperforms Group DRO in the bias-unspecified setting, where we only report the average and standard deviation of the set of unbiased accuracies due to space limitations. This implies that, although Group DRO handles group shifts well within the simplex of the specified group distributions, it suffers in worst-case generalization for unspecified group shifts. The proposed approach is free from this issue because it does not use any bias information for training.

Sensitivity analysis on the number of clusters

We conduct an ablation study on the number of clusters used in the feature embedding space to obtain bias pseudo-attributes on the CelebA dataset. We set gender as the bias attribute and evaluate the unbiased accuracies for several target attributes. Figure 2 shows that the results are stable over a wide range of cluster numbers and that the accuracy saturates when $K\geq 4$.

Figure 2: Sensitivity analysis on the number of clusters in our framework on the CelebA dataset.

Feature visualization

Figure 3 illustrates the t-SNE results of instance embeddings for the baseline model (left) and ours (right) on the CelebA dataset for blond hair classification, where we visualize only negative (blond hair = false) examples for effective visualization. Blue and orange indicate the two values of the bias attribute, gender: female and male, respectively. We observe that our model helps to mix the examples of different groups within the same class, which is desirable for debiasing.

Figure 3: t-SNE plots of feature embeddings from the baseline (left) and our model (right) trained to classify hair color. We visualize the distribution of samples that have the same target value (blond hair = false). Blue and orange denote different gender values, female and male, respectively. Our framework helps mix samples that have the same target but different bias attribute values.

5 Conclusion

We presented a generic unsupervised debiasing framework using pseudo-attributes. We observed that examples sampled from the same group are closely located in the feature embedding space. Based on this empirical observation, we claim that it is possible to identify pseudo-attributes by taking advantage of the embedding results even without attribute supervision. Inspired by this fact, we introduced a novel cluster-based weighting strategy for learning debiased representations. We demonstrated the effectiveness of our method on multiple standard benchmarks, where it is even as competitive as the supervised debiasing method, group DRO. We also conducted a thorough analysis of our framework in many realistic scenarios, where our model provides substantial gains consistently.

Potential societal impact and limitation

Machine learning models typically focus on unconditional performance improvement. Hence, they are often exposed to risks caused by dataset and/or algorithmic bias, which need to be carefully addressed to enhance the reliability and robustness of models. This research contributes to a positive societal impact from this point of view. Although the proposed algorithm turns out to be effective for bias identification, there may be blind spots due to unexplored types of bias. Therefore, we believe that the identification of hidden and unobservable biases without prior knowledge is a promising research direction.

Acknowledgments

This work was partly supported by the IITP grants [2021-0-02068, Artificial Intelligence Innovation Hub; 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the National Research Foundation of Korea (NRF) grant [2022R1A2C3012210] funded by the Korean government (MSIT) and by Samsung Electronics Co., Ltd. [IO210917-08957-01].

References

  • [1] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In ICML, 2020.
  • [2] Aharon Ben-Tal, Dick Den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
  • [3] Remi Cadene, Corentin Dancette, Matthieu Cord, Devi Parikh, et al. Rubi: Reducing unimodal biases for visual question answering. In NeurIPS, 2019.
  • [4] Fabio M Carlucci, Antonio D’Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In CVPR, 2019.
  • [5] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018.
  • [6] Sanghyeok Chu, Dongwan Kim, and Bohyung Han. Learning debiased and disentangled representations for semantic segmentation. In NeurIPS, 2021.
  • [7] Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP, 2019.
  • [8] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. In NIPS, 2018.
  • [9] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
  • [10] John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally robust losses for latent covariate mixtures. arXiv preprint arXiv:2007.13982, 2020.
  • [11] Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.
  • [12] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
  • [13] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
  • [14] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
  • [15] Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. Does distributionally robust supervised learning give robust classifiers? In ICML, 2018.
  • [16] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • [17] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust bayesian optimization. In AISTATS, 2020.
  • [18] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In ICCV, 2017.
  • [19] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019.
  • [20] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In ECCV, 2018.
  • [21] Yi Li and Nuno Vasconcelos. Repair: Removing representation bias by dataset resampling. In CVPR, 2019.
  • [22] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [23] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • [24] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. arXiv preprint arXiv:1902.00146, 2019.
  • [25] Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In NeurIPS, 2020.
  • [26] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020.
  • [27] Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In ICML, 2020.
  • [28] Seonguk Seo, Joon-Young Lee, and Bohyung Han. Information-theoretic bias reduction via causal view of spurious correlation. In AAAI, 2022.
  • [29] Seonguk Seo, Yumin Suh, Dongwan Kim, Geeho Kim, Jongwoo Han, and Bohyung Han. Learning to optimize domain specific normalization for domain generalization. In ECCV, 2020.
  • [30] Nimit Sohoni, Jared Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. In NeurIPS, 2020.
  • [31] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
  • [32] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [33] Angelina Wang, Arvind Narayanan, and Olga Russakovsky. REVISE: A tool for measuring and mitigating bias in visual datasets. In ECCV, 2020.
  • [34] Haohan Wang, Zexue He, Zachary L. Lipton, and Eric P. Xing. Learning robust representations by projecting superficial statistics out. In ICLR, 2019.
  • [35] Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In ICCV, 2019.
  • [36] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, 2019.
  • [37] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.
  • [38] Marvin Zhang, Henrik Marklund, Abhishek Gupta, Sergey Levine, and Chelsea Finn. Adaptive risk minimization: A meta-learning approach for tackling group shift. arXiv preprint arXiv:2007.02931, 2020.
  • [39] Yi Zhang and Jitao Sang. Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing. arXiv preprint arXiv:2007.13632, 2020.
  • [40] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 40(6):1452–1464, 2017.

A Full experimental results

Tables A and B present the full results of Table 1 in the main paper, including George [30] and a class weighting method. George [30] is closely related to our ablative model with sample weighting based on the loss only, which is shown in Table 6 of the main paper, while the class weighting approach adjusts the weight of each example depending on the associated class scale (size) to mitigate the class imbalance issue. We also report the gap between the overall accuracy and the unbiased accuracy of the baseline model to indicate the degree of algorithmic bias for each target attribute with gender bias. Bold and underlined fonts indicate the first and second place among the compared approaches, respectively. The proposed approach achieves outstanding performance compared to all other unsupervised methods and is even as competitive as the supervised counterpart [26]. Also, it is surprising that the class weighting method is superior to existing unsupervised debiasing methods including LfF [25] and George [30]. We run all experiments three times and report average accuracies and their standard deviations.

Table A: Unbiased accuracy (%) in the presence of spurious correlations between target and bias attributes on the test split of the CelebA dataset.
                             Unsupervised                                                        Supervised
Target  Gap (%p)  Overall  Baseline  LfF* [25]  George [30]  Class weighting  Ours  Group DRO [26]
Blond Hair -15.28 95.70 80.42 ± 0.51 84.89 ± 0.14 83.13 ± 1.86 83.35 ± 0.85 90.18 ± 0.23 91.39 ± 0.27
Heavy Makeup -19.63 90.82 71.19 ± 0.37 71.85 ± 0.17 70.91 ± 0.77 71.74 ± 0.83 73.78 ± 0.25 72.70 ± 0.71
Pale Skin -25.25 96.75 71.50 ± 1.60 75.23 ± 0.74 78.22 ± 3.75 90.02 ± 0.56 90.06 ± 0.75 90.55 ± 0.84
Wearing Lipstick -18.70 92.60 73.90 ± 0.53 73.84 ± 0.05 78.05 ± 0.98 72.89 ± 1.28 78.28 ± 0.88 78.26 ± 2.73
Young -9.30 87.49 78.19 ± 0.39 79.58 ± 0.14 80.79 ± 0.20 82.13 ± 0.82 82.27 ± 0.65 82.40 ± 0.48
Double Chin -31.32 95.93 64.61 ± 0.82 68.47 ± 0.22 76.23 ± 0.11 82.13 ± 1.43 82.92 ± 0.54 83.19 ± 1.11
Chubby -27.97 95.39 67.42 ± 0.95 71.56 ± 0.52 74.88 ± 1.91 79.64 ± 0.56 83.88 ± 0.36 81.90 ± 0.20
Wearing Hat -5.57 99.10 93.53 ± 0.37 94.81 ± 0.15 95.72 ± 0.71 96.16 ± 0.50 96.80 ± 0.26 96.84 ± 0.46
Oval Face -10.40 73.10 62.70 ± 0.62 62.30 ± 0.21 65.16 ± 0.23 65.13 ± 1.05 67.18 ± 0.82 65.40 ± 0.14
Pointy Nose -11.81 73.91 62.10 ± 0.74 63.83 ± 0.28 61.68 ± 1.59 66.82 ± 2.76 68.90 ± 0.90 70.71 ± 0.28
Straight Hair -12.24 82.52 70.28 ± 1.06 72.84 ± 0.12 77.80 ± 0.19 77.46 ± 0.70 79.18 ± 0.38 77.04 ± 0.70
Blurry -22.98 96.03 73.05 ± 1.28 77.52 ± 0.45 81.28 ± 0.28 87.75 ± 0.87 88.93 ± 0.32 87.05 ± 0.90
Narrow Eyes -23.29 86.47 63.18 ± 1.05 67.77 ± 0.08 68.03 ± 0.11 70.99 ± 0.60 76.39 ± 0.64 76.72 ± 1.98
Arched Eyebrows -12.09 81.81 69.72 ± 0.37 71.87 ± 0.10 73.25 ± 0.29 75.58 ± 1.13 74.77 ± 0.69 78.30 ± 1.79
Bags Under Eyes -14.16 83.63 69.47 ± 0.57 71.86 ± 0.05 74.81 ± 0.38 76.36 ± 1.05 77.84 ± 1.14 75.88 ± 1.18
Bangs -6.37 95.41 89.04 ± 0.47 89.04 ± 0.50 92.62 ± 0.12 93.09 ± 0.29 93.94 ± 0.57 94.45 ± 0.17
Big Lips -8.99 69.86 60.87 ± 0.58 62.15 ± 0.06 64.99 ± 0.13 63.74 ± 0.56 66.50 ± 0.24 63.70 ± 0.44
No Beard -22.73 95.84 73.11 ± 0.90 73.13 ± 0.89 77.90 ± 0.20 77.83 ± 2.29 79.58 ± 0.14 77.86 ± 1.35
Receding Hairline -23.31 93.03 69.72 ± 0.78 74.58 ± 0.21 78.86 ± 0.40 82.97 ± 0.97 84.95 ± 0.49 85.15 ± 1.31
Wavy Hair -9.19 82.29 73.10 ± 0.56 74.53 ± 0.17 77.39 ± 0.15 76.50 ± 0.65 79.89 ± 0.71 79.65 ± 0.63
Wearing Earrings -17.18 89.35 72.17 ± 0.91 74.17 ± 0.33 80.65 ± 0.04 78.65 ± 0.28 84.57 ± 0.69 83.50 ± 0.63
Wearing Necklace -30.73 85.77 55.04 ± 0.59 57.21 ± 0.76 58.79 ± 0.10 67.05 ± 1.37 68.96 ± 0.12 62.89 ± 3.69
Big Nose -14.74 82.44 67.70 ± 1.11 69.75 ± 0.03 71.85 ± 0.18 70.52 ± 1.02 74.21 ± 0.43 73.73 ± 0.27
Brown Hair -8.88 86.95 78.07 ± 0.87 78.93 ± 1.24 83.07 ± 0.07 83.12 ± 0.38 83.83 ± 0.66 84.87 ± 0.07
Bushy Eyebrows -17.02 91.44 74.42 ± 0.91 75.20 ± 0.34 80.99 ± 0.32 82.73 ± 1.21 85.02 ± 0.02 85.43 ± 0.19
Gray Hair -20.54 98.01 77.47 ± 0.67 80.09 ± 0.21 86.10 ± 1.18 90.12 ± 1.12 91.80 ± 0.22 92.52 ± 0.14
Average -16.91 88.52 71.61 73.73 76.66 78.63 80.93 80.46
Table B: Worst-group accuracy (%) in the presence of spurious correlations between target and bias attributes on the test split of the CelebA dataset.
                             Unsupervised                                                        Supervised
Target  Gap (%p)  Overall  Baseline  LfF* [25]  George [30]  Class weighting  Ours  Group DRO [26]
Blond Hair -54.68 95.70 41.02 ± 1.96 57.96 ± 2.00 65.45 ± 15.52 53.58 ± 3.10 82.54 ± 1.22 87.86 ± 0.10
Heavy Makeup -73.47 90.82 17.35 ± 4.60 23.87 ± 2.79 9.09 ± 1.24 28.86 ± 11.91 39.84 ± 2.28 21.36 ± 1.36
Pale Skin -60.11 96.75 36.64 ± 3.53 43.26 ± 1.40 62.03 ± 16.50 85.42 ± 1.70 88.60 ± 1.48 87.68 ± 2.37
Wearing Lipstick -61.22 92.60 31.38 ± 4.27 31.92 ± 0.02 51.04 ± 2.59 27.68 ± 3.45 46.52 ± 1.62 46.08 ± 5.57
Young -34.70 87.49 52.79 ± 1.45 57.79 ± 0.84 65.12 ± 0.88 71.43 ± 1.75 74.33 ± 0.70 76.29 ± 1.96
Double Chin -74.60 95.93 21.33 ± 2.24 28.24 ± 0.46 50.00 ± 0.41 62.43 ± 4.71 67.78 ± 0.91 72.94 ± 1.14
Chubby -71.09 95.39 24.30 ± 3.73 34.09 ± 0.90 58.01 ± 11.04 52.76 ± 2.59 72.32 ± 0.93 72.64 ± 1.70
Wearing Hat -13.98 99.10 85.12 ± 0.31 88.31 ± 0.12 92.93 ± 0.76 93.61 ± 0.32 94.94 ± 0.19 94.67 ± 0.41
Oval Face -43.95 73.10 29.15 ± 2.76 36.00 ± 1.46 38.01 ± 2.63 43.52 ± 6.37 55.78 ± 0.94 56.84 ± 1.83
Pointy Nose -48.11 73.91 25.80 ± 4.03 38.04 ± 1.49 22.63 ± 3.67 47.46 ± 3.75 52.48 ± 0.52 63.76 ± 2.80
Straight Hair -34.70 82.52 47.82 ± 6.75 58.53 ± 1.61 69.23 ± 1.24 68.97 ± 1.15 72.09 ± 0.76 66.10 ± 3.56
Blurry -50.35 96.03 45.68 ± 3.98 52.35 ± 1.18 62.23 ± 1.58 82.30 ± 3.05 84.10 ± 0.73 82.06 ± 2.27
Narrow Eyes -59.46 86.47 27.01 ± 1.30 38.53 ± 0.44 35.16 ± 1.14 52.62 ± 4.11 73.24 ± 0.88 71.47 ± 3.72
Arched Eyebrows -47.05 81.81 34.76 ± 1.86 44.97 ± 0.46 45.64 ± 1.21 52.94 ± 5.28 54.36 ± 1.37 69.44 ± 5.44
Bags Under Eyes -41.98 83.63 41.65 ± 1.01 49.10 ± 0.49 56.28 ± 2.11 59.77 ± 8.13 62.55 ± 0.90 63.34 ± 3.02
Bangs -18.50 95.41 76.91 ± 3.27 82.37 ± 0.52 85.90 ± 0.24 87.91 ± 1.80 92.21 ± 1.24 92.12 ± 1.03
Big Lips -39.01 69.86 30.85 ± 0.62 38.54 ± 0.18 44.51 ± 0.83 43.16 ± 5.62 56.99 ± 3.05 47.55 ± 1.03
No Beard -82.54 95.84 13.30 ± 3.87 20.00 ± 0.00 33.33 ± 5.77 30.00 ± 10.00 40.00 ± 0.00 36.70 ± 5.10
Receding Hairline -57.34 93.03 35.69 ± 0.35 45.53 ± 0.55 57.30 ± 0.90 72.14 ± 2.56 79.12 ± 1.91 79.12 ± 2.11
Wavy Hair -44.28 82.29 38.01 ± 0.85 45.24 ± 0.83 53.17 ± 0.43 49.69 ± 4.65 65.74 ± 1.13 66.79 ± 1.62
Wearing Earrings -63.09 89.35 26.26 ± 4.14 32.95 ± 1.31 52.74 ± 1.10 47.18 ± 4.08 72.81 ± 1.50 75.24 ± 2.10
Wearing Necklace -83.05 85.77 2.72 ± 0.83 6.67 ± 2.07 13.82 ± 0.41 30.36 ± 3.36 41.93 ± 2.47 24.34 ± 7.81
Big Nose -49.25 82.44 33.19 ± 3.97 45.30 ± 0.50 46.22 ± 0.41 49.56 ± 4.79 63.00 ± 4.27 65.08 ± 1.17
Brown Hair -27.37 86.95 59.58 ± 2.55 60.68 ± 3.62 73.20 ± 0.88 70.91 ± 3.09 71.50 ± 0.97 78.92 ± 1.61
Bushy Eyebrows -54.30 91.44 37.14 ± 2.54 52.67 ± 3.14 56.08 ± 0.97 66.92 ± 6.98 74.08 ± 0.75 81.56 ± 3.24
Gray Hair -55.52 98.01 42.49 ± 1.86 48.46 ± 1.09 67.23 ± 2.75 80.00 ± 3.78 83.03 ± 1.37 88.55 ± 1.85
Average -51.68 88.79 36.84 44.67 50.39 58.00 67.76 68.02

B Additional Analysis

Figure A: Heatmap of unbiased accuracy (%) with 4 different methods. Unlike previous tables, we evaluate our model with various bias attributes, in addition to Male (gender), on the CelebA dataset. To be specific, we select 8 attributes and evaluate unbiased accuracies with all possible (target, bias) pairs among the attributes. For each figure, the columns and rows denote bias and target attributes, respectively. Our approach substantially improves unbiased accuracies for various bias attributes consistently.

Unbiased results with various bias attributes

To make our study more comprehensive, we also evaluate our model with various bias attributes, in addition to Male (gender), on the CelebA dataset. Specifically, we select 8 attributes (male, blond hair, heavy makeup, pale skin, wearing lipstick, young, double chin, and chubby) and test our model with all possible (target, bias) pairs among them. Figure A visualizes the experimental results with different methods, including the baseline, George [30], group DRO [26], and our approach, in terms of unbiased accuracy (%). The columns and rows denote bias and target attributes, respectively. As shown in the figure, our model improves unbiased accuracies substantially for various bias attributes; it outperforms the baseline and George [30] and is even as competitive as group DRO [26].

Figure B: Heatmap of the performance gap between overall accuracy and unbiased accuracy (%p) for 4 different methods. We use the same experimental setup as in Figure A. The columns and rows denote bias and target attributes, respectively. In subfigure (a), the larger the performance gap, the more severe the algorithmic bias. As shown in the figure, even for the same target attribute, the gap varies largely depending on the bias attribute. Subfigures (b), (c), and (d) demonstrate that all methods mitigate the algorithmic bias, while our approach is more effective than George.

Algorithmic bias with various bias attributes

In Figure B, we visualize the performance gap between overall accuracy and unbiased accuracy for each method to analyze the degree of algorithmic bias between target and bias attributes. We use the same experimental setting as in Figure A. The larger the performance gap, the more severe the algorithmic bias. Based on the performance gap in Figure B (a), we can thus assess the existence of algorithmic bias on the CelebA dataset; e.g., the target attribute Heavy Makeup is spuriously correlated with the Male and Wearing Lipstick biases but not with the Young, Double Chin, and Chubby biases (as in the main paper, we suppose that algorithmic bias exists between target and bias attributes when a baseline model gives a large performance gap between its overall accuracy and unbiased accuracy, e.g., >5% points). As shown in the figure, even with the same target attribute, the gap varies largely depending on the bias attribute. We also observe that the algorithmic bias is not symmetric; e.g., the target attribute Chubby is spuriously correlated with the Heavy Makeup bias, but not vice versa. Compared to the baseline, all methods reduce the algorithmic bias, while our framework is more effective than George [30] and as competitive as group DRO [26].

Multi-target classification

We also test our framework in another setting, called multi-target classification, where a single backbone model adopts multiple classification heads. To this end, we attach multiple linear classification layers, each corresponding to an individual target, to a shared feature extractor. For evaluation, we calculate the unbiased accuracy for each target attribute separately, where the bias attribute is fixed to gender. Table C presents the multi-target classification results for several target attribute pairs, where our model achieves consistently better results than the compared unsupervised method in terms of unbiased accuracy, while being as competitive as group DRO [26].

Table C: Unbiased accuracy (%) with multi-target classification scenario. In this setting, each model is trained to classify multiple attributes jointly by adopting separate linear branches. The bias attribute is fixed to Male. We report the unbiased accuracy for each target attribute separately.
Unsupervised Supervised
Targets Baseline George [30] Ours Group DRO [26]
Blond Hair / Heavy Makeup 78.92 / 71.46 83.06 / 70.99 89.78 / 72.25 90.38 / 70.94
Blond Hair / Wearing Lipstick 80.76 / 71.91 82.08 / 73.06 89.09 / 77.34 88.86 / 78.45
Straight Hair / Oval Face 69.93 / 60.84 76.77 / 63.84 78.33 / 64.85 76.38 / 64.77
Straight Hair / Big Lips 70.05 / 60.14 69.73 / 63.22 76.03 / 66.39 77.06 / 64.07
Blurry / Pale Skin 73.89 / 68.18 79.51 / 79.45 87.35 / 89.28 87.82 / 85.91
Blurry / Young 76.48 / 77.82 74.66 / 77.16 88.64 / 82.79 88.57 / 82.14