
Label-Aware Distribution Calibration for Long-tailed Classification

Chaozheng Wang, Shuzheng Gao, Pengyun Wang, Cuiyun Gao, Wenjie Pei, Lujia Pan, Zenglin Xu
Abstract

Real-world data usually present long-tailed distributions. Training on imbalanced data tends to make neural networks perform well on head classes but much worse on tail classes. The severe sparsity of training instances for the tail classes is the main challenge, as it results in biased distribution estimation during training. Plenty of efforts have been devoted to ameliorating this challenge, including data re-sampling and synthesizing new training instances for tail classes. However, no prior research has exploited the transferable knowledge from head classes to tail classes for calibrating the distribution of tail classes. In this paper, we suppose that tail classes can be enriched by similar head classes and propose a novel distribution calibration approach named Label-Aware Distribution Calibration (LADC). LADC transfers the statistics from relevant head classes to infer the distribution of tail classes. Sampling from the calibrated distribution further facilitates re-balancing the classifier. Experiments on both image and text long-tailed datasets demonstrate that LADC significantly outperforms existing methods. Visualization also shows that LADC provides a more accurate distribution estimation.

Introduction

Refer to caption
Figure 1: Visualization of the cosine similarities between the means (a) or variances (b) of classes in the feature space of CIFAR10-LT. The feature vectors are computed by the ResNet-32 backbone. The first four classes, i.e., airplane, automobile, bird, and cat, are head classes, while the others are tail classes. The blue rectangles highlight the high similarities between some head classes and tail classes, such as cat and dog, and automobile and truck.
Refer to caption
Figure 2: Overview of our proposed LADC method. In Stage 1, we train the backbone with Representation Enhancement. In Stage 2, we freeze the backbone and modify the classifier by re-sampling with distribution calibration and LWS-Plus training.

Classification tasks such as image classification and natural language classification are fundamental and essential for evaluating the learning ability of neural networks. Generally, neural networks are trained and evaluated on balanced datasets such as CIFAR and ImageNet (Deng et al. 2009). However, real-world datasets are imbalanced and usually present long-tailed distributions, that is, head classes contain many training instances while tail classes have significantly fewer instances. Models trained on long-tailed data suffer obvious performance degradation, e.g., much worse performance on the tail classes compared with the head classes. The main challenge lies in the sparsity of tail classes, which biases the estimated decision boundaries toward the head classes.

Re-weighting methods (Cui et al. 2019; Jamal et al. 2020; Cao et al. 2019) are one popular solution to tackle the challenge. For example, Cui et al. (2019) re-design the training loss function to highlight the contribution of tail classes. Recently, Cao et al. (2019) find that traditional end-to-end re-weighting methods yield unsatisfactory results, so they propose to train the model in a normal way in the first stage and tune the model with deferred re-sampling (DRS) and deferred re-weighting (DRW) after annealing the learning rate. Kang et al. (2019) follow the two-stage training process and decouple the learning of the backbone and the classifier to mitigate the impact of imbalanced data on biasing the classifier. Specifically, for re-balancing the decision boundary, they re-train the classifier (cRT) or learn scaling weights for the classifier (LWS) via class-balanced sampling in the second training stage. Zhong et al. (2021) later propose label-aware smoothing to regularize the classifier. Although these prior studies make some achievements, they do not exploit the inter-class similarity, especially the transferable knowledge from head classes to tail classes.

There are also some studies (Chawla et al. 2002; Li et al. 2021; Kim, Jeong, and Shin 2020) focusing on synthesizing new training instances for tail classes to solve the data sparsity issue. For example, M2m (Kim, Jeong, and Shin 2020) generates minority samples by translating majority samples. MetaSAug (Li et al. 2021) produces new training instances via implicit semantic data augmentation (Wang et al. 2019). However, these studies pay more attention to the surface features, without explicitly exploring the in-depth feature space.

Inspired by the distribution calibration method (DC) in few-shot learning (Yang, Liu, and Xu 2021), which estimates the in-depth feature distribution of the support set by utilizing knowledge from the base set, we propose a novel label-aware distribution calibration method (LADC) for long-tailed scenarios. LADC assumes that the distribution of tail classes can be enriched by similar head classes in the feature space. Following the DC method, LADC also assumes that every dimension in the feature space follows a Gaussian distribution, and that similar classes have similar means and variances of the feature representations. The proposal of LADC is based on the observation that there exists some similarity between head classes and tail classes. As shown in Fig. 1, we can observe that the statistics of the head class automobile are highly relevant to those of the tail class truck. So the feature distribution of the truck class can be better calibrated by virtue of the automobile class, which contains sufficient instances for training.

Specifically, LADC decouples the training process into two stages. In the first training stage, the backbone is trained with data augmentation techniques, including Re-balanced Mixup (Chou et al. 2020) and RandAugment (Cubuk et al. 2020), to enhance representation learning. Based on the trained backbone, LADC computes and records the mean feature and co-variance matrix of each head class. In the second training stage, LADC selects the $m$ most similar head classes to compute the new mean and co-variance of the calibrated distribution for every instance in the tail classes. Finally, new training instances sampled from the calibrated distributions are employed to rectify the classifier obtained in the first stage. In addition, we propose LWS-Plus to further improve the classifier rectification ability. In summary, our contributions are as follows:

  1. We are the first to exploit the transferable knowledge from head classes to tail classes for estimating the feature distributions of tail classes.

  2. We propose a novel method, LADC, for calibrating the distribution of tail classes by borrowing knowledge from head classes. A new sampling strategy is also proposed based on the calibrated distribution to better balance the classifier.

  3. Extensive experiments demonstrate that LADC significantly outperforms state-of-the-art approaches on long-tailed image datasets including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist. Experiments on a long-tailed text classification task indicate that LADC generalizes to other long-tailed modalities.

Methodology

In this section, we elaborate on the proposed method LADC. LADC follows a two-stage training framework, in which a backbone is first trained on the original long-tailed dataset and then fixed in the second stage. Fig. 2 depicts the overview of the proposed LADC. Equipped with the backbone and the associated representation space, we illustrate how LADC estimates the distribution of tail classes by utilizing the statistics of relevant head classes. Then we introduce the sampling strategy based on the calibrated distribution. We finally describe how LADC enhances the training of the backbone and the classifier.

Problem Definition

We follow a typical long-tailed classification setting. We are given an imbalanced dataset $D=\{\bm{x}_{i},y_{i}\}_{i=1}^{N}$ with $N$ training instances, where $\bm{x}_{i}$ denotes an instance and $y_{i}$ denotes its label. Assume that the classes are ordered by cardinality, i.e., if class index $p<q$, then $n_{p}\geq n_{q}$, where $n_{p}$ is the number of training instances of class $p$. Let $\mathcal{F}$ be the backbone and $\Phi$ be the classifier, and let $\bm{z}^{c}$ represent the class embedding of class $c$ in the feature space defined by $\mathcal{F}$. Furthermore, we split the classes into two groups, with head classes denoted as $C_{h}$ and tail classes as $C_{t}$. Note that we simplify the notation by dropping the class index wherever possible.

Label-Aware Distribution Calibration

The DC method (Yang, Liu, and Xu 2021) for few-shot learning focuses on the $N$-way $K$-shot task where $N$ is fixed. Since long-tailed scenarios require models to perform well on both head and tail classes, directly adopting DC is non-optimal. To improve the prediction accuracy of tail classes while preserving the performance on head classes, we propose a novel label-aware distribution calibration approach.

Calibrating distribution of tail classes

Given the trained backbone, LADC calibrates the distributions of tail classes with the help of relevant head classes. We assume that each class is Gaussian distributed and head classes are well represented with sufficient data for training. So the distribution of each head class can be approximated by calculating the mean and co-variance of the associated training data:

p(\bm{z})=N(\bm{z}\,|\,\bm{\mu},\bm{\Sigma})   (1)
\bm{\mu}=\frac{\sum_{i=1}^{n}\mathcal{F}(\bm{x}_{i})}{n}   (2)
\bm{\Sigma}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathcal{F}(\bm{x}_{i})-\bm{\mu})(\mathcal{F}(\bm{x}_{i})-\bm{\mu})^{T}   (3)

where $\bm{\mu}$ and $\bm{\Sigma}$ denote the mean and co-variance of the class, and $n$ denotes the number of training instances in the head class.
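
To make the computation concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3), estimating the per-class mean and co-variance from the backbone features; the array layout and function name are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def head_class_statistics(features, labels, head_classes):
    """features: (N, D) backbone outputs F(x); labels: (N,) class ids."""
    stats = {}
    for c in head_classes:
        feats_c = features[labels == c]            # all training features of class c
        mu = feats_c.mean(axis=0)                  # Eq. (2): class mean
        sigma = np.cov(feats_c, rowvar=False)      # Eq. (3): unbiased co-variance (n - 1)
        stats[c] = (mu, sigma)
    return stats
```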

To more accurately estimate the distribution of each tail class, we employ the statistics of a set of similar head classes $S$ as the prior and compute the posterior by using the feature $\hat{\bm{\mu}}$ of the tail class. The set of similar head classes $S$ is determined by the Euclidean distance in the feature space:

Set^{d}=\{-||\bm{\mu}^{i}-\hat{\bm{\mu}}||^{2}\,|\,i\in C_{h}\}   (4)
S=\{i\,|\,-||\bm{\mu}^{i}-\hat{\bm{\mu}}||^{2}\in topm(Set^{d})\}   (5)

Equipped with $S$, we can compute the prior distribution of $\hat{\bm{\mu}}$:

\bm{w}_{i}=\frac{n_{i}||\bm{\mu}^{i}-\hat{\bm{\mu}}||^{2}}{\sum_{j\in S}n_{j}||\bm{\mu}^{j}-\hat{\bm{\mu}}||^{2}}   (6)
\bm{\mu}_{0}=\sum_{i\in S}\bm{w}_{i}\bm{\mu}^{i},\quad\bm{\Sigma}_{0}=\sum_{i\in S}(\bm{w}_{i})^{2}\bm{\Sigma}^{i}+\alpha\bm{I}   (7)
p(\bm{z})=N(\bm{z}\,|\,\bm{\mu}_{0},\bm{\Sigma}_{0})   (8)

where $\alpha$ is a hyper-parameter that controls the degree of dispersion, and $\bm{w}$ is the weight vector, which is designed so that tail classes learn more from head classes that are more abundant and more similar.
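
A minimal sketch of Eqs. (4)-(8), assuming the `head_class_statistics` helper above and a dictionary `head_counts` of per-class training sizes; it selects the $m$ closest head classes and forms the prior $N(\bm{\mu}_0,\bm{\Sigma}_0)$ with the weights of Eq. (6).

```python
import numpy as np

def calibration_prior(mu_hat, head_stats, head_counts, m=2, alpha=0.1):
    """mu_hat: (D,) tail-class feature; head_stats: {c: (mu, Sigma)}; head_counts: {c: n_c}."""
    # Eqs. (4)-(5): the m head classes with the smallest Euclidean distance to mu_hat
    dist = {c: float(np.sum((mu - mu_hat) ** 2)) for c, (mu, _) in head_stats.items()}
    S = sorted(dist, key=dist.get)[:m]
    # Eq. (6): weights from class size and distance, normalized over S
    raw = np.array([head_counts[c] * dist[c] for c in S])
    w = raw / raw.sum()
    # Eq. (7): weighted prior mean and co-variance plus the dispersion term alpha * I
    mu0 = sum(w_i * head_stats[c][0] for w_i, c in zip(w, S))
    sigma0 = sum(w_i ** 2 * head_stats[c][1] for w_i, c in zip(w, S)) + alpha * np.eye(len(mu_hat))
    return mu0, sigma0
```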

For each tail class, $\mathcal{F}(\bm{x})$ conditioned on $\bm{z}$ is assumed to follow a Gaussian distribution

p(\mathcal{F}(\bm{x})\,|\,\bm{z})=N(\mathcal{F}(\bm{x})\,|\,\bm{z},\bm{L})   (9)

where $\bm{L}$ is the co-variance matrix, which is related to the uncertainty of the trained backbone.

Based on the prior distribution, the posterior (calibrated) distribution of each instance in the tail class can be given by

p(\bm{z}\,|\,\{\mathcal{F}(\bm{x}_{i})\}_{i})=N(\bm{z}\,|\,\bm{\mu}^{\prime},\bm{\Sigma}^{\prime})   (10)

By defining $\bm{L}=\beta\bm{\Sigma}_{0}$, we compute $\bm{\mu}^{\prime}$ and $\bm{\Sigma}^{\prime}$ as follows:

\bm{\mu}^{\prime}=\frac{\beta}{n_{s}+\beta}\bm{\mu}_{0}+\frac{n_{s}}{n_{s}+\beta}\hat{\bm{\mu}}   (11)
\bm{\Sigma}^{\prime}=\frac{\beta}{n_{s}+\beta}\bm{\Sigma}_{0}   (12)

where $n_{s}$ denotes the sample size used for computing $\hat{\bm{\mu}}$ and is set to 1. $\beta$ is a hyper-parameter that can be interpreted as the relative uncertainty of the sample with respect to the prior $\bm{\mu}_{0}$; a higher $\beta$ indicates higher confidence in the prior.
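
Below is a minimal sketch of Eqs. (10)-(12) for a single tail instance ($n_s=1$), followed by drawing new feature-space samples from the calibrated Gaussian; the prior (`mu0`, `sigma0`) comes from the previous sketch and all names are illustrative.

```python
import numpy as np

def calibrate_and_sample(mu_hat, mu0, sigma0, beta=1.0, n_s=1, num_samples=8, seed=0):
    mu_prime = beta / (n_s + beta) * mu0 + n_s / (n_s + beta) * mu_hat   # Eq. (11)
    sigma_prime = beta / (n_s + beta) * sigma0                           # Eq. (12)
    rng = np.random.default_rng(seed)
    # synthetic tail-class features drawn from the calibrated distribution N(mu', Sigma')
    samples = rng.multivariate_normal(mu_prime, sigma_prime, size=num_samples)
    return mu_prime, sigma_prime, samples
```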

Sampling strategy

In the second training stage, to balance the long-tailed distribution, we define the sampling probability of each class to be:

P_{i}=\frac{n_{1}/n_{i}^{\tau}}{\sum_{j}n_{1}/n_{j}^{\tau}}   (13)

where $n_{1}$ refers to the number of training instances of the most represented class, and $\tau$ is a temperature parameter that controls how balanced the sampled distribution is. Note that we re-sample head classes from the original data, while for the tail classes we draw samples from the calibrated distributions.
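
A minimal sketch of Eq. (13); with the class counts sorted in descending order, `counts[0]` is $n_{1}$, which cancels after normalization so that only $\tau$ shapes how balanced the sampling distribution is. Head classes drawn with these probabilities come from the real data, while tail-class draws are realized by sampling features from the calibrated Gaussians above.

```python
import numpy as np

def class_sampling_probs(class_counts, tau=1.2):
    counts = np.asarray(class_counts, dtype=float)   # assumed sorted, counts[0] = n_1
    scores = counts[0] / counts ** tau               # Eq. (13), unnormalized
    return scores / scores.sum()
```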

Table 1: Overview of datasets used in our experiments. The “IF” denotes imbalance factor.
Datasets #Class IF #Train Set Min. Class Size Max. Class Size #Test Set
CIFAR10-LT 10 10∼200 50,000∼11,203 500∼25 5,000 10,000
CIFAR100-LT 100 10∼200 50,000∼9,502 500∼2 500 10,000
ImageNet-LT 1000 256 115,846 5 1280 50,000
iNaturalist2018 8142 500 437,513 2 1,000 24,426
THUNEWS-LT 10 100 44,723 180 18,000 10,000

Re-balancing classifier

In Stage 2, Kang et al. (2019) introduce two methods for classifier adjustment: cRT and LWS. cRT re-trains the classifier completely, while LWS retains the direction of the classifier and only adjusts the scales. In this work, we choose LWS due to its superior performance in our experiments. On the basis of LWS and inspired by logit adjustment work (Menon et al. 2020; Tang, Huang, and Zhang 2020), we add a learnable bias term to the logits, named LWS-Plus, to adjust the classification boundaries more flexibly:

\hat{\Phi}_{i}(\bm{x})=\bm{f}_{i}\cdot\Phi_{i}(\bm{x})+\bm{g}_{i}   (14)

where $\Phi_{i}(\bm{x})$ refers to the logit score of class $i$, $\bm{f}_{i}$ denotes the scaling factor of the classifier magnitude, and $\bm{g}_{i}$ denotes the added bias.
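
A minimal PyTorch sketch of Eq. (14): LWS-Plus keeps the Stage-1 classifier frozen and learns only a per-class scale $\bm{f}$ and bias $\bm{g}$ on its logits. The module layout is an illustrative assumption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LWSPlus(nn.Module):
    def __init__(self, classifier: nn.Linear):
        super().__init__()
        self.classifier = classifier
        for p in self.classifier.parameters():       # keep the classifier direction fixed
            p.requires_grad = False
        num_classes = classifier.out_features
        self.scale = nn.Parameter(torch.ones(num_classes))   # f_i, initialized to 1
        self.bias = nn.Parameter(torch.zeros(num_classes))   # g_i, initialized to 0

    def forward(self, features):
        logits = self.classifier(features)           # Phi_i(x)
        return self.scale * logits + self.bias       # Eq. (14)
```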

Table 2: Top-1 accuracy (%) on CIFAR10-LT and CIFAR100-LT. Best results are marked in bold.
Approach CIFAR10-LT CIFAR100-LT
IF 200 100 50 10 200 100 50 10
Cross-entropy training 65.87 70.14 74.94 86.18 34.70 37.92 44.02 55.73
Class Balance Loss 68.77 72.68 78.13 86.90 35.56 38.77 44.79 57.57
Focal Loss 65.29 70.38 76.71 86.68 35.62 38.41 44.32 55.78
CB Focal Loss 68.15 74.57 79.22 87.48 36.23 39.60 45.21 57.99
LDAM Loss 66.75 73.55 78.83 87.32 36.53 40.60 46.16 57.29
LDAM-DRW 74.74 77.03 81.03 88.16 38.45 42.04 46.62 58.71
MCW with LDAM loss 77.23 80.00 82.23 87.40 39.53 44.08 49.16 58.00
De-confound-TDE - 80.60 83.60 88.50 - 44.10 50.30 59.60
cRT 69.48 73.02 79.56 87.90 38.16 43.30 47.37 57.86
LWS 67.90 72.44 77.77 87.41 37.52 42.97 47.40 58.08
MiSLAS 76.73 82.10 85.70 90.00 43.53 47.00 52.30 63.20
BBN - 79.82 81.18 88.32 - 42.56 47.02 59.12
LDAM-DRW + SSP - 77.83 82.13 88.53 - 43.43 47.11 58.91
Bag of tricks - 80.03 83.59 - - 47.83 51.69 -
SMOTE - 71.50 - 85.70 - 34.00 - 53.80
Major-to-Minor - 79.10 - 87.50 - 43.50 - 57.60
MetaSAug-LDAM 77.35 80.66 84.34 89.68 43.09 48.08 52.27 61.28
LADC 81.56 84.65 87.09 90.81 46.62 50.77 54.94 64.66

Representation Enhancement

In Stage 1, we enhance representation learning with data augmentation. Mixup has proven to be an effective regularization strategy for training deep neural networks (Zhong et al. 2021; Zhang et al. 2021), and Remix (Chou et al. 2020) has more recently been proposed for imbalanced data. In this work, we adopt Remix in the first training stage due to its better performance in our experiments. In addition, we combine it with RandAugment (Cubuk et al. 2020), an augmentation policy with a reduced search space that uses random parameters in place of those tuned by AutoAugment.
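
A simplified sketch of the Stage-1 pipeline, with RandAugment from torchvision in the input transform and a Remix-style mixing step; the `kappa`/`tau_remix` label-assignment rule and its defaults are written from our reading of Chou et al. (2020) and should be treated as assumptions, as should the tensor `class_counts` of per-class training sizes.

```python
import numpy as np
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.RandAugment(),          # randomized augmentation policy (no AutoAugment search)
    T.ToTensor(),
])

def remix(x, y, class_counts, alpha=1.0, kappa=3.0, tau_remix=0.5):
    """x: (B, C, H, W) images; y: (B,) labels; class_counts: (num_classes,) float tensor."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]            # pixel-level mixing as in Mixup
    n_i, n_j = class_counts[y], class_counts[y[perm]]
    lam_y = torch.full_like(n_i, lam)                # label mixing weight, re-balanced below
    if lam < tau_remix:
        lam_y[n_i / n_j >= kappa] = 0.0              # favour the minority label
    if 1 - lam < tau_remix:
        lam_y[n_i / n_j <= 1.0 / kappa] = 1.0
    # training loss: lam_y * CE(logits, y) + (1 - lam_y) * CE(logits, y[perm]), averaged over the batch
    return x_mix, y, y[perm], lam_y
```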

Experiments

Experimental Setup

Datasets

We evaluate LADC on several popular long-tailed classification tasks covering both image and text data. The image datasets include CIFAR10-LT, CIFAR100-LT (Cui et al. 2019), ImageNet-LT (Cao et al. 2019), and iNaturalist (Van Horn et al. 2018). The text dataset is THUNEWS-LT (Sun et al. 2016). Table 1 shows the detailed statistics of the datasets, where "IF" indicates the imbalance factor, i.e., the ratio of the number of training instances in the most-represented class to that in the least-represented class.
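
For reference, a small sketch of how a long-tailed split with a given imbalance factor is commonly constructed, with per-class sizes decaying exponentially from the largest class size down to that size divided by the IF; this follows the usual protocol of Cui et al. (2019) and is an assumption here rather than a description of the exact splits above.

```python
def long_tailed_class_sizes(n_max: int, num_classes: int, imbalance_factor: float):
    """Exponentially decaying class sizes, largest class first."""
    return [int(n_max * imbalance_factor ** (-i / (num_classes - 1)))
            for i in range(num_classes)]

# e.g. long_tailed_class_sizes(5000, 10, 100) decays from 5,000 instances for the
# largest class down to 50 for the smallest, i.e. an imbalance factor of 100.
```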

Baselines

We choose several types of comparison methods: (1) one-stage training methods, including Class-Balanced loss (Cui et al. 2019), Focal loss (Lin et al. 2017), CB-Focal loss (Cui et al. 2019), LDAM loss (Cao et al. 2019), Balanced Softmax (Ren et al. 2020), Meta-class-weight LDAM (Jamal et al. 2020), and the causal model De-confound-TDE (Tang, Huang, and Zhang 2020). For the LDAM loss, we also consider the results with the DRW strategy, i.e., LDAM-DRW. (2) Two-stage training methods: cRT, LWS (Kang et al. 2019), and MiSLAS (Zhong et al. 2021), which reuse the original training data. (3) Generative approaches: Feature Space Augmentation (FSA) (Chu et al. 2020), Major-to-Minor (Kim, Jeong, and Shin 2020), and MetaSAug-LDAM (Li et al. 2021) synthesize new training data. We also include SMOTE (Chawla et al. 2002), a widely-used oversampling method for mitigating class imbalance. (4) Other baselines, including OLTR (Liu et al. 2019), BBN (Zhou et al. 2020), SSP (Yang and Xu 2020), Bag of Tricks (Zhang et al. 2021), and Hybrid-PSC (Wang et al. 2021).

Implementation Details.

In the experiments, we regard the most frequent classes that together account for at least 60% of the total training instances as head classes, and the remaining classes as tail classes. The number of head classes $m$ selected for computing $S$ is chosen from $\{2,3\}$. $\alpha$, $\beta$, and $\tau$ are selected from $0.1\sim0.2$, $0.4\sim1.3$, and $1.2\sim1.3$, respectively. A detailed parameter analysis is provided in the Appendix.
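
A minimal sketch of the head/tail split just described, assuming the class counts are sorted in descending order: the most frequent classes that together cover at least 60% of the training instances form $C_{h}$ and the rest form $C_{t}$.

```python
def split_head_tail(class_counts, coverage=0.6):
    """class_counts: per-class training sizes, sorted largest first."""
    total, cum, num_head = sum(class_counts), 0, 0
    for n in class_counts:
        cum += n
        num_head += 1
        if cum >= coverage * total:        # head classes cover >= 60% of the instances
            break
    return list(range(num_head)), list(range(num_head, len(class_counts)))
```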

For CIFAR10-LT and CIFAR100-LT, we use ResNet-32 (He et al. 2016) as the backbone, following Cao et al. (2019). We first train the backbone with representation enhancement for 400 epochs with five warm-up steps. The base learning rate is 0.1, and we adopt a multi-step learning rate schedule that decays the learning rate by 0.01 at the 320th and 360th epochs. In the second training stage, we freeze the backbone and train the classifier with LADC for 30 epochs, during which the learning rate drops by 0.1 at the 10th and 20th epochs.

For the ImageNet-LT dataset, we choose ResNet-10 and ResNet-50 as the backbone and train for 200 epochs in the first stage. The learning rate is initialized to 0.2 and decays by 0.1 at the 120th and 160th epochs. For the iNaturalist dataset, we train for 100 epochs with ResNet-50 in the first stage, employing the cosine learning rate schedule (Loshchilov and Hutter 2016). The training strategy in the second stage is similar to that for the CIFAR datasets.

We use the SGD optimizer with momentum 0.9 and weight decay $5\cdot10^{-4}$ for all experiments. All experiments are run on a single NVIDIA Tesla V100 with 32GB of graphics memory.

Experiment Results

Main results on CIFAR

Table 2 shows the results on CIFAR10-LT and CIFAR100-LT. As can be seen, the proposed LADC consistently outperforms all the baselines. Compared with the re-weighting approaches (i.e., the top seven baselines), LADC improves the performance by at least 4.31% and 6.54% on average on CIFAR10-LT and CIFAR100-LT, respectively. Compared with the best two-stage approach MiSLAS, LADC shows an improvement of 2.72% on CIFAR100-LT. LADC also performs better than the state-of-the-art generative approach MetaSAug, with average increases of 3.02% and 3.05% on the two datasets, respectively. The results indicate the effectiveness of the proposed approach in long-tailed image classification.

Main results on ImageNet-LT

We conduct experiments on two backbones, i.e., ResNet-10 and ResNet-50. As shown in Table 3, LADC achieves the best performance among all the approaches for different backbones. Specifically, LADC outperforms the baselines by at least 0.47% and 1.13% for the two backbones, respectively. The results imply that representations learned by a larger model could better benefit LADC.

Table 3: Top-1 accuracy (%) on ImageNet-LT. Best results are marked in bold. BALAMs denotes Balanced Softmax loss with meta sampler.
Approach ResNet-10 ResNet-50
Focal Loss 30.50 43.70
LDAM-DRW 40.73 48.80
BALAMs 41.80 50.04
OLTR 35.60 41.90
Remix 37.58 46.19
cRT 41.80 47.30
LWS 41.40 47.70
FSA 35.20 -
cRT+SSP 43.20 51.30
Bag of Tricks 43.13 -
MiSLAS - 51.47
MetaSAug - 50.52
LADC 43.67 52.60

Main Results on iNaturalist

Table 4 presents the results on the large real-world dataset iNaturalist with ResNet-50. We can observe that LADC improves the accuracy over the baseline approaches by 0.58%∼8.21%, which shows the capability of LADC in handling large long-tailed datasets.

Main results on the text dataset THUNEWS-LT

We choose a commonly-used subset of the Chinese news dataset THUCNews (Sun et al. 2016), which contains ten classes (https://github.com/649453932/Chinese-Text-Classification-Pytorch). We build its long-tailed version with an imbalance factor of 100 and use the pre-trained embedding model provided by Li et al. (2018). We choose three models, TextCNN (Kalchbrenner, Grefenstette, and Blunsom 2014), TextRNN (Liu, Qiu, and Huang 2016), and Transformer (Vaswani et al. 2017), as the backbone. We remove the representation enhancement component because the image data augmentation techniques are not applicable to text data. The results in Table 5 show that LADC achieves the highest accuracy, outperforming the best baseline by 0.5%, 1.49%, and 0.99% on the three models, respectively. The results demonstrate the generalizability and effectiveness of LADC in other long-tailed modalities.

Table 4: Top-1 accuracy (%) on iNaturalist2018. All the models are trained up to 100 epochs.
Approach ResNet-50
CE 57.16
CB-Focal 61.12
LDAM-DRW 68.00
BBN 66.29
De-confound-TDE 65.20
cRT+SSP 68.10
Hybrid-PSC 68.10
MetaSAug 68.75
LADC 69.33
Table 5: Top-1 accuracy (%) on Chinese text classification. CB loss and BSCE denote Class Balanced loss and Balanced Softmax, respectively.
Approach TextCNN TextRNN Transformer
CE 78.65 78.79 79.12
CB Loss 80.44 80.83 80.43
LWS 83.97 81.45 82.88
BSCE 84.28 81.06 83.02
LADC 84.78 82.94 84.01
Table 6: Ablation study on CIFAR-100-LT. RE: the representation enhancement module in Stage 1. Sampling: sampling from the distributions calibrated by LADC in Stage 2.
Module CIFAR100-LT
RE Sampling LWS LWS-Plus 200 100 50
34.70 37.92 44.02
38.12 42.30 46.72
45.29 49.22 54.00
46.62 50.77 54.94

Discussion

Ablation Study

We perform an ablation study to analyze the impact of each module in LADC. We follow the same two-stage training implementation throughout this study, and the results are shown in Table 6. Comparing the first two rows of the table, we can see that RE benefits the classification performance, as expected. In addition, we can observe that sampling from the calibrated distribution brings further improvement, as shown by the third row. Comparing the last two rows, we can conclude that the proposed LWS-Plus rectifies the decision boundary better than LWS.

Results Analysis

Following Wang et al. (2020), we divide the classes into three groups according to the number of training instances associated with each class, i.e., Many-shot (>100 instances), Medium-shot (20∼100 instances), and Few-shot (<20 instances). The results of the different groups are shown in Table 7. Compared with the model trained with Cross-Entropy (CE), the baselines mainly focus on boosting the performance of the few-shot group, with slight improvement on the medium-shot group. Besides, all the baselines except for Remix show a performance degradation on the many-shot group. In contrast, the proposed LADC significantly enhances the accuracy on the few-shot and medium-shot groups while achieving performance comparable to CE on the many-shot group. The results indicate the effectiveness of LADC on both head and tail classes.
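
An illustrative helper for this per-group evaluation protocol: classes are binned by training frequency and top-1 accuracy is averaged over the test instances in each bin; the array names and exact boundary handling are assumptions.

```python
import numpy as np

def group_accuracies(preds, labels, train_counts):
    """preds, labels: (N_test,) predicted/true class ids; train_counts: (C,) training sizes."""
    masks = {
        "many": train_counts > 100,
        "medium": (train_counts <= 100) & (train_counts >= 20),
        "few": train_counts < 20,
    }
    correct = preds == labels
    # select the test instances whose ground-truth class falls into each group
    return {name: float(correct[mask[labels]].mean()) for name, mask in masks.items()}
```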

Table 7: Top-1 accuracy (%) of different groups on CIFAR100-LT with imbalance factor 100.
Approach All Many Medium Few
CE 37.9 65.2 36.3 8.0
Remix 39.7 69.4 37.9 7.2
OLTR 41.2 61.8 41.4 17.6
LDAM-DRW 42.0 61.5 41.7 20.2
$\tau$-norm 41.4 58.8 38.1 24.7
cRT 43.3 64.0 44.8 18.1
LADC 50.8 64.5 52.6 32.5

Calibration strategy

To analyze the benefit of our distribution calibration strategy, we compare it with some alternative variants of our method. One variant is LADC-Average, which calibrates the distribution for the whole tail class, i.e., $\hat{\bm{\mu}}$ in Eq. (11) is the mean feature of the tail class. In addition, we compare with the DC method from the few-shot learning field (Yang, Liu, and Xu 2021). Note that the sampling strategy is the same for all the comparisons. The results are listed in Table 8. As can be seen, the proposed LADC significantly outperforms the DC method, which demonstrates that LADC is more effective than DC in the long-tailed scenario. In addition, comparing LADC with LADC-Average, we find that instance-based distribution calibration estimates the class distributions better than class-based calibration.

Comparison with different classifier re-balancing approaches in Stage 2.

Table 9 shows the results of different classifier re-balancing approaches in the second training stage. The experiments are also conducted on CIFAR100-LT. We choose class-balanced sampling (CBS), the state-of-the-art regularization method label-aware smoothing (LAS), and our proposed label-aware distribution calibration, each combined with cRT and LWS, to re-balance the classifier in Stage 2. We can observe that LADC significantly improves over the class-balanced sampling method (CBS). Besides, LADC outperforms LAS under both training approaches in Stage 2. The results show the effectiveness of the proposed distribution calibration method.

Visualization of Generated Samples

Refer to caption
Figure 3: Visualization of distribution estimation on CIFAR10-200 with the ResNet-32 backbone in the feature space.

Fig. 3 presents the visualization of the feature distributions of the training and test sets of CIFAR10-200. The visualization is conducted by adding a fully-connected layer ($D\times2$, where $D$ is the dimension of the features) that projects features into a 2-dimensional space, following Liu et al. (2016). Fig. 3 (a) shows the original training set, where we can see that the tail classes (e.g., labels 7, 8, and 9) have few training instances and their distributions are significantly different from those in the test set, as illustrated in Fig. 3 (c).

After sampling via LADC, we can observe in Fig. 3 (b) that the distributions of the tail classes are noticeably extended and closer to those of the test set. This indicates that the discrepancy between the distributions of the original training set and the test set is alleviated by the sampled instances. Meanwhile, LADC well preserves the distributions of the head classes.
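
A minimal sketch of the visualization setup described above: a trainable $D\times2$ linear projection is placed between the frozen backbone and a 2-dimensional classifier so that features can be plotted directly, following Liu et al. (2016); the module layout and names are illustrative assumptions.

```python
import torch.nn as nn

class Projector2D(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                  # frozen ResNet-32 feature extractor
        self.project = nn.Linear(feat_dim, 2)     # D x 2 projection used for plotting
        self.classifier = nn.Linear(2, num_classes)

    def forward(self, x):
        z2d = self.project(self.backbone(x))
        return self.classifier(z2d), z2d          # logits for training, 2-D points for plotting
```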

Table 8: Study on different distribution calibration strategies.
Approach CIFAR100-LT
200 100 50 10
DC 44.10 48.35 51.69 62.59
LADC-Average 41.74 48.01 47.25 62.34
LADC (ours) 46.62 50.77 54.94 64.66
Table 9: Comparison of different classifier modification approaches in Stage 2, given a backbone trained with the representation enhancement module. CBS and LAS denote class-balanced sampling and label-aware smoothing, respectively.
Approach CIFAR100-LT
Imbalance Factor 200 100 50
cRT CBS 45.20 50.14 54.18
LAS 45.87 50.02 53.10
LADC 46.92 50.13 55.24
LWS CBS 43.77 48.23 53.04
LAS 44.02 48.21 52.59
LADC 45.29 49.22 54.00

Related work

Re-sampling and Re-weighting. Re-sampling incorporates two types of approaches, i.e., oversampling tail classes (Shen, Lin, and Huang 2016; Buda, Maki, and Mazurowski 2018; Byrd and Lipton 2019) and undersampling head classes (Buda, Maki, and Mazurowski 2018; Japkowicz and Stephen 2002). Oversampling tail classes augments the training data of tail classes, but may cause severe overfitting on the tail classes. Undersampling head classes can prevent head classes from dominating the training process; however, it inevitably degrades the generalization ability of models due to the reduction of head-class data.

Broadly speaking, a group of works aims to mitigate the issue of long-tailed distributions by re-weighting training instances in the objective function (Cui et al. 2019; Tan et al. 2020; Lin et al. 2017; Ren et al. 2020; Cao et al. 2019; Tan et al. 2021). For example, Focal loss (Lin et al. 2017) determines the weights according to the model's confidence in the training instances. Several other works determine the weights in inverse proportion to the class frequency: Cui et al. (2019) calculate the weights according to the effective number of each class, and, inspired by the generalization benefit of adding margins, Cao et al. (2019) assign a label-aware margin to each class. While the re-weighting methods in Cui et al. (2019) and Cao et al. (2019) determine class weights in a pre-defined manner, meta-class-weight (Jamal et al. 2020) adopts learnable class weights via meta-learning. Although these re-weighting methods enable the training of neural networks in an end-to-end fashion, they result in sub-optimal performance compared with two-stage training methods, potentially due to the distortion in representation learning caused by early re-weighting.

Two-stage training. Cao et al. (2019) propose a two-stage training method in which re-weighting or re-sampling is only introduced in the second stage. Decoupled learning is another form of two-stage training, introduced by Kang et al. (2019), who claim that training on imbalanced datasets biases the classifier rather than the representation learning. In the first stage, Kang et al. (2019) train a backbone using the traditional cross-entropy loss. Then the backbone is fixed and the classifier is re-balanced by one of the following methods: classifier re-training (cRT), classifier normalization ($\tau$-norm), and learnable classifier weight scaling (LWS). MiSLAS by Zhong et al. (2021) is also a decoupled learning approach, which utilizes label-aware smoothing to regularize the classifier. Although simple and effective, these two-stage methods leave some issues unaddressed. For example, class-balanced sampling can still cause over-fitting to tail classes during the second training stage. Another issue is the distribution mismatch of tail classes. Our method aims to address these issues via label-aware distribution calibration and generative oversampling.

Data synthesis approaches. Synthesizing new instances has been widely used to construct a balanced dataset. SMOTE, proposed by Chawla et al. (2002), generates new data through convex combinations of a data point and its neighbors, and several variants have been introduced based on it (Han, Wang, and Mao 2005; Mullick, Datta, and Das 2019). Although these methods can potentially alleviate over-fitting, they cannot address the distribution mismatch issue of tail classes. More recently, works have been proposed to generate data for tail classes by relying on head classes. For example, M2m (Kim, Jeong, and Shin 2020) synthesizes data for tail classes by translating samples from head classes, and MetaSAug (Li et al. 2021) combines implicit semantic data augmentation and meta-learning to generate new training instances. These generative methods are computationally complicated and still need to be combined with re-weighting methods.

Conclusion

In this paper, we propose a novel label-aware distribution calibration (LADC) method for long-tailed classification. LADC calibrates the distribution of tail classes by borrowing knowledge from well-represented head classes. Based on the decoupled learning framework, we also explore representation enhancement and classifier re-balancing techniques. Extensive experiments on several long-tailed image and text classification datasets demonstrate the effectiveness of LADC.

References

  • Buda, Maki, and Mazurowski (2018) Buda, M.; Maki, A.; and Mazurowski, M. A. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106: 249–259.
  • Byrd and Lipton (2019) Byrd, J.; and Lipton, Z. 2019. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, 872–881. PMLR.
  • Cao et al. (2019) Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; and Ma, T. 2019. Learning imbalanced datasets with label-distribution-aware margin loss. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 1567–1578.
  • Chawla et al. (2002) Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16: 321–357.
  • Chou et al. (2020) Chou, H.-P.; Chang, S.-C.; Pan, J.-Y.; Wei, W.; and Juan, D.-C. 2020. Remix: Rebalanced Mixup. In European Conference on Computer Vision, 95–110. Springer.
  • Chu et al. (2020) Chu, P.; Bian, X.; Liu, S.; and Ling, H. 2020. Feature space augmentation for long-tailed data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, 694–710. Springer.
  • Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 702–703.
  • Cui et al. (2019) Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9268–9277.
  • Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. Ieee.
  • Han, Wang, and Mao (2005) Han, H.; Wang, W.-Y.; and Mao, B.-H. 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, 878–887. Springer.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Jamal et al. (2020) Jamal, M. A.; Brown, M.; Yang, M.-H.; Wang, L.; and Gong, B. 2020. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7610–7619.
  • Japkowicz and Stephen (2002) Japkowicz, N.; and Stephen, S. 2002. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5): 429–449.
  • Kalchbrenner, Grefenstette, and Blunsom (2014) Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
  • Kang et al. (2019) Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; and Kalantidis, Y. 2019. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217.
  • Kim, Jeong, and Shin (2020) Kim, J.; Jeong, J.; and Shin, J. 2020. M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13896–13905.
  • Li et al. (2021) Li, S.; Gong, K.; Liu, C. H.; Wang, Y.; Qiao, F.; and Cheng, X. 2021. MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition. arXiv preprint arXiv:2103.12579.
  • Li et al. (2018) Li, S.; Zhao, Z.; Hu, R.; Li, W.; Liu, T.; and Du, X. 2018. Analogical reasoning on chinese morphological and semantic relations. arXiv preprint arXiv:1805.06504.
  • Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
  • Liu, Qiu, and Huang (2016) Liu, P.; Qiu, X.; and Huang, X. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.
  • Liu et al. (2016) Liu, W.; Wen, Y.; Yu, Z.; and Yang, M. 2016. Large-Margin Softmax Loss for Convolutional Neural Networks. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, 507–516. JMLR.org.
  • Liu et al. (2019) Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; and Yu, S. X. 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2537–2546.
  • Loshchilov and Hutter (2016) Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  • Menon et al. (2020) Menon, A. K.; Jayasumana, S.; Rawat, A. S.; Jain, H.; Veit, A.; and Kumar, S. 2020. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314.
  • Mullick, Datta, and Das (2019) Mullick, S. S.; Datta, S.; and Das, S. 2019. Generative adversarial minority oversampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1695–1704.
  • Ren et al. (2020) Ren, J.; Yu, C.; Sheng, S.; Ma, X.; Zhao, H.; Yi, S.; and Li, H. 2020. Balanced Meta-Softmax for Long-Tailed Visual Recognition. arXiv preprint arXiv:2007.10740.
  • Shen, Lin, and Huang (2016) Shen, L.; Lin, Z.; and Huang, Q. 2016. Relay backpropagation for effective learning of deep convolutional neural networks. In European conference on computer vision, 467–482. Springer.
  • Sun et al. (2016) Sun, M.; Li, J.; Guo, Z.; Yu, Z.; Zheng, Y.; Si, X.; and Liu, Z. 2016. Thuctc: an efficient chinese text classifier. GitHub Repository.
  • Tan et al. (2021) Tan, J.; Lu, X.; Zhang, G.; Yin, C.; and Li, Q. 2021. Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1685–1694.
  • Tan et al. (2020) Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; and Yan, J. 2020. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11662–11671.
  • Tang, Huang, and Zhang (2020) Tang, K.; Huang, J.; and Zhang, H. 2020. Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect. Advances in Neural Information Processing Systems, 33.
  • Van Horn et al. (2018) Van Horn, G.; Mac Aodha, O.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. 2018. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8769–8778.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
  • Wang et al. (2021) Wang, P.; Han, K.; Wei, X.-S.; Zhang, L.; and Wang, L. 2021. Contrastive Learning based Hybrid Networks for Long-Tailed Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 943–952.
  • Wang et al. (2020) Wang, X.; Lian, L.; Miao, Z.; Liu, Z.; and Yu, S. 2020. Long-tailed Recognition by Routing Diverse Distribution-Aware Experts. In International Conference on Learning Representations.
  • Wang et al. (2019) Wang, Y.; Pan, X.; Song, S.; Zhang, H.; Huang, G.; and Wu, C. 2019. Implicit semantic data augmentation for deep networks. Advances in Neural Information Processing Systems, 32: 12635–12644.
  • Yang, Liu, and Xu (2021) Yang, S.; Liu, L.; and Xu, M. 2021. Free lunch for few-shot learning: Distribution calibration. arXiv preprint arXiv:2101.06395.
  • Yang and Xu (2020) Yang, Y.; and Xu, Z. 2020. Rethinking the value of labels for improving class-imbalanced learning. arXiv preprint arXiv:2006.07529.
  • Zhang et al. (2021) Zhang, Y.; Wei, X.-S.; Zhou, B.; and Wu, J. 2021. Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 3447–3455.
  • Zhong et al. (2021) Zhong, Z.; Cui, J.; Liu, S.; and Jia, J. 2021. Improving Calibration for Long-Tailed Recognition. arXiv preprint arXiv:2104.00466.
  • Zhou et al. (2020) Zhou, B.; Cui, Q.; Wei, X.-S.; and Chen, Z.-M. 2020. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9719–9728.