
ABC: Auxiliary Balanced Classifier for Class-imbalanced Semi-supervised Learning

Hyuck Lee    Seungjae Shin    Heeyoung Kim
Department of Industrial and Systems Engineering, KAIST
{dlgur0921, tmdwo0910, heeyoungkim}@kaist.ac.kr
Abstract

Existing semi-supervised learning (SSL) algorithms typically assume class-balanced datasets, although the class distributions of many real-world datasets are imbalanced. In general, classifiers trained on a class-imbalanced dataset are biased toward the majority classes. This issue becomes more problematic for SSL algorithms because they utilize the biased prediction of unlabeled data for training. However, traditional class-imbalanced learning techniques, which are designed for labeled data, cannot be readily combined with SSL algorithms. We propose a scalable class-imbalanced SSL algorithm that can effectively use unlabeled data, while mitigating class imbalance by introducing an auxiliary balanced classifier (ABC) of a single layer, which is attached to a representation layer of an existing SSL algorithm. The ABC is trained with a class-balanced loss of a minibatch, while using high-quality representations learned from all data points in the minibatch using the backbone SSL algorithm to avoid overfitting and information loss. Moreover, we modify consistency regularization, a recent SSL technique for utilizing unlabeled data, to train the ABC to be balanced among the classes by selecting unlabeled data with the same probability for each class. The proposed algorithm achieves state-of-the-art performance in various class-imbalanced SSL experiments using four benchmark datasets.

1 Introduction

Recently, numerous deep neural network (DNN)-based semi-supervised learning (SSL) algorithms have been proposed to improve the performance of DNNs by utilizing unlabeled data when only a small amount of labeled data is available. These algorithms have shown effective performance in various tasks. However, most existing SSL algorithms assume class-balanced datasets, whereas the class distributions of many real-world datasets are imbalanced. It is well known that classifiers trained on class-imbalanced data tend to be biased toward the majority classes. This issue can be more problematic for SSL algorithms that use predicted labels of unlabeled data for their training, because the labels predicted by the algorithm trained on class-imbalanced data become even more severely imbalanced [18]. For example, Figure 1 (b) presents biased predictions of ReMixMatch [3], a recent SSL algorithm, trained on CIFAR-10-LT, a class-imbalanced dataset in which Class 0 has 100 times as many data points as Class 9, as depicted in Figure 1 (a). Although there are various class-imbalanced learning techniques, they are usually designed for labeled data, and thus cannot be simply combined with SSL algorithms under class-imbalanced SSL (CISSL) scenarios. Recently, a few CISSL algorithms have been proposed, but the CISSL problem is still underexplored.

 (a) Class-imbalanced training set      (b) ReMixMatch       (c) Proposed algorithm
Figure 1: Predictions on a class-balanced test set using ReMixMatch (b) and the proposed algorithm (c) trained on a class-imbalanced training set (a).

We propose a new CISSL algorithm that can effectively use unlabeled data, while mitigating class imbalance by using an existing DNN-based SSL algorithm [3, 29] as the backbone and introducing a single-layer auxiliary balanced classifier (ABC). The ABC is attached to the representation layer immediately preceding the classification layer of the backbone, based on the argument that a classification algorithm (i.e., the backbone) can learn high-quality representations even if its classifier is biased toward the majority classes [17]. The ABC is trained to be balanced across all classes by using a mask that rebalances the class distribution, similar to re-sampling in previous CIL studies [2, 7, 13, 16]. Specifically, the mask stochastically regenerates a class-balanced subset of a minibatch on which the ABC is trained. The ABC is trained simultaneously with the backbone, so that the ABC can use high-quality representations learned by the backbone from all data points in the minibatch. In this way, the ABC can overcome the limitations of previous resampling techniques, namely overfitting to minority-class data and loss of information on majority-class data [6, 27].

Moreover, to place decision boundaries in low-density regions by utilizing unlabeled data, we use consistency regularization, a recent SSL technique, which enforces the classification outputs of two augmented or perturbed versions of the same unlabeled example to remain unchanged. In particular, we encourage the ABC to be balanced across classes when using consistency regularization by selecting unlabeled examples with the same probability for each class using a mask. Figure 1 (c) illustrates that compared to the results of ReMixMatch in Figure 1 (b), the class distribution of the predicted labels becomes more balanced using the proposed algorithm trained on the same dataset. Our experimental results under various scenarios demonstrate that the proposed algorithm achieves state-of-the-art performance. Through qualitative analysis and an ablation study, we further investigate the contribution of each component of the proposed algorithm. The code for the proposed algorithm is available at https://github.com/LeeHyuck/ABC.

2 Related Work

Semi-supervised learning (SSL) Recently, several SSL techniques that utilize unlabeled data have been proposed. Entropy minimization [12] encourages the classifier outputs to have low entropy for unlabeled data, as in pseudo-labeling [22]. Mixup regularization [4, 32] pushes the decision boundaries farther from the data clusters by encouraging the prediction for an interpolation of two inputs to match the interpolation of the predictions for those inputs. Consistency regularization [26, 24, 30] encourages a classifier to produce similar predictions for perturbed versions of the same unlabeled input. To create perturbed unlabeled inputs, various data augmentation techniques have been used. For example, FixMatch [29] and ReMixMatch [3] used strong augmentation methods such as Cutout [10] and RandAugment [8]. FixMatch and ReMixMatch are used as the backbone of the proposed algorithm; they are described in Section 3.2.

Class-imbalanced learning (CIL) As a popular approach for CIL, re-sampling techniques [16, 7, 2, 13] balance the number of training samples for each class in the training set. As another popular approach, re-weighting techniques [23, 14, 33] re-weight the loss for each class by a factor inversely proportional to the number of data points belonging to that class. Although these approaches are simple, they have some drawbacks. For example, oversampling from minority classes can cause overfitting, whereas undersampling from majority classes can cause information loss [6]. In the case of re-weighting, gradients can become abnormally large when the class imbalance is severe, resulting in unstable training [6, 1]. Many attempts have been made to alleviate these problems, such as re-weighting based on the effective number of samples [9] and meta-learning-based re-weighting [28, 15]. New forms of losses have also been proposed [6, 27]. In [36, 19], knowledge is transferred from the data of majority classes to the data of minority classes. These CIL algorithms were designed for labeled data and require label information; thus, they are not applicable to unlabeled data. In [17], it was found that biased classification is mainly due to the classification layer and that a classification algorithm can learn meaningful representations even from a class-imbalanced training set. Based on this finding, we design the ABC to use high-quality representations learned from class-imbalanced data utilizing FixMatch [29] and ReMixMatch [3].

Class-imbalanced semi-supervised learning (CISSL) There have been few studies on CISSL. In [35], it was found that more accurate decision boundaries can be obtained in class-imbalanced settings through self-supervised learning and semi-supervised learning. DARP [18] refines biased pseudo-labels by solving a convex optimization problem. CReST [34], a recent self-training technique, mitigates class imbalance by using pseudo-labeled unlabeled data points classified as minority classes with a higher probability than those classified as majority classes.

3 Methodology

3.1 Problem setting

Suppose that we have a labeled dataset $\mathcal{X}=\{(x_n,y_n): n\in(1,\ldots,N)\}$, where $x_n\in\mathbb{R}^d$ is the $n$th labeled data point and $y_n\in\{1,\ldots,L\}$ is the corresponding label. We also have an unlabeled dataset $\mathcal{U}=\{(u_m): m\in(1,\ldots,M)\}$, where $u_m\in\mathbb{R}^d$ is the $m$th unlabeled data point. We express the ratio of the amount of labeled data as $\beta=\frac{N}{M+N}$. Generally, $\beta<0.5$, because label acquisition is costly and laborious. We denote the number of labeled data points of class $l$ as $N_l$, i.e., $\sum_{l=1}^{L}N_l=N$, and assume that the $L$ classes are sorted according to cardinality in descending order, i.e., $N_1\geq N_2\geq\cdots\geq N_L$. We denote the ratio of class imbalance as $\gamma=\frac{N_1}{N_L}$. Under class-imbalanced scenarios, $\gamma\gg 1$. Following previous CIL studies, we define the half of the classes containing a large amount of data as the majority classes, and the other half, containing a small amount of data, as the minority classes. Following [34], we assume that $\mathcal{X}$ and $\mathcal{U}$ share the same class distribution, i.e., the labeled and unlabeled datasets are class-imbalanced to the same extent. From $\mathcal{X}$ and $\mathcal{U}$, we generate minibatches $\mathcal{MB}_{\mathcal{X}}=\{(x_b,y_b): b\in(1,\ldots,B)\}\subset\mathcal{X}$ and $\mathcal{MB}_{\mathcal{U}}=\{(u_b): b\in(1,\ldots,B)\}\subset\mathcal{U}$ for each iteration of training, where $B$ is the minibatch size. Using these minibatches for training, we aim to learn a model $f:\mathbb{R}^d\rightarrow\{1,\ldots,L\}$ that performs effectively on a class-balanced test set.
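As a concrete illustration of these quantities, the sketch below computes $\beta$ and $\gamma$ from hypothetical per-class labeled counts; the counts and the unlabeled-set size are made up for the example, and only the definitions follow this section.

```python
import numpy as np

# Hypothetical per-class labeled counts N_1 >= ... >= N_L (made up for illustration),
# together with an unlabeled-set size M; only the definitions follow Section 3.1.
class_counts = np.array([1000, 599, 359, 215, 129, 77, 46, 27, 16, 10])
N = int(class_counts.sum())   # total number of labeled data points
M = 4 * N                     # unlabeled data is typically far more plentiful

beta = N / (M + N)                          # ratio of the amount of labeled data
gamma = class_counts[0] / class_counts[-1]  # class-imbalance ratio N_1 / N_L

assert beta < 0.5 and gamma >= 1  # here beta = 0.2 and gamma = 100
```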

3.2 Backbone SSL algorithm

We attach the ABC to the backbone’s representation layer, so that it can utilize the high-quality representations learned by the backbone. We use FixMatch [29] or ReMixMatch [3] as the backbone, as these two have achieved state-of-the-art SSL performance. FixMatch uses a classification loss calculated from the weakly augmented labeled data point $\alpha(x_b)$, generated by flipping and cropping the image, and a consistency regularization loss calculated from the weakly augmented unlabeled data point $\alpha(u_b)$ and the strongly augmented unlabeled data point $\mathcal{A}(u_b)$, generated by Cutout [10] and RandAugment [8]. ReMixMatch predicts the class label of the weakly augmented unlabeled data point $\alpha(u_b)$ using distribution alignment and sharpening, and assigns the predicted label to the strongly augmented unlabeled data point $\mathcal{A}(u_b)$. The strongly augmented unlabeled data point $\mathcal{A}(u_b)$ and strongly augmented labeled data point $\mathcal{A}(x_b)$ are then used for mixup regularization. ReMixMatch also conducts consistency regularization in a manner similar to FixMatch, as well as self-supervised learning using image rotation [11, 39]. FixMatch and ReMixMatch have greatly improved SSL performance by learning high-quality representations using strong data augmentation. However, these algorithms can be significantly biased toward the majority classes in class-imbalanced settings.

Using FixMatch and ReMixMatch as the backbone of the proposed algorithm, we ensure that the ABC enjoys high-quality representations learned by the backbone, while replacing the backbone’s biased classifier. To train the ABC, we reuse the weakly augmented data and strongly augmented data used by the backbone to decrease the computational cost. Although we use FixMatch and ReMixMatch as the backbone in this study, the ABC can also be combined with other DNN-based SSL algorithms, as long as they use weakly augmented data and strongly augmented data.

3.3 ABC for class-imbalanced semi-supervised learning

To train the ABC to be balanced, we first generate a 0/1 mask $M(x_b)$ for each labeled data point $x_b$ using a Bernoulli distribution $\mathcal{B}(\cdot)$ with the parameter set to be inversely proportional to the number of data points of each class. This setting makes $\mathcal{B}(\cdot)$ generate mask 1 with high probability for the data points in the minority classes, but with low probability for those in the majority classes. Then, the classification loss is multiplied by the generated mask, so that the ABC is trained with a balanced classification loss. Multiplying the classification loss by the 0/1 mask can be interpreted as oversampling the data points in the minority classes and undersampling those in the majority classes. In representation learning, oversampling and undersampling techniques have shown overfitting and information loss problems, respectively. In contrast, the ABC can overcome these problems because it uses the representations learned by the backbone, which is trained on all data points in the minibatch. The use of the 0/1 mask to construct the balanced loss, instead of directly creating a balanced subset, allows the backbone and the ABC to be trained from the same minibatches. Therefore, the representations of minibatches calculated for training the backbone can be reused for training the ABC. Consequently, the proposed algorithm requires only a slightly increased time cost compared to training the backbone alone. This is confirmed in Section 4.3. The overall procedure of balanced training with the 0/1 mask for the ABC attached to a representation layer of the backbone is presented in Figure 2. The classification loss for the ABC, $L_{cls}$, with the 0/1 mask $M(\cdot)$ is expressed as

$$L_{cls}=\frac{1}{B}\sum_{b=1}^{B}M(x_b)\,\mathbf{H}\big(p_s(y\,|\,\alpha(x_b)),\,p_b\big), \quad (1)$$
$$M(x_b)=\mathcal{B}\!\left(\frac{N_L}{N_{y_b}}\right), \quad (2)$$

where $\mathbf{H}$ is the standard cross-entropy loss, $\alpha(x_b)$ is an augmented labeled data point, $p_s(y|\alpha(x_b))$ is the class distribution predicted by the ABC for $\alpha(x_b)$, and $p_b$ is the one-hot label for $x_b$.
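Eqs. (1)-(2) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name and the use of explicit probability arrays are our own assumptions, and the real algorithm draws a fresh Bernoulli mask per minibatch over DNN outputs.

```python
import numpy as np

def masked_classification_loss(probs, labels, class_counts, rng):
    """Cross-entropy weighted by a 0/1 Bernoulli mask, as in Eqs. (1)-(2).

    probs: (B, L) class distributions p_s(y | alpha(x_b)) from the ABC head.
    labels: (B,) integer labels y_b.
    class_counts: (L,) per-class counts N_l, with class_counts[-1] = N_L.
    """
    B = probs.shape[0]
    # Bernoulli parameter N_L / N_{y_b}: 1 for the smallest class,
    # small for the largest classes.
    p_mask = class_counts[-1] / class_counts[labels]
    mask = (rng.random(B) < p_mask).astype(float)      # 0/1 mask M(x_b)
    ce = -np.log(probs[np.arange(B), labels] + 1e-12)  # standard cross-entropy H
    return float((mask * ce).sum() / B)                # Eq. (1)
```

With `class_counts = [1000, 10]`, for example, a point from the majority class is kept with probability 0.01 while a point from the smallest class is always kept, mirroring over/undersampling without removing any data from the backbone's own loss.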

Figure 2: Overall procedure for balanced training of the ABC with a 0/1 mask

3.4 Consistency regularization for ABC

To increase the margin between the decision boundary and the data points using unlabeled data, we conduct consistency regularization for the ABC, similar to FixMatch. Specifically, we first obtain the predicted class distribution $p_s(y|\alpha(u_b))$ for a weakly augmented unlabeled data point $\alpha(u_b)$ using the ABC and use it as a soft pseudo-label $q_b$. Then, for two strongly augmented unlabeled data points $\mathcal{A}_1(u_b)$ and $\mathcal{A}_2(u_b)$, we train the ABC so that their predicted class distributions, $p_s(y|\mathcal{A}_1(u_b))$ and $p_s(y|\mathcal{A}_2(u_b))$, are close to $q_b$.

In class-imbalanced settings, because most unlabeled data points belong to majority classes, most weakly augmented unlabeled data points can be predicted as the majority classes. Then, consistency regularization would be conducted with a higher frequency for the majority classes, which can cause a classifier to be biased toward the majority classes. To prevent this issue, we conduct consistency regularization in a modified manner that is suitable for class-imbalance problems. Specifically, whereas FixMatch minimizes entropy by converting the predicted class distribution for a weakly augmented data point into a one-hot pseudo-label, we directly use the predicted class distribution as a soft pseudo-label. We do not pursue entropy minimization for the ABC because it can accelerate biased classification toward certain classes. Moreover, we once again generate a 0/1 mask $M(\cdot)$ for each unlabeled data point $u_b$ based on its soft pseudo-label $q_b$, and multiply the consistency regularization loss for $u_b$ by the generated mask, so that the ABC can be trained with a class-balanced consistency regularization loss. Note that existing resampling techniques are not applicable to unlabeled data, because they require a label for each data point. In contrast, we make it possible to resample unlabeled data by using the soft pseudo-label and the 0/1 mask. The consistency regularization loss, $L_{con}$, with the 0/1 mask $M(\cdot)$ is expressed as

$$L_{con}=\frac{1}{B}\sum_{b=1}^{B}\sum_{k=1}^{2}M(u_b)\,\mathbf{I}\big(\max(q_b)\geq\tau\big)\,\mathbf{H}\big(p_s(y\,|\,\mathcal{A}_k(u_b)),\,q_b\big), \quad (3)$$
$$M(u_b)=\mathcal{B}\!\left(\frac{N_L}{N_{\widehat{q}_b}}\right), \quad (4)$$

where $\mathbf{I}$ is the indicator function, $\max(q_b)$ is the highest predicted assignment probability for any class, representing the confidence of the prediction, and $\tau$ is the confidence threshold. To avoid the unwanted effects of inaccurate soft pseudo-labels during consistency regularization, we only use the weakly augmented unlabeled data points whose confidence is higher than the threshold $\tau$, as in FixMatch. To take full advantage of the few unlabeled data points whose prediction confidence exceeds $\tau$ in the early stage of training, we gradually decrease the parameter of the Bernoulli distribution $\mathcal{B}(\cdot)$ for $u_b$ from 1 to $N_L/N_{\widehat{q}_b}$, where $\widehat{q}_b$ is the one-hot pseudo-label obtained from $q_b$. Following previous studies [3, 24, 29, 4], we do not backpropagate gradients through pseudo-label prediction. The overall procedure for consistency regularization for the ABC is shown in Appendix A.
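Under the same caveats as before, Eqs. (3)-(4) can be sketched as below; the names are hypothetical, and the gradual warm-up of the Bernoulli parameter and the stop-gradient on $q_b$ are omitted for brevity.

```python
import numpy as np

def abc_consistency_loss(q, strong_probs_1, strong_probs_2, class_counts, tau, rng):
    """Soft-pseudo-label consistency loss with a class-balancing 0/1 mask (Eqs. (3)-(4)).

    q: (B, L) soft pseudo-labels q_b predicted for weakly augmented inputs.
    strong_probs_1/2: (B, L) ABC predictions for the two strong augmentations.
    class_counts: (L,) per-class counts N_l, with class_counts[-1] = N_L.
    """
    B = q.shape[0]
    q_hat = q.argmax(axis=1)                     # one-hot pseudo-label indices
    conf = (q.max(axis=1) >= tau).astype(float)  # indicator I(max(q_b) >= tau)
    # 0/1 mask M(u_b) with Bernoulli parameter N_L / N_{q_hat_b}
    mask = (rng.random(B) < class_counts[-1] / class_counts[q_hat]).astype(float)
    loss = 0.0
    for probs in (strong_probs_1, strong_probs_2):     # the two views, k = 1, 2
        ce = -(q * np.log(probs + 1e-12)).sum(axis=1)  # H(p_s, q_b) with soft target
        loss += float((mask * conf * ce).sum())
    return loss / B
```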

3.5 End-to-end training

Unlike the recent CIL trend of finetuning a classifier in a balanced manner after representation learning is completed (i.e., decoupled learning of representations and a classifier) [17, 27], we obtain a balanced classifier by training the proposed algorithm end-to-end. We train the proposed algorithm with the sum of the losses from Sections 3.3 and 3.4 and the loss for the backbone, $L_{back}$. The total loss function $L_{total}$ is expressed as

$$L_{total}=L_{cls}+L_{con}+L_{back}. \quad (5)$$

Whereas we use the sum of the losses for the backbone and ABC for training the proposed algorithm, we predict the class labels of new data points using only the ABC. In our experiments in Sections 4.4 and 4.5, we show that the proposed algorithm trained end-to-end produces better performance than competing algorithms with decoupled learning of representations and a classifier, and we analyze possible reasons. We present the pseudo code of the proposed algorithm in Appendix B.

4 Experiments

4.1 Experimental setup

We created class-imbalanced versions of the CIFAR-10, CIFAR-100 [21], and SVHN [25] datasets to conduct experiments under various ratios of class imbalance $\gamma$ and various ratios of the amount of labeled data $\beta$. For class-imbalance types, we first consider long-tailed (LT) imbalance, in which the number of data points decreases exponentially from the largest to the smallest class, i.e., $N_k=N_1\times\gamma^{-\frac{k-1}{L-1}}$, where $\gamma=\frac{N_1}{N_L}$. We also consider step imbalance [5], in which all majority classes have the same amount of data, as do all minority classes. The two types of class imbalance for the considered datasets are illustrated in Appendix C. For the main setting, we set $\gamma=100$, $N_1=1000$, and $\beta=20\%$ for CIFAR-10 and SVHN, and $\gamma=20$, $N_1=200$, and $\beta=40\%$ for CIFAR-100. Similar to [18], we set $\gamma$ for CIFAR-100 to be relatively small because CIFAR-100 has only 500 training data points per class. To evaluate the performance of the proposed algorithm on large-scale datasets, we also conducted experiments on 7.5M data points of 256-by-256 images from the LSUN dataset [37].
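The long-tailed counts $N_k=N_1\times\gamma^{-\frac{k-1}{L-1}}$ can be generated as in the following sketch (rounding to integers is our own assumption; the paper does not specify it):

```python
import numpy as np

def long_tailed_counts(N1, L, gamma):
    """Per-class counts N_k = N_1 * gamma^(-(k-1)/(L-1)) for LT imbalance."""
    k = np.arange(1, L + 1)
    return np.round(N1 * gamma ** (-(k - 1) / (L - 1))).astype(int)

# Main CIFAR-10 setting: N_1 = 1000, L = 10, gamma = 100,
# so the smallest class ends up with N_1 / gamma = 10 points.
counts = long_tailed_counts(N1=1000, L=10, gamma=100)
```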

We compared the performance of the proposed algorithm with that of various baseline algorithms. Specifically, we considered the following baseline algorithms:

  • Deep CNN (vanilla algorithm): This is trained on only labeled data with the cross-entropy loss.

  • BALMS [27] (CIL algorithm): This state-of-the-art CIL algorithm does not use unlabeled data.

  • VAT [24], ReMixMatch [3], and FixMatch [29] (SSL algorithms): These are state-of-the-art SSL algorithms, but do not consider class imbalance.

  • FixMatch+CReST+PDA and ReMixMatch+CReST+PDA (CISSL algorithms): CReST+PDA [34] mitigates class imbalance by using unlabeled data points classified as the minority classes with a higher probability than those classified as the majority classes.

  • ReMixMatch+DARP and FixMatch+DARP (CISSL algorithms): These algorithms use DARP [18] to refine the pseudo labels obtained from ReMixMatch or FixMatch.

  • ReMixMatch+DARP+cRT and FixMatch+DARP+cRT (CISSL algorithms): Compared to ReMixMatch+DARP and FixMatch+DARP, these algorithms finetune the classifier using cRT [17].

For the structure of the deep CNN used in the proposed and baseline algorithms, we used Wide ResNet-28-2 [38]. We trained the proposed algorithm for 250,000 iterations with a batch size of 64. The confidence threshold $\tau$ was set to 0.95 based on experiments with various values of $\tau$ in Appendix D. We used the Adam optimizer [20] with a learning rate of 0.002, and used Cutout [10] and RandAugment [8] for strong data augmentation, following [18]. Similar to [3, 4], we evaluated the performance of the proposed algorithm using an exponential moving average of the parameters over iterations with a decay rate of 0.999, instead of scheduling the learning rate. In Tables 1-5, we used the overall accuracy and the accuracy only for minority classes as performance measures. We repeated the experiments five times under the main setting, and three times under the step imbalance and other settings of $\beta$ and $\gamma$. We report the average and standard deviation of the performance measures over the repeated experiments. For the vanilla algorithm, FixMatch+DARP+cRT, and ReMixMatch+DARP+cRT, which suffered from overfitting, we measured performance every 500 iterations and recorded the best performance. Further details of the experimental setup are described in Appendix E.
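The parameter averaging used for evaluation can be sketched as a plain EMA update applied once per training iteration; this is a simplified NumPy version, whereas actual implementations update framework tensors in place.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over a list of parameter arrays.

    ema_params holds the slowly varying copy used for evaluation;
    params holds the current model parameters after a gradient step.
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Calling this after every iteration and evaluating with the averaged copy smooths out per-step noise, which is why it is used in place of a learning-rate schedule.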

4.2 Experimental results

The performance of the competing algorithms under the main setting is summarized in Table 1. We can observe that the proposed algorithm achieved the highest overall performance, with improved performance for minority classes. Interestingly, VAT, an SSL algorithm, showed similar performance to the vanilla algorithm, and worse performance than BALMS, a CIL algorithm. Similarly, FixMatch and ReMixMatch, which do not consider class imbalance, showed poor performance for minority classes. Although BALMS mitigated class imbalance, it produced poor overall performance, as it did not use unlabeled data for training. This demonstrates the importance of using unlabeled data for training, even in the class-imbalanced setting. FixMatch+CReST+PDA and ReMixMatch+CReST+PDA mitigated class imbalance by using unlabeled data points classified as the minority classes with a higher probability, but produced lower performance than the proposed algorithm. This may be because even if all unlabeled data points classified as minority classes are additionally used for training, their amount is still less than that of the data in the majority classes, while the proposed algorithm uses class-balanced minibatches by generating the 0/1 mask. FixMatch+DARP and ReMixMatch+DARP slightly mitigated class imbalance by refining biased pseudo-labels, but resulted in lower performance than the proposed algorithm. This may be because even perfect pseudo-labels cannot change the underlying class-imbalanced distribution of the training data. By additionally using the rebalancing technique cRT, FixMatch(ReMixMatch)+DARP+cRT performed better than FixMatch(ReMixMatch)+DARP. However, FixMatch(ReMixMatch)+DARP+cRT still performed worse than FixMatch(ReMixMatch)+ABC, although it also uses high-quality representations learned by FixMatch(ReMixMatch) and techniques for mitigating class imbalance. The superior performance of FixMatch(ReMixMatch)+ABC over FixMatch(ReMixMatch)+DARP+cRT is probably because FixMatch(ReMixMatch)+ABC was trained end-to-end, and the ABC was also trained using unlabeled data. We discuss this in more detail in Sections 4.4 and 4.5. Overall, the algorithms combined with ReMixMatch performed better than the algorithms combined with FixMatch. In addition to the overall accuracy and minority-class accuracy, we also compared the performance of the competing algorithms in terms of the geometric mean (G-mean) of class-wise accuracy under the main setting in Appendix F.

Table 1: Overall accuracy / minority-class accuracy under the main setting

Algorithm           | CIFAR-10-LT ($\gamma=100$, $\beta=20\%$) | SVHN-LT ($\gamma=100$, $\beta=20\%$) | CIFAR-100-LT ($\gamma=20$, $\beta=40\%$)
Vanilla             | 55.3±1.30 / 33.9±1.88 | 77.0±0.67 / 63.3±1.25 | 40.1±1.15 / 25.2±0.95
VAT [24]            | 55.3±0.88 / 28.2±1.55 | 81.3±0.47 / 68.2±0.88 | 40.4±0.34 / 24.8±0.38
BALMS [27]          | 70.7±0.59 / 69.8±1.03 | 87.6±0.53 / 85.0±0.67 | 50.2±0.54 / 42.9±1.03
FixMatch [29]       | 72.3±0.33 / 53.8±0.63 | 88.0±0.30 / 79.4±0.54 | 51.0±0.20 / 32.8±0.41
 w/ CReST+PDA [34]  | 76.6±0.46 / 61.4±0.85 | 89.1±0.69 / 81.7±1.18 | 51.6±0.29 / 36.4±0.46
 w/ DARP [18]       | 73.7±0.98 / 57.0±2.12 | 88.6±0.19 / 80.5±0.54 | 51.4±0.37 / 33.9±0.77
 w/ DARP+cRT [18]   | 78.1±0.89 / 66.6±1.55 | 89.9±0.44 / 83.5±0.61 | 54.7±0.46 / 41.2±0.42
 w/ ABC             | 81.1±0.82 / 72.0±1.77 | 92.0±0.38 / 87.9±0.73 | 56.3±0.19 / 43.4±0.42
ReMixMatch [3]      | 73.7±0.39 / 55.9±0.87 | 89.8±0.42 / 82.8±0.68 | 54.0±0.29 / 37.1±0.37
 w/ CReST+PDA [34]  | 75.7±0.34 / 59.6±0.76 | 90.9±0.20 / 85.2±0.39 | 54.6±0.48 / 38.1±0.69
 w/ DARP [18]       | 74.4±0.41 / 56.9±0.67 | 90.2±0.22 / 83.5±0.40 | 54.5±0.33 / 37.7±0.58
 w/ DARP+cRT [18]   | 78.5±0.61 / 66.4±1.68 | 92.1±0.48 / 87.6±0.75 | 55.1±0.45 / 43.6±0.58
 w/ ABC             | 82.4±0.45 / 75.7±1.18 | 93.9±0.16 / 92.5±0.40 | 57.6±0.26 / 46.7±0.50

To evaluate the performance of the proposed algorithm in various settings, we conducted experiments using ReMixMatch, FixMatch, and the CISSL algorithms considered in Table 1, while changing the ratio of class imbalance $\gamma$ and the ratio of the amount of labeled data $\beta$. The results for CIFAR-10 are presented in Table 2, and the results for SVHN and CIFAR-100 are presented in Appendix G. In Table 2, we can observe that the proposed algorithm achieved the highest overall accuracy with greatly improved performance for minority classes in all settings. Because FixMatch+DARP+cRT and ReMixMatch+DARP+cRT do not use unlabeled data for classifier tuning, the difference in performance between FixMatch(ReMixMatch)+DARP+cRT and the proposed algorithm increased as the ratio of the amount of labeled data $\beta$ decreased and as the ratio of class imbalance $\gamma$ increased. In addition, the difference in performance between FixMatch(ReMixMatch)+CReST+PDA and the proposed algorithm tended to increase as the ratio of class imbalance $\gamma$ increased, because the difference between the number of labeled data points belonging to the majority classes and the number of unlabeled data points classified as the minority classes increases with $\gamma$.

Table 2: Overall accuracy / minority-class accuracy for CIFAR-10-LT under various settings

Algorithm           | $\gamma=100$, $\beta=10\%$ | $\gamma=100$, $\beta=30\%$ | $\gamma=50$, $\beta=20\%$ | $\gamma=150$, $\beta=20\%$
FixMatch [29]       | 70.0±0.59 / 48.9±1.04 | 74.9±0.63 / 58.2±1.28 | 81.2±0.07 / 70.7±0.36 | 68.5±0.60 / 45.8±1.15
 w/ CReST+PDA [34]  | 73.9±0.40 / 58.9±1.14 | 77.6±0.73 / 64.0±1.39 | 83.3±0.10 / 75.7±0.39 | 70.0±0.82 / 49.4±1.52
 w/ DARP+cRT [18]   | 74.6±0.98 / 59.2±2.12 | 79.0±0.25 / 67.7±0.95 | 83.6±0.42 / 77.1±1.19 | 73.2±0.85 / 57.1±1.13
 w/ ABC             | 77.2±1.60 / 65.7±2.85 | 81.5±0.29 / 72.9±0.96 | 85.2±0.51 / 80.2±0.64 | 77.1±0.46 / 64.4±0.92
ReMixMatch [3]      | 71.5±0.51 / 52.2±1.08 | 75.8±0.10 / 59.4±0.17 | 81.5±0.17 / 70.7±0.32 | 69.9±0.23 / 48.4±0.60
 w/ CReST+PDA [34]  | 73.8±0.32 / 56.6±0.43 | 78.6±0.73 / 64.8±1.49 | 83.9±0.26 / 75.4±0.52 | 71.3±0.77 / 50.8±1.59
 w/ DARP+cRT [18]   | 75.9±1.20 / 62.1±3.10 | 81.0±0.16 / 70.7±0.72 | 84.5±0.80 / 77.8±1.67 | 73.9±0.59 / 57.4±1.45
 w/ ABC             | 79.8±0.36 / 70.8±0.92 | 84.3±1.03 / 80.6±0.97 | 87.5±0.31 / 84.6±1.19 | 80.6±0.66 / 72.1±1.51

We also conducted experiments under a step-imbalance setting, where the class imbalance is more noticeable. This setting assumes a more severely imbalanced class distribution than the LT imbalance settings, because half of the classes have very scarce data. The experimental results for CIFAR-10 are presented in Table 3, and the results for SVHN and CIFAR-100 are presented in Appendix H. In Table 3, we can see that the proposed algorithm achieved the best performance, and the performance margin is greater than that of the LT imbalance settings. ReMixMatch+CReST+PDA showed relatively low performance compared to the other algorithms.

Table 3: Overall accuracy/minority-class accuracy on CIFAR-10 under the step imbalance setting
CIFAR-10-Step,   $\gamma=100$,   $\beta=20\%$
Algorithm w/ - w/ CReST+PDA [34] w/ DARP+cRT [18] w/ ABC
FixMatch [29] 54.0±0.84 / 11.8±1.71   71.1±0.78 / 48.2±2.26   69.8±1.51 / 45.1±2.70   75.9±0.49 / 57.0±1.07
ReMixMatch [3] 60.8±0.10 / 25.1±1.28   64.6±0.97 / 33.5±2.05   72.3±1.77 / 50.6±3.53   76.4±1.70 / 65.7±1.30
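The two imbalance profiles used throughout the experiments can be reproduced with a short sketch. The exponential-decay formula for the long-tailed profile is the commonly used one and is an assumption here (the text specifies only $\gamma$ and $N_1$); the step profile follows the description above:

```python
def long_tailed_counts(n1, gamma, num_classes):
    """Exponentially decaying class sizes: N_k = N_1 * gamma^{-k/(L-1)}, k = 0..L-1."""
    return [round(n1 * gamma ** (-k / (num_classes - 1))) for k in range(num_classes)]

def step_counts(n1, gamma, num_classes):
    """First half of the classes are majority (N_1), second half minority (N_1 / gamma)."""
    half = num_classes // 2
    return [n1] * half + [round(n1 / gamma)] * (num_classes - half)

lt = long_tailed_counts(1000, 100, 10)
step = step_counts(1000, 100, 10)
print(lt)    # decays smoothly from 1000 (class 0) down to 10 (class 9)
print(step)  # five classes with 1000 samples, five with 10
```

With $N_1=1000$ and $\gamma=100$, both profiles give the most frequent class 100 times more data than the least frequent one, but the step profile leaves half of the classes with only 10 samples each.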

To evaluate the performance of the proposed algorithm on a large-scale dataset, we also conducted experiments on the LSUN dataset [37], which is naturally long-tailed. Among the algorithms considered in Tables 2 and 3, those combined with CReST were excluded from the comparison, because CReST requires loading the entire unlabeled dataset in its repeated pseudo-label update process, which is infeasible for the large-scale LSUN dataset. Instead, we additionally considered FixMatch+cRT and ReMixMatch+cRT for comparison. The experimental results are presented in Table 4. The proposed algorithm outperformed the other baseline algorithms. DARP degraded performance, possibly because of the very large scale of the LSUN dataset. Specifically, DARP solves a convex optimization over all unlabeled data points to refine the pseudo-labels; as the scale of the unlabeled dataset increases, this optimization problem becomes more difficult to solve and, consequently, the pseudo-labels may be refined inaccurately. Unlike the results for the other datasets, the algorithms combined with FixMatch performed better than those combined with ReMixMatch.

Table 4: Overall accuracy/minority-class accuracy for the large-scale LSUN dataset
LSUN,   $\gamma=100$,   $\beta=20\%$
Algorithm w/ - w/ cRT [17] w/ DARP [18] w/ DARP+cRT [18] w/ ABC
FixMatch [29] 73.1 / 55.3   77.0 / 71.5   71.0 / 51.8   75.8 / 69.5   78.9 / 75.5
ReMixMatch [3] 69.4 / 49.1   75.4 / 69.5   65.6 / 44.1   72.1 / 67.5   76.9 / 69.5

4.3 Complexity of the proposed algorithm

The proposed algorithm requires additional parameters for the ABC, but their number is negligible compared to the number of backbone parameters. For example, the ABC added only 0.09% and 0.87% of the backbone parameter count for CIFAR-10 with 10 classes and CIFAR-100 with 100 classes, respectively. Moreover, because the ABC shares the representation layer of the backbone, it does not significantly increase memory usage or training time. Furthermore, we could train the proposed algorithm on the large-scale LSUN dataset without a significant increase in computation cost, because the entire training procedure can be carried out using minibatches of data. In contrast, the algorithms combined with DARP required convex optimization over all pseudo-labels, whose computation cost increased significantly as the number of classes or the amount of data increased. Similarly, the algorithms combined with CReST took considerable time to train, because CReST requires iterative re-training with a labeled set expanded by adding unlabeled data points with pseudo-labels. We report the floating point operations per second (FLOPS) for each algorithm on an Nvidia Tesla V100 in Appendix I.
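The quoted overheads are consistent with the ABC being a single fully connected layer on the backbone's representation. A rough check, assuming the standard Wide ResNet-28-2 architecture (128-dimensional features, roughly 1.47M parameters; both figures are assumptions about the architecture, not quoted from the text):

```python
def abc_overhead(feature_dim, num_classes, backbone_params):
    """Parameter count of a single linear layer (weights + biases) as a share of the backbone."""
    abc_params = feature_dim * num_classes + num_classes
    return abc_params, 100.0 * abc_params / backbone_params

WRN_28_2_PARAMS = 1_470_000  # approximate size of Wide ResNet-28-2 (assumption)

for num_classes in (10, 100):
    n, pct = abc_overhead(128, num_classes, WRN_28_2_PARAMS)
    print(f"{num_classes} classes: {n} extra parameters ({pct:.2f}% of backbone)")
```

With these assumed sizes, the overhead comes out near 0.09% for 10 classes and about 0.9% for 100 classes, in line with the figures reported above.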

4.4 Qualitative analysis of high-quality representations and balanced classification

The ABC can use high-quality representations learned by the backbone when performing balanced classification. To verify this, in Figure 3, we present t-distributed stochastic neighbor embedding (t-SNE) [31] visualizations of the representations of the CIFAR-10 test set learned by the ABC (without SSL backbone), FixMatch+ABC, and ReMixMatch+ABC, trained on CIFAR-10-LT under the main setting. Different colors indicate different classes. As expected, the ABC (without SSL backbone) failed to learn class-separable representations, because training with the 0/1 mask left too little data for representation learning. In contrast, by training the backbone (FixMatch or ReMixMatch) together with the ABC, the proposed algorithm could use the entire dataset and learn high-quality representations. In this example, ReMixMatch produced more separable representations than FixMatch, which shows that the choice of backbone affects the performance of the proposed algorithm, as expected.

(a) ABC (without SSL backbone)          (b) FixMatch+ABC          (c) ReMixMatch+ABC
Figure 3: t-SNE of the proposed algorithm and the ABC (without SSL backbone)

The proposed algorithm can also mitigate class imbalance by using the ABC. To verify this, we compare in Figure 4 the confusion matrices of the predictions on the CIFAR-10 test set using ReMixMatch, ReMixMatch+DARP+cRT, and ReMixMatch+ABC, each trained on CIFAR-10 under the main setting. In the confusion matrices, the value in the $i$th row and $j$th column represents the proportion of data points belonging to the $i$th class that are predicted as the $j$th class. Each cell is shaded a darker red as this proportion grows. We can see that ReMixMatch often misclassified data points in the minority classes (e.g., classes 8 and 9 into classes 0 and 1). This may be because ReMixMatch does not account for class imbalance, so biased pseudo-labels were used for training. ReMixMatch+DARP+cRT produced a more balanced class distribution than ReMixMatch by additionally applying DARP+cRT. However, a significant number of data points in the minority classes were still misclassified as majority classes. In contrast, ReMixMatch+ABC classified the test data points in the minority classes with higher accuracy, and produced a significantly more balanced class distribution than ReMixMatch+DARP+cRT, as shown in Figure 4 (c). As both ReMixMatch+DARP+cRT and ReMixMatch+ABC use ReMixMatch to learn representations, the performance gap between these two algorithms results from the different characteristics of the ABC versus DARP+cRT, as follows. First, DARP+cRT does not use unlabeled data to train its classifier after representation learning is completed, whereas the ABC uses unlabeled data with unbiased pseudo-labels for its training. Second, whereas DARP+cRT decouples the learning of representations from the training of a classifier, the ABC is trained end-to-end, interacting with the representations learned by the backbone.
We also present the confusion matrices of the predictions on the CIFAR-10 test set using FixMatch, FixMatch+DARP+cRT, and FixMatch+ABC, as well as the confusion matrices of the pseudo-labels on the same dataset using ReMixMatch, ReMixMatch+DARP+cRT, ReMixMatch+ABC, FixMatch, FixMatch+DARP+cRT, and FixMatch+ABC, in Appendix J. Moreover, we compare the ABC and the classifier of DARP+cRT in more detail using the validation loss plots in Appendix K.
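The row normalization used in the confusion matrices can be computed as follows; this is a generic sketch, not the authors' plotting code:

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, num_classes):
    """Entry (i, j) = fraction of class-i examples that were predicted as class j."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)  # avoid division by zero for empty classes

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(row_normalized_confusion(y_true, y_pred, 3))
```

Each row sums to one (for non-empty classes), so a dark off-diagonal cell in a minority-class row directly shows the fraction of that class misclassified into a majority class.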

(a) ReMixMatch          (b) ReMixMatch+DARP+cRT          (c) ReMixMatch+ABC
Figure 4: Confusion matrices of the predictions on the test set of CIFAR-10

4.5 Ablation study

We conducted an ablation study on CIFAR-10-LT under the main setting to investigate the effect of each element of the proposed algorithm. The results for ReMixMatch+ABC are presented in Table 5, where each row corresponds to the proposed algorithm modified as described in that row. The results are summarized as follows. 1) If we did not gradually decrease the parameter of the Bernoulli distribution $\mathcal{B}(\cdot)$ when conducting consistency regularization, an overbalance problem occurred because of unlabeled data misclassified as minority classes. 2) Without consistency regularization for the ABC, the decision boundary did not clearly separate the classes. 3) Without the 0/1 mask for $L_{cls}$ and $L_{con}$, the ABC was trained to be biased toward the majority classes. 4) Without the confidence threshold $\tau$ for consistency regularization, training became unstable and, consequently, the ABC became biased toward certain classes. 5) Similarly, using hard pseudo-labels instead of soft pseudo-labels for consistency regularization biased the ABC toward certain classes. 6) If the ABC was used alone without the backbone, performance decreased because the ABC could not use high-quality representations learned by the backbone. 7) When we used a re-weighting technique [13] instead of the mask for the ABC, training became unstable because of the abnormally large gradients computed for the minority-class data. 8) Decoupled training of the backbone and the ABC decreased classification performance, as also analyzed in Section 4.4. The results of the corresponding ablation study for FixMatch+ABC are presented in Appendix L.

Table 5: Ablation study for ReMixMatch+ABC on CIFAR-10-LT, $\gamma=100$, $\beta=20\%$
Ablation study Overall Minority
ReMixMatch+ABC (proposed algorithm) 82.4 75.7
Without gradually decreasing the parameter of $\mathcal{B}(\cdot)$ for consistency regularization 81.8 74.6
Without consistency regularization for the ABC 79.4 66.9
Without using the 0/1 mask for the consistency regularization loss $L_{con}$ 79.0 69.2
Without using the 0/1 mask for the classification loss $L_{cls}$ 74.4 57.8
Without using the confidence threshold $\tau$ for consistency regularization 74.3 75.4
Using hard pseudo-labels for consistency regularization 70.2 75.1
Without training the backbone (ABC without SSL backbone) 68.7 56.2
Training the ABC with a re-weighting technique 81.2 74.1
Decoupled training of the backbone and ABC 79.5 72.3
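For intuition, the 0/1 mask and the gradually decreased Bernoulli parameter studied in the first and middle rows of Table 5 can be sketched as below. The target keep probability $\min_k N_k / N_y$ (which equalizes the expected per-class contribution to the loss) and the linear decay schedule are illustrative assumptions; the exact parameterization is given in Sections 3.3 and 3.4 of the main paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_probability(class_counts, labels, progress):
    """Bernoulli parameter for the 0/1 mask.

    At progress=0 every example is kept (p=1). As training progresses, the
    parameter decays linearly toward min(N)/N_y, so that in expectation each
    class contributes equally to the loss. Schedule and target are assumptions.
    """
    counts = np.asarray(class_counts, dtype=float)
    balanced = counts.min() / counts[labels]
    return (1.0 - progress) + progress * balanced

counts = [1000, 100, 10]            # imbalanced class sizes
labels = np.array([0, 1, 2])        # one example per class
p = keep_probability(counts, labels, progress=1.0)
mask = (rng.random(len(labels)) < p).astype(int)  # 0/1 mask for L_cls / L_con
print(p)  # minority class (10 samples) always kept; majority kept rarely
```

Compared with re-weighting (row "Training the ABC with a re-weighting technique"), stochastic masking keeps per-example gradients bounded, which is consistent with the stability difference reported above.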

5 Conclusion

We introduced the ABC, which is attached to a state-of-the-art SSL algorithm, for CISSL. The ABC can utilize high-quality representations learned by the backbone while being trained to make class-balanced predictions. The ABC also utilizes unlabeled data by conducting consistency regularization in a manner modified for class-imbalance problems. The experimental results obtained under various settings demonstrate that the proposed algorithm outperforms the baseline algorithms. We also conducted a qualitative analysis and an ablation study to verify the contribution of each element of the proposed algorithm. The proposed algorithm assumes that the labeled and unlabeled data are class-imbalanced to the same extent. In the future, we plan to relax this assumption by adopting a module for estimating the class distribution. Deep learning algorithms can be applied to many societal problems. However, if the training data are imbalanced, the algorithms could be trained to make socially biased decisions in favor of the majority groups. The proposed algorithm can contribute to mitigating these issues. However, there is also a potential risk that the proposed algorithm could be used as a tool to identify minorities and discriminate against them. It should be ensured that the proposed method cannot be used for any purpose that may have negative social impacts.

6 Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2018R1C1B6004511, 2020R1A4A10187747).

References

  • An et al., [2021] An, J., Ying, L., and Zhu, Y. (2021). Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients. In International Conference on Learning Representations.
  • Barandela et al., [2003] Barandela, R., Rangel, E., Sánchez, J. S., and Ferri, F. J. (2003). Restricted decontamination for the imbalanced training sample problem. In Iberoamerican congress on pattern recognition, pages 424–431. Springer.
  • [3] Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., and Raffel, C. (2019a). Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations.
  • [4] Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. (2019b). Mixmatch: A holistic approach to semi-supervised learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Buda et al., [2018] Buda, M., Maki, A., and Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259.
  • Cao et al., [2019] Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Chawla et al., [2002] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.
  • Cubuk et al., [2020] Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. (2020). Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703.
  • Cui et al., [2019] Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S. (2019). Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277.
  • DeVries and Taylor, [2017] DeVries, T. and Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • Gidaris et al., [2018] Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations.
  • Grandvalet and Bengio, [2005] Grandvalet, Y. and Bengio, Y. (2005). Semi-supervised learning by entropy minimization. In Saul, L., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems, volume 17. MIT Press.
  • He and Garcia, [2009] He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284.
  • Huang et al., [2016] Huang, C., Li, Y., Loy, C. C., and Tang, X. (2016). Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384.
  • Jamal et al., [2020] Jamal, M. A., Brown, M., Yang, M.-H., Wang, L., and Gong, B. (2020). Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7610–7619.
  • Japkowicz, [2000] Japkowicz, N. (2000). The class imbalance problem: Significance and strategies. In Proc. 2000 International Conference on Artificial Intelligence, volume 1, pages 111–117.
  • Kang et al., [2020] Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. (2020). Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations.
  • [18] Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S. J., and Shin, J. (2020a). Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 14567–14579. Curran Associates, Inc.
  • [19] Kim, J., Jeong, J., and Shin, J. (2020b). M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13896–13905.
  • Kingma and Ba, [2015] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR, 2015.
  • Krizhevsky, [2009] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto.
  • Lee et al., [2013] Lee, D.-H. et al. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3.
  • Mikolov et al., [2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
  • Miyato et al., [2018] Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993.
  • Netzer et al., [2011] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop, 2011.
  • Park et al., [2018] Park, S., Park, J., Shin, S.-J., and Moon, I.-C. (2018). Adversarial dropout for supervised and semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Ren et al., [2020] Ren, J., Yu, C., Sheng, S., Ma, X., Zhao, H., Yi, S., and Li, H. (2020). Balanced meta-softmax for long-tailed visual recognition. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 4175–4186. Curran Associates, Inc.
  • Ren et al., [2018] Ren, M., Zeng, W., Yang, B., and Urtasun, R. (2018). Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4334–4343. PMLR.
  • Sohn et al., [2020] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.-L. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33.
  • Tarvainen and Valpola, [2017] Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Van der Maaten and Hinton, [2008] Van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(11).
  • Verma et al., [2019] Verma, V., Lamb, A., Kannala, J., Bengio, Y., and Lopez-Paz, D. (2019). Interpolation consistency training for semi-supervised learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 3635–3641. International Joint Conferences on Artificial Intelligence Organization.
  • Wang et al., [2017] Wang, Y.-X., Ramanan, D., and Hebert, M. (2017). Learning to model the tail. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Wei et al., [2021] Wei, C., Sohn, K., Mellina, C., Yuille, A., and Yang, F. (2021). Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. arXiv preprint arXiv:2102.09559.
  • Yang and Xu, [2020] Yang, Y. and Xu, Z. (2020). Rethinking the value of labels for improving class-imbalanced learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 19290–19301. Curran Associates, Inc.
  • Yin et al., [2018] Yin, X., Yu, X., Sohn, K., Liu, X., and Chandraker, M. (2018). Feature transfer learning for deep face recognition with under-represented data. arXiv e-prints, pages arXiv–1803.
  • Yu et al., [2015] Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. (2015). Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
  • Zagoruyko and Komodakis, [2016] Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
  • Zhai et al., [2019] Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. (2019). S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1476–1485.

Appendix A Overall procedure of consistency regularization for ABC

Figure 5 illustrates the overall procedure of consistency regularization for the ABC. The detailed procedure is described in Section 3.4 of the main paper.

Figure 5: Overall procedure of consistency regularization for the ABC

Appendix B Pseudo code of the proposed algorithm

The pseudo-code of the proposed algorithm is presented in Algorithm 1. The for loop (lines 2–14) can be run in parallel. The classification loss $L_{cls}$ and consistency regularization loss $L_{con}$ are defined in detail in Sections 3.3 and 3.4 of the main paper.

Algorithm 1 Pseudo code of the proposed algorithm

Input: $\mathcal{MB}_{\mathcal{X}}=\left\{(x_{b},y_{b}):b\in(1,\ldots,B)\right\}\subset\mathcal{X}$, $\mathcal{MB}_{\mathcal{U}}=\left\{u_{b}:b\in(1,\ldots,B)\right\}\subset\mathcal{U}$
      Output: Classification model $f:\mathbb{R}^{d}\rightarrow\{1,\ldots,L\}$
      Parameters: $\boldsymbol{\theta}$ (parameters of Wide ResNet-28-2 and the ABC)

1:while Training do
2:     for $b=1$ to $B$ do
3:         $\alpha(x_{b})=$ Augment$(x_{b})$
4:         $\alpha(u_{b})=$ WeakAugment$(u_{b})$
5:         $\mathcal{A}_{k}(u_{b})=$ StrongAugment$_{k}(u_{b})$, $k=1,2$
6:         Predicted class distribution for $\alpha(x_{b})$: $p_{s}(y|\alpha(x_{b}))$
7:         Generate the 0/1 mask $M(x_{b})$.
8:         Soft pseudo-label $q_{b}=p_{s}(y|\alpha(u_{b}))$
9:         if $\max(q_{b})\geq 0.95$ then
10:              Predicted class distribution for $\mathcal{A}_{k}(u_{b})$: $p_{s}(y|\mathcal{A}_{k}(u_{b}))$, $k=1,2$
11:              Generate the 0/1 mask $M(u_{b})$.
12:         end if
13:         Loss from the backbone: $L_{back}$ += backbone$(\alpha(x_{b}),\alpha(u_{b}),\mathcal{A}_{k}(u_{b}))$
14:     end for
15:     Calculate the classification loss $L_{cls}$ and the consistency regularization loss $L_{con}$.
16:     Total loss $L_{total}=L_{cls}+L_{con}+L_{back}$
17:     $\Delta\boldsymbol{\theta}\propto-\nabla_{\boldsymbol{\theta}}L_{total}$, $\quad\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}+\Delta\boldsymbol{\theta}$
18:end while
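A compact, framework-agnostic sketch of how lines 8, 9, 15, and 16 assemble the ABC losses; the cross-entropy forms and array shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def minibatch_losses(p_labeled, y, mask_x, p_weak, p_strong, mask_u, tau=0.95):
    """Assemble L_cls and L_con for one minibatch (cross-entropy forms assumed).

    p_labeled, p_weak, p_strong: (B, L) predicted class probabilities;
    y: (B,) integer labels; mask_x, mask_u: (B,) 0/1 masks.
    """
    eps = 1e-12
    # classification loss on labeled data, filtered by the 0/1 mask M(x_b)
    l_cls = -np.mean(mask_x * np.log(p_labeled[np.arange(len(y)), y] + eps))
    # soft pseudo-labels from the weakly augmented view (line 8)
    q = p_weak
    confident = q.max(axis=1) >= tau                      # line 9: threshold check
    # consistency: strong-view prediction should match the soft pseudo-label
    ce = -(q * np.log(p_strong + eps)).sum(axis=1)
    l_con = np.mean(mask_u * confident * ce)
    return l_cls, l_con

p_lab = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
l_cls, l_con = minibatch_losses(p_lab, np.array([0, 1]), np.ones(2),
                                np.array([[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]),
                                np.array([[0.9, 0.05, 0.05], [0.3, 0.4, 0.3]]),
                                np.ones(2))
print(l_cls, l_con)  # the second unlabeled example falls below tau and is ignored
```

The backbone loss $L_{back}$ would simply be added to these two terms to form $L_{total}$, as in line 16.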

Appendix C Two types of class imbalance for the considered datasets

(a) Long-tailed imbalance          (b) Step imbalance
Figure 6: Long-tailed imbalance and step imbalance

The two types of class imbalance for the considered datasets are illustrated in Figure 6. For both types of imbalance, we set $\gamma=100$, $N_{1}=1000$, and $\beta=20\%$. Figure 6 (b) shows that each minority class has a very small amount of data. Existing SSL algorithms can be significantly biased toward the majority classes under step-imbalance settings.

Appendix D Specification of the confidence threshold τ\tau

Table 6: Mean and standard deviation (STD) of validation accuracy during the last 50 epochs
ReMixMatch+ABC,   CIFAR-10-LT,   $\gamma=100$,   $\beta=20\%$
$\tau$ 1 0.98 0.95 0.9 0.85 0.8 0.75 0.7
Mean, STD 78.9, 0.36 81.8, 0.34 82.3, 0.2 81.3, 0.32 81.5, 0.39 81.2, 0.63 80.0, 2.87 79.0, 5.76

In general, the confidence threshold $\tau$ should be set high, but not too high. If $\tau$ is low, training becomes unstable because many misclassified unlabeled data points are used for training. However, if $\tau$ is too high, most of the unlabeled data points are not used for consistency regularization. Based on these insights, we set $\tau=0.95$ in our experiments. We confirmed experimentally that this value of $\tau$ enables both high accuracy and stability. Specifically, we conducted experiments on CIFAR-10-LT under the main setting while varying the value of $\tau$. We measured the validation accuracy of ReMixMatch+ABC during the last 50 epochs (1 epoch = 500 iterations) of training and calculated the mean and standard deviation (STD) of these values. As can be seen from Table 6, the proposed algorithm achieved the highest mean and the lowest STD of validation accuracy when $\tau$ was 0.95. When $\tau$ was set higher or lower than 0.95, the mean validation accuracy decreased. In particular, as $\tau$ decreased below 0.95, the STD increased rapidly, indicating unstable training.
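The thresholding itself is simple confidence masking, as in FixMatch-style consistency regularization; a minimal sketch:

```python
import numpy as np

def confident_mask(probs, tau=0.95):
    """Select unlabeled examples whose maximum predicted probability reaches tau."""
    probs = np.asarray(probs)
    return probs.max(axis=1) >= tau

probs = np.array([[0.97, 0.02, 0.01],   # confident -> pseudo-label is used
                  [0.50, 0.30, 0.20]])  # uncertain -> example is ignored
print(confident_mask(probs))
```

Lowering $\tau$ admits more (but noisier) pseudo-labels, which matches the rising STD in Table 6; raising it toward 1 filters out nearly all unlabeled data.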

Appendix E Further details of the experimental setup

We describe further details of the experimental setup. To train ReMixMatch, we gradually increased the coefficient of the loss associated with the unlabeled data points, following [18]. We found that without this gradual increase, the validation loss of ReMixMatch did not converge. To train FixMatch, we used the labeled dataset once more as an unlabeled dataset by removing the labels for the experiments on CIFAR-100, following the previous study [29], but not for the experiments on CIFAR-10 and SVHN, because it did not improve performance. We followed the default settings of ReMixMatch [3] and FixMatch [29] unless mentioned otherwise.

To train the ABC, we also gradually decreased the parameter of $\mathcal{B}(\cdot)$ for calculating the classification loss in the experiments on CIFAR-10 and SVHN under the step-imbalance setting. This prevents unstable training by allowing each labeled data point of the majority classes to be used more frequently for training.

Appendix F Geometric mean (G-mean) of class-wise accuracy under the main setting

Table 7: Performance comparison using G-mean for the main setting
  CIFAR-10-LT SVHN-LT CIFAR-100-LT
Algorithm $\gamma=100$, $\beta=20\%$ $\gamma=100$, $\beta=20\%$ $\gamma=20$, $\beta=40\%$
FixMatch [29] 62.0 87.3 38.5
w/ CReST+PDA [34] 74.4 88.6 42.3
w/ DARP [18] 71.5 87.6 40.4
w/ DARP+cRT [18] 76.7 89.8 47.0
w/ ABC 80.5 91.8 49.0
ReMixMatch [3] 62.5 89.5 41.2
w/ CReST+PDA [34] 72.2 90.7 43.1
w/ DARP [18] 71.9 89.7 42.5
w/ DARP+cRT [18] 77.9 92.0 48.3
w/ ABC 81.9 93.8 50.8

To evaluate whether the proposed algorithm performs in a balanced way across all classes, we also measured the performance for the main setting using the geometric mean (G-mean) of class-wise accuracy, with a correction to avoid zeroing. We set the correction hyperparameter to 1%, meaning that the minimum class-wise accuracy is floored at 1%. The results in Table 7 demonstrate that the proposed algorithm performed in a balanced way.
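The corrected G-mean can be computed as follows (the 1% floor matches the text; the implementation is otherwise a generic sketch):

```python
import numpy as np

def g_mean(class_accuracies, floor=0.01):
    """Geometric mean of class-wise accuracies, floored to avoid zeroing."""
    acc = np.maximum(np.asarray(class_accuracies, dtype=float), floor)
    return float(np.exp(np.mean(np.log(acc))))

# a single zero-accuracy class no longer collapses the score to zero
print(g_mean([0.9, 0.8, 0.0]))
```

Computing the mean in log space avoids underflow, and the floor ensures that one empty class yields a heavily penalized but nonzero score instead of exactly zero.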

Appendix G Experimental results on SVHN and CIFAR-100 under various settings

For the experiments on SVHN with $\gamma=150$ and $\beta=20\%$, the convex optimization problem of ReMixMatch+DARP+cRT for refining the pseudo-labels did not converge, and thus we could not measure the performance. The experimental results for SVHN and CIFAR-100 under the various settings showed the same trend as those for CIFAR-10, described in Section 4.2 of the main paper.

Table 8: Overall accuracy/minority-class accuracy on SVHN under various settings
SVHN-LT
Algorithm $\gamma=100$, $\beta=10\%$ | $\gamma=100$, $\beta=30\%$ | $\gamma=50$, $\beta=20\%$ | $\gamma=150$, $\beta=20\%$
FixMatch [29] 88.5±0.25 / 80.3±0.42   88.7±0.36 / 80.7±0.65   91.1±0.18 / 85.3±0.28   85.6±0.17 / 74.6±0.43
w/ CReST+PDA [34] 89.2±0.43 / 81.7±0.95   89.9±0.36 / 83.0±0.37   91.7±0.86 / 87.6±0.53   86.7±0.89 / 76.7±1.70
w/ DARP+cRT [18] 89.3±0.33 / 83.9±0.47   90.7±0.28 / 84.8±0.37   92.1±0.30 / 87.7±0.44   88.0±0.74 / 80.1±1.88
w/ ABC 92.3±0.38 / 88.7±0.92   92.3±0.34 / 88.3±0.49   93.5±0.17 / 90.7±0.25   91.2±0.15 / 86.2±0.15
ReMixMatch [3] 89.2±0.17 / 81.7±0.41   90.7±0.15 / 84.5±0.46   92.4±0.21 / 87.8±0.48   88.6±0.16 / 80.4±0.42
w/ CReST+PDA [34] 89.8±0.12 / 83.0±0.08   91.2±0.17 / 85.3±0.24   93.3±0.02 / 90.0±0.36   88.8±0.41 / 80.7±0.82
w/ DARP+cRT [18] 91.7±0.26 / 86.6±0.45   93.2±0.08 / 89.3±0.21   93.6±0.41 / 90.4±0.52   - / -
w/ ABC 93.2±0.64 / 92.2±0.44   94.4±0.37 / 93.3±0.32   94.7±0.35 / 93.5±0.56   93.2±0.46 / 91.8±0.79
Table 9: Overall accuracy/minority-class accuracy on CIFAR-100 under various settings
CIFAR-100-LT
Algorithm $\gamma=20$, $\beta=20\%$ | $\gamma=20$, $\beta=50\%$ | $\gamma=10$, $\beta=40\%$ | $\gamma=30$, $\beta=40\%$
FixMatch [29] 46.1±0.23 / 26.6±0.34   52.3±0.54 / 34.7±0.80   57.4±0.15 / 44.8±0.17   47.6±0.09 / 27.6±0.21
w/ CReST+PDA [34] 46.7±0.49 / 29.3±0.54   52.7±0.06 / 37.4±0.37   57.3±0.23 / 47.5±0.22   48.5±0.06 / 30.0±0.04
w/ DARP+cRT [18] 48.9±0.11 / 33.5±0.17   55.9±0.43 / 43.5±1.28   59.0±0.40 / 50.4±1.09   51.3±0.29 / 36.4±0.50
w/ ABC 49.7±0.40 / 34.6±0.76   58.3±0.74 / 46.7±1.12   61.6±0.15 / 53.0±0.26   53.6±0.35 / 38.8±0.69
ReMixMatch [3] 49.0±0.29 / 29.9±0.42   54.4±0.13 / 37.8±0.12   59.5±0.20 / 47.1±0.42   51.0±0.11 / 32.0±0.50
w/ CReST+PDA [34] 49.4±0.32 / 31.8±0.15   54.4±0.21 / 38.6±0.35   58.8±0.08 / 47.6±0.24   51.9±0.34 / 33.5±0.69
w/ DARP+cRT [18] 50.2±0.40 / 35.2±0.55   54.6±1.75 / 44.8±2.09   59.4±1.04 / 52.1±0.71   52.8±0.24 / 38.4±0.30
w/ ABC 52.5±0.10 / 38.5±0.42   59.3±0.66 / 49.5±1.02   63.5±0.29 / 57.1±0.06   55.4±0.46 / 42.8±0.67

Appendix H Experimental results on SVHN and CIFAR-100 under the step imbalance setting

Experimental results for SVHN and CIFAR-100 under the step imbalanced setting showed the same tendency as those for CIFAR-10, described in Section 4.2 of the main paper.

Table 10: Overall accuracy/minority-class accuracy on SVHN under the step imbalanced setting
SVHN-Step,   γ=100,   β=20%
Algorithm w/ - w/ CReST + PDA [34] w/ DARP + cRT [18] w/ ABC
FixMatch [29] 79.8±1.34 / 61.5±2.76 86.6±0.19 / 76.3±0.23 85.9±0.28 / 74.3±0.37 91.2±0.15 / 85.6±0.35
ReMixMatch [3] 82.7±0.42 / 67.4±0.81 85.9±0.13 / 73.9±0.16 90.5±1.13 / 84.3±1.86 91.3±1.61 / 89.8±0.95
Table 11: Overall accuracy/minority-class accuracy on CIFAR-100 under the step imbalanced setting
CIFAR-100-Step,   γ=20,   β=40%
Algorithm w/ - w/ CReST + PDA [34] w/ DARP + cRT [18] w/ ABC
FixMatch [29] 46.7±0.29 / 15.0±0.26 49.9±0.24 / 26.7±0.41 50.7±0.61 / 28.8±2.04 54.7±0.06 / 32.1±0.12
ReMixMatch [3] 47.3±0.12 / 16.5±1.06 48.5±0.18 / 19.2±0.35 53.6±0.28 / 35.0±1.04 56.0±0.46 / 38.3±0.55

Appendix I Floating point operations per second (FLOPS) of each algorithm

As mentioned in Section 4.3 of the main paper, the computation cost of the algorithms combined with DARP increased as the number of classes or the amount of data increased. In contrast, the computation cost of the proposed algorithm did not increase significantly, because the whole training procedure can be carried out using minibatches. The FLOPS of FixMatch+CReST and ReMixMatch+CReST are the same as those of FixMatch and ReMixMatch, but the algorithms combined with CReST required iterative re-training on a labeled set expanded by adding unlabeled data points with pseudo labels. We measured FLOPS using an Nvidia Tesla V100. For the experiments on CIFAR-10 and CIFAR-100, we used only one GPU, whereas we used four GPUs in parallel for the experiments on LSUN.

Table 12: FLOPS of each algorithm
Algorithm CIFAR-10 CIFAR-100 LSUN
FixMatch [29] 14.5 iter/sec 14.7 iter/sec 2.6 iter/sec
FixMatch+DARP [18] 12.0 iter/sec 6.3 iter/sec 0.4 iter/sec
FixMatch+ABC 11.2 iter/sec 11.0 iter/sec 2.0 iter/sec
ReMixMatch [3] 6.9 iter/sec 6.9 iter/sec 1.3 iter/sec
ReMixMatch+DARP [18] 6.3 iter/sec 4.5 iter/sec 0.3 iter/sec
ReMixMatch+ABC 5.8 iter/sec 5.6 iter/sec 0.9 iter/sec
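The throughput figures above are reported in iterations per second. A minimal sketch of how such a measurement can be made is given below; the `measure_throughput` helper and its warm-up scheme are ours for illustration, not the authors' benchmarking code.

```python
import time

def measure_throughput(step_fn, n_warmup=5, n_iters=50):
    """Estimate training throughput in iterations per second.

    step_fn: a callable that runs one training iteration.
    The first n_warmup calls are excluded from timing to avoid
    start-up effects (e.g., GPU kernel compilation, data caching).
    """
    for _ in range(n_warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed
```

In practice, `step_fn` would wrap one forward/backward pass of the algorithm under test, with device synchronization before each timestamp when a GPU is involved.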

Appendix J Further qualitative analysis and quantitative comparison

Figure 7 (b) presents the biased predictions of FixMatch [29], a recent SSL algorithm, trained on CIFAR-10 with the amount of Class 0 data being 100 times that of Class 9, as depicted in Figure 7 (a). In contrast, Figure 7 (c) shows that the class distribution of the predicted labels became more balanced using FixMatch+ABC trained on the same dataset. These results are consistent with those in Figure 1 of the main paper.

 (a) Class-imbalanced training set      (b) FixMatch       (c) FixMatch+ABC
Figure 7: Predictions on a class-balanced test set of CIFAR-10 using FixMatch (b) and FixMatch+ABC (c) trained on a class-imbalanced training set (a).

Because the use of the 0/1 mask for the ABC plays a role similar to that of re-sampling techniques, we compare the representations of the proposed algorithm with those of SMOTE (an oversampling technique) [7]+CNN and random undersampling [13]+CNN. Figure 8 (a), (b), and (c) present the t-SNE representations obtained using SMOTE+CNN, undersampling+CNN, and the ABC only. Because re-sampling techniques can only be applied to labeled data, they cannot be combined with the SSL algorithms, and thus they were combined with a CNN instead. SMOTE+CNN and undersampling+CNN learned less separable representations than the ABC only. These results show that using the 0/1 mask instead of re-sampling techniques is more effective because it allows unlabeled data to be utilized. In addition, the 0/1 mask enabled the ABC to be combined with the backbone, so that the ABC could use the high-quality representations learned by the backbone, as shown in Figure 8 (d).

 (a) SMOTE [7]+CNN (b) Undersampling [13]+CNN (c) ABC only (d) ReMixMatch+ABC
Figure 8: t-SNE of the representations of the CIFAR-10 test set using re-sampling+CNN, ABC only, and ReMixMatch+ABC trained on CIFAR-10-LT, γ=100, β=20%
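The 0/1 mask discussed above can be sketched as follows. This is an illustrative reconstruction under the assumption that the mask for a sample of class k is drawn from a Bernoulli distribution whose parameter is inversely proportional to the class size, so that every class contributes equally to the loss in expectation; the `balanced_mask` helper is ours.

```python
import numpy as np

def balanced_mask(labels, class_counts, rng=None):
    """Sample a 0/1 mask so that, in expectation, every class
    contributes equally to the loss of a minibatch.

    For a sample of class k, the mask is drawn from
    Bernoulli(n_min / n_k), where n_k is the size of class k and
    n_min is the size of the smallest class. Majority-class samples
    are thus masked out more often, mimicking re-sampling without
    removing any data point from representation learning.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(class_counts, dtype=float)
    p = counts.min() / counts[labels]  # per-sample keep probability
    return (rng.random(len(labels)) < p).astype(np.float32)
```

For example, with class counts [1000, 10], samples of the smallest class are always kept (p=1), while majority-class samples are kept with probability 0.01; the masked loss is then a class-balanced loss in expectation.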

We also compared the performance of the proposed algorithm with those of SMOTE+CNN and undersampling+CNN. The results in Table 13 show the importance of using unlabeled data for training and of using the high-quality representations obtained from the backbone.

Table 13: Performance of each algorithm in Figure 8 and FixMatch+ABC. The algorithms were trained on CIFAR-10-LT, γ=100, β=20% and tested on the test set of CIFAR-10.
Algorithm Overall Minority
ReMixMatch+ABC 82.4 75.7
FixMatch+ABC 81.1 72.0
Without training backbone (ABC only) 68.7 56.2
SMOTE+CNN 60.8 46.1
Undersampling+CNN 48.7 55.8

Figure 9 presents the confusion matrices of FixMatch, FixMatch+DARP+cRT, and FixMatch+ABC trained on CIFAR-10-LT, γ=100, β=20%. Similar to Figure 4 of the main paper, FixMatch and FixMatch+DARP+cRT often misclassified test data points in the minority classes (e.g., classes 8 and 9 into classes 0 and 1). In contrast, FixMatch+ABC classified the test data points in the minority classes with higher accuracy, and produced a significantly more balanced class distribution than FixMatch and FixMatch+DARP+cRT.

 (a) FixMatch (b) FixMatch+DARP+cRT (c) FixMatch+ABC
Figure 9: Confusion matrices of the predictions on the test set of CIFAR-10
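The minority-class accuracies reported throughout these tables can be read off such confusion matrices. A small sketch follows, assuming minority-class accuracy is the mean per-class recall over the minority classes; the helper names are ours.

```python
import numpy as np

def per_class_recall(conf):
    """Per-class recall, i.e., the diagonal of a row-normalized
    confusion matrix, where conf[i, j] counts true-class-i samples
    predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    return np.diag(conf) / conf.sum(axis=1)

def minority_accuracy(conf, minority_classes):
    """Average recall over a given set of minority classes."""
    return per_class_recall(conf)[list(minority_classes)].mean()
```

A balanced predictor, such as the one ABC aims for, shows a strong diagonal across all rows of the confusion matrix, not only the majority-class rows.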

Figure 10 presents the confusion matrices of the predictions on the unlabeled data. Similar to Figure 9 and Figure 4 of the main paper, FixMatch+ABC and ReMixMatch+ABC classified the unlabeled data points in the minority classes with higher accuracy, and produced significantly more balanced pseudo labels than the other algorithms. By using these balanced pseudo labels for training, the proposed algorithm could make more balanced predictions on the test set.

 (a) FixMatch (b) FixMatch+DARP+cRT (c) FixMatch+ABC
 (d) ReMixMatch (e) ReMixMatch+DARP+cRT (f) ReMixMatch+ABC
Figure 10: Confusion matrices of the predictions on the unlabeled data of CIFAR-10

Appendix K Detailed comparison between the end-to-end training of the proposed algorithm and decoupled learning of representations and a classifier

Although FixMatch+DARP+cRT and ReMixMatch+DARP+cRT also use the representations learned by ReMixMatch [3] and FixMatch [29], they showed worse performance than the proposed algorithm. The performance gap between FixMatch(ReMixMatch)+DARP+cRT and the proposed algorithm results from the following differences between the ABC and DARP+cRT. First, whereas DARP+cRT decouples the learning of representations from the training of a classifier, the ABC is trained end-to-end, interacting with the representations that the backbone learns. Second, DARP+cRT does not use unlabeled data for training its classifier after representation learning is finished, while the ABC is trained with unlabeled data through consistency regularization so that decision boundaries can be placed in low-density regions. To verify these points, we compare in Figure 11 the validation loss curves of the algorithms based on end-to-end training and on decoupled learning of representations and a classifier. We recorded the validation loss for 100 epochs after the representations were fixed, where 1 epoch was set to 500 iterations. For the proposed algorithm, we recorded the validation loss of the last 100 epochs. In Figure 11 (a) and (b), we can see that the validation loss of the algorithms based on decoupled learning tended to increase after a few epochs. The validation loss was reduced by conducting consistency regularization (C/R) using unlabeled data, but it still tended to increase. In the case of ReMixMatch+DARP+cRT+C/R* and FixMatch+DARP+cRT+C/R*, which do not fix the representations (algorithms marked with *), the high-quality representations learned by the backbone were gradually replaced by representations learned with a re-balanced classifier, which caused overfitting to the minority classes. We can observe a similar tendency in Figure 11 (c) under the supervised learning setting.
In contrast, the validation losses of ReMixMatch+ABC, FixMatch+ABC, and the proposed algorithm under the supervised learning setting decreased steadily and reached the lowest values. The performances of the algorithms based on end-to-end training and on decoupled learning of representations and a classifier are summarized in Table 14.

 (a) With ReMixMatch (b) With FixMatch (c) Supervised setting
Figure 11: Validation loss graphs of algorithms based on end-to-end training and decoupled learning of representations and a classifier, measured on the test set of CIFAR-10. The algorithms in (a) and (b) were trained on the training set of CIFAR-10-LT with γ=100, β=20%, N₁=1000, and the algorithms in (c) were trained on the training set of CIFAR-10-LT with γ=100, β=100%, N₁=5000. C/R and * in the graphs indicate consistency regularization and non-fixed representations, respectively.
Table 14: Performance of the algorithms based on end-to-end training and decoupled learning of representations and a classifier. The algorithms were trained on the training set described in the caption of Figure 11 and tested on the test set of CIFAR-10.
Algorithm Overall Minority
Under the semi-supervised learning setting
ReMixMatch+ABC (end-to-end training) 82.4 75.7
ReMixMatch+DARP+cRT+C/R* (decoupled learning) 80.6 71.4
ReMixMatch+DARP+cRT+C/R (decoupled learning) 79.5 70.4
ReMixMatch+DARP+cRT (decoupled learning) 78.5 66.4
FixMatch+ABC (end-to-end training) 81.1 72.0
FixMatch+DARP+cRT+C/R* (decoupled learning) 80.3 71.6
FixMatch+DARP+cRT+C/R (decoupled learning) 78.7 68.3
FixMatch+DARP+cRT (decoupled learning) 78.1 66.6
Under the supervised learning setting
End-to-end training of CNN with the ABC 84.9 80.6
cRT* (decoupled learning of representations and the classifier of CNN) 81.1 79.6
cRT (decoupled learning of representations and the classifier of CNN) 80.0 79.9
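For reference, the decoupled cRT baseline compared above can be sketched as follows: the backbone's representations are frozen, and only a linear classifier head is re-trained in a class-balanced way. This is an illustrative numpy reconstruction under our own simplifications (inverse-frequency loss re-weighting in place of class-balanced sampling, to which it is equivalent in expectation; `crt_decoupled` and its hyperparameters are ours), not the authors' implementation.

```python
import numpy as np

def crt_decoupled(features, labels, n_classes, n_steps=200, lr=0.1):
    """Classifier re-training (cRT) sketch: given frozen features
    from a pre-trained backbone, fit a linear softmax classifier
    with inverse-class-frequency sample weights via full-batch
    gradient descent."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    W = np.zeros((X.shape[1], n_classes))
    counts = np.bincount(y, minlength=n_classes).astype(float)
    w = counts.mean() / counts[y]  # inverse-frequency sample weights
    onehot = np.eye(n_classes)[y]
    for _ in range(n_steps):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = X.T @ ((p - onehot) * w[:, None]) / len(y)
        W -= lr * grad
    return W
```

The key contrast with the ABC is that here the features never change while the classifier is re-balanced, whereas the ABC's balanced loss also shapes the backbone's representations throughout end-to-end training.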

Appendix L Ablation study for FixMatch [29] + ABC on CIFAR-10

The results in Table 15 show a similar tendency to that for ReMixMatch+ABC in Section 4.5 of the main paper.

Table 15: Ablation study for FixMatch+ABC on CIFAR-10-LT, γ=100, β=20%
Ablation study Overall Minority
FixMatch+ABC (proposed algorithm) 81.1 72.0
Without gradually decreasing the parameter of B(·) for consistency regularization 80.2 70.1
Without consistency regularization for the ABC 76.2 60.9
Without using the 0/1 mask for the consistency regularization loss L_con 74.9 58.8
Without using the 0/1 mask for the classification loss L_cls 77.1 62.7
Without using the confidence threshold τ for consistency regularization 79.2 67.8
Using hard pseudo labels for consistency regularization 78.8 68.0
Without training backbone (ABC only) 68.7 56.2
Training the ABC with a re-weighting technique 80.3 70.5
Decoupled training of the backbone and ABC 77.4 65.0