
SimPro: A Simple Probabilistic Framework
Towards Realistic Long-Tailed Semi-Supervised Learning

Chaoqun Du    Yizeng Han    Gao Huang
Abstract

Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched. Current approaches in this sphere often presuppose rigid assumptions regarding the class distribution of unlabeled data, thereby limiting the adaptability of models to only certain distribution ranges. In this study, we propose a novel approach, introducing a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data. Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of conditional and marginal class distributions. This separation facilitates a closed-form solution for class distribution estimation during the maximization phase, leading to the formulation of a Bayes classifier. The Bayes classifier, in turn, enhances the quality of pseudo-labels in the expectation phase. Remarkably, the SimPro framework not only comes with theoretical guarantees but is also straightforward to implement. Moreover, we introduce two novel class distributions, broadening the scope of the evaluation. Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.


Figure 1: The general idea of SimPro addressing the ReaLTSSL problem. (a) Current methods typically rely on predefined or assumed class distribution patterns for unlabeled data, limiting their applicability. (b) In contrast, our SimPro embraces a more realistic scenario by introducing a simple and elegant framework that operates effectively without making any assumptions about the distribution of unlabeled data. This paradigm shift allows for greater flexibility and applicability in diverse ReaLTSSL scenarios.

1 Introduction

Semi-supervised learning (SSL) offers a viable solution to the scarcity of labeled data by leveraging unlabeled data (Tarvainen & Valpola, 2017; Berthelot et al., 2019b; Miyato et al., 2018; Sohn et al., 2020). Common SSL algorithms typically generate pseudo-labels for unlabeled data to facilitate model training (Lee et al., 2013). However, real-world data often adheres to a long-tailed distribution, leading to a predominant focus on majority classes and resulting in imbalanced pseudo-labels (Liu et al., 2019; Kang et al., 2020; Du et al., 2024). This phenomenon, known as long-tailed semi-supervised learning (LTSSL), presents significant challenges in the field. Traditional LTSSL methods (Lai et al., 2022; Lee et al., 2021; Wei et al., 2022, 2021; Kim et al., 2020) assume consistency in class distributions between labeled and unlabeled data, an often unrealistic premise. In practice, class distributions can be inconsistent and unknown, especially as new data are continuously collected or drawn from different tasks. This ongoing integration process can lead to significant shifts in class distributions.

In response to these challenges, the concept of realistic long-tailed semi-supervised learning (ReaLTSSL), which aims at addressing the unknown and mismatched class distributions, has garnered significant attention (Kim et al., 2020; Wei et al., 2021; Oh et al., 2022; Wei & Gan, 2023). Notably, the recent works ACR (Wei & Gan, 2023) and CPE (Ma et al., 2024) pre-define anchor distributions for unlabeled data (Fig. 1 (a)). ACR estimates the distributional distance to adapt its consistency regularization, while CPE trains multiple classifiers, each tailored to a specific class distribution. However, these approaches presuppose certain knowledge about the unlabeled data distribution, preventing their application in real-world scenarios where the anchor distributions may not represent all possible distributions. Furthermore, the prevailing techniques often employ multi-branch frameworks and introduce additional loss functions, adding complexity and limiting their generality.

To address these limitations, we propose a Simple Probabilistic (SimPro) framework for ReaLTSSL. We revisit pseudo-label-based SSL techniques through the lens of the Expectation-Maximization (EM) algorithm. The EM algorithm, a well-known iterative method in statistical modeling, is particularly relevant in SSL for handling unobserved latent variables, such as the pseudo-labels of unlabeled data. The E-step entails generating pseudo-labels with the model, while the M-step involves model training using both labeled and unlabeled data. In the context of unknown and mismatched class distributions, the E-step may produce biased pseudo-labels, diminishing the algorithm's effectiveness. Our SimPro avoids fixed assumptions about the unlabeled data distribution, instead innovatively extending the EM algorithm for ReaLTSSL. Specifically, we explicitly decouple the modeling of conditional and marginal distributions. This separation enables a closed-form solution for the marginal distribution in the M-step. Subsequently, this solution is employed to train a Bayes classifier. The Bayes classifier, in turn, improves the quality of the pseudo-labels generated in the E-step. Not only does SimPro offer high effectiveness, but it is also easy to implement, requiring minimal code modifications.

Moreover, we expand upon existing evaluation methods (Oh et al., 2022), which primarily focus on three known class distributions (consistent, uniform, and reversed), by introducing two novel realistic scenarios: middle and head-tail distributions (Fig. 1 (b)). The middle distribution concentrates classes in the middle range of the labeled data's class spectrum, whereas the head-tail distribution concentrates them at both extremes. Notably, our method is theoretically general enough to handle any other distribution pattern, since no prior assumptions are required.

We summarize our contributions as follows:

1. We present SimPro, a simple probabilistic framework tailored for realistic long-tailed semi-supervised learning. This framework does not presuppose any knowledge about the class distribution of unlabeled data. It hinges on the explicit estimation and utilization of class distributions within the EM algorithm. SimPro effectively mitigates the challenges posed by unknown and mismatched class distributions, stepping towards a more realistic LTSSL scenario.

2. We introduce two novel class distribution patterns for unlabeled data, complementing the existing three standard ones. This expansion facilitates a more comprehensive and realistic evaluation of ReaLTSSL algorithms, bridging the gap between theoretical models and practical applications.

3. Comprehensive experiments on five commonly used benchmarks (CIFAR10/100-LT, STL10-LT, and ImageNet-127/1k) and five distinct class distributions validate that our SimPro consistently achieves SOTA performance.

2 Related Work

Semi-supervised learning (SSL) has gained prominence through a subset of algorithms that use unlabeled data to enhance model performance. This enhancement primarily occurs through the generation of pseudo-labels, effectively forming a self-training loop (Miyato et al., 2018; Berthelot et al., 2019a, b; Huang & Du, 2022; Wang et al., 2023). Modern SSL methodologies, such as those presented in (Berthelot et al., 2019a; Sohn et al., 2020), integrate pseudo-labeling with consistency regularization. This integration fosters uniform predictions across varying representations of a single image, thereby bolstering the robustness of deep networks. A notable example, FixMatch (Sohn et al., 2020), has demonstrated exceptional results in image recognition tasks, outperforming competing SSL approaches.

The efficacy of SSL algorithms heavily relies on the quality of the pseudo-labels they generate. However, in the LTSSL scenario, both labeled and unlabeled data follow a long-tailed class distribution, so conventional SSL methods are prone to produce biased pseudo-labels, which significantly degrades their effectiveness.

Long-tailed semi-supervised learning has garnered considerable interest due to its relevance in numerous real-world applications. In this domain, DARP (Kim et al., 2020) and CReST (Wei et al., 2021) aim to mitigate the issue of biased pseudo-labels by aligning them with the class distribution of labeled data. Another notable approach (Lee et al., 2021) employs an auxiliary balanced classifier, which is trained through the down-sampling of majority classes, to enhance generalization capabilities. These algorithms have markedly improved performance but operate under the assumption of identical class distributions for labeled and unlabeled data.

In addressing realistic LTSSL challenges, DASO (Oh et al., 2022) innovates by adapting the proportion of linear and semantic pseudo-labels to the unknown class distribution of unlabeled data. Its success largely depends on the discriminative quality of the representations, a factor that becomes less reliable under long-tailed distributions. ACR (Wei & Gan, 2023), on the other hand, attempts to refine consistency regularization by pre-defining distribution anchors and achieves promising results. CPE (Ma et al., 2024) trains multiple anchor experts, each tasked with modeling one distribution. However, such anchor-distribution-based approaches might not encompass all potential class distribution scenarios, and their complexity could hinder broader application.

3 Method

In this section, we first introduce the problem formulation of ReaLTSSL (Sec. 3.1), setting the stage for our method. Subsequently, we delve into the proposed simple and probabilistic framework, SimPro (Sec. 3.2). We provide implementation details in Sec. 3.3 to elucidate SimPro further.

3.1 Preliminaries

Problem formulation.

We begin by outlining the formulation of the realistic long-tailed semi-supervised learning (ReaLTSSL) problem, laying the groundwork for our approach. The setup involves a labeled dataset $\mathcal{D}_{l}=\{(x_{i},y_{i})\}_{i=1}^{N}$ and an unlabeled dataset $\mathcal{D}_{u}=\{x_{i}\}_{i=1}^{M}$, where $x_{i}\in\mathbb{R}^{d}$ represents the $i$-th data sample and $y_{i}\in\{0,1\}^{K}$ is the corresponding one-hot label, with $K$ denoting the number of classes. The objective of ReaLTSSL is to train a classifier $F_{\bm{\theta}}:\mathbb{R}^{d}\mapsto\{0,1\}^{K}$, parameterized by $\bm{\theta}$.

Assumption 1.

We assume a realistic scenario where labeled, unlabeled, and test data share the same conditional distribution $P(x|y)$, yet may exhibit distinct marginal distributions $P(y)$. Crucially, the marginal distribution $P(y)$ of the unlabeled data remains unknown.

Further, we consider five diverse distributions for the unlabeled data (Fig. 1), reflecting various real-world situations.

The EM algorithm in semi-supervised learning.

In SSL, pseudo-labeling is a key technique for leveraging unlabeled data. This involves creating pseudo-labels for the unlabeled data using the model and then training the model with both the pseudo-labeled and ground-truth data. This aligns with the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), where the E-step generates pseudo-labels, and the M-step updates the parameters using the pseudo-labels, maximizing the likelihood function.

Our method builds on the popular FixMatch algorithm (Sohn et al., 2020), which integrates consistency regularization in the standard SSL setting. Pseudo-labels are generated from weakly-augmented unlabeled data and, subject to a confidence threshold, used to supervise the strongly-augmented views of the same samples. The loss for unlabeled data is

$$\mathcal{L}_{u}(x_{i})=\mathbb{I}(\max(q_{\omega})\geq t)\cdot\mathcal{H}(\arg\max(q_{\omega}),q_{\Omega}),\qquad(1)$$

where $q_{\omega}$ and $q_{\Omega}$ represent the prediction logits for weakly and strongly augmented samples, respectively, $\mathcal{H}$ denotes the cross-entropy loss, and $t$ is the confidence threshold.
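A minimal PyTorch sketch may help make Eq. 1 concrete; logits_w and logits_s are illustrative names (not symbols from the paper) for the model outputs on the weakly and strongly augmented views:

import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_w, logits_s, t=0.95):
    # Pseudo-labels come from the weakly augmented view (no gradient flows here).
    probs_w = F.softmax(logits_w.detach(), dim=-1)
    conf, pseudo = probs_w.max(dim=-1)
    mask = (conf >= t).float()  # I(max(q_w) >= t): keep confident samples only
    # Cross-entropy H(argmax(q_w), q_Omega) on the strongly augmented view.
    loss = F.cross_entropy(logits_s, pseudo, reduction='none')
    return (loss * mask).mean()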

Long-tailed learning.

In typical SSL scenarios, the assumption of identical distributions for labeled, unlabeled, and test data often prevails. However, long-tailed learning tasks usually involve imbalanced training sets and balanced test sets, leading to discrepancies in the prior distribution $P(y)$ between training and testing data. Some studies (Ren et al., 2020; Menon et al., 2021; Hong et al., 2021) tackle this via Bayesian inference, introducing a prior distribution over class labels:

$$\mathcal{L}_{l}(x)=-\log P(y|x;\bm{\theta},\bm{\pi})=-\log\frac{P(y;\bm{\pi})P(x|y;\bm{\theta})}{P(x)}=-\log\frac{\phi_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))},\qquad(2)$$

where $\phi_{y}$ denotes the class frequency in the training or test set, $\bm{\pi}$ is the class distribution parameter, and $\bm{\theta}$ is the parameter of $P(x|y)/P(x)$. Here we omit the parameter of $P(x)$ for simplicity. The detailed mathematical derivation is provided in App. A.

While supervised learning allows for a known distribution parameter $\bm{\pi}$, enabling a direct application to model $P(y)$ and explicit decoupling from $\bm{\theta}$, ReaLTSSL poses a greater challenge as the prior $\bm{\pi}$ for unlabeled data is unknown. This necessitates innovative approaches that adapt to the imbalanced data while maintaining model efficacy.
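In code, the prior-adjusted loss of Eq. 2 amounts to adding log class priors to the logits before the softmax. A hedged sketch, where phi is assumed to hold the empirical class frequencies:

import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, phi):
    # Adding log(phi) to the logits realizes
    # -log[ phi_y * exp(f(x,y)) / sum_y' phi_y' * exp(f(x,y')) ]  (Eq. 2).
    adjusted = logits + torch.log(phi)
    return F.cross_entropy(adjusted, targets)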

Figure 2: The SimPro Framework Overview. This framework distinctively separates the conditional and marginal (class) distributions. In the E-step (top), pseudo-labels are generated using the current parameters $\bm{\theta}$ and $\bm{\pi}$. In the subsequent M-step (bottom), these pseudo-labels, along with the ground-truth labels, are utilized to compute the cross-entropy loss (refer to Eq. 13), facilitating the optimization of network parameters $\bm{\theta}$ via gradient descent. Concurrently, the marginal distribution parameter $\bm{\pi}$ is recalculated using a closed-form solution based on the generated pseudo-labels (as detailed in Eq. 7).

3.2 SimPro Framework

Framework overview.

In the realistic long-tailed semi-supervised learning (ReaLTSSL) context, the conventional assumption of independent and identically distributed (i.i.d.) labeled and unlabeled data is no longer valid. Moreover, the marginal (class) distribution $P(y)$ of the unlabeled data may be inconsistent with that of the labeled data and remains unknown, which challenges traditional SSL frameworks.

To overcome this, we introduce SimPro, an elegant and effective probabilistic framework adapted to the unique ReaLTSSL setting. Illustrated in Fig. 2, SimPro distinctively decouples $\bm{\pi}$ and $\bm{\theta}$, unlike traditional SSL methods (Sohn et al., 2020). In the E-step, we generate pseudo-labels using the parameters $\bm{\pi}$ and $\bm{\theta}$ obtained from the previous M-step. The M-step then models the conditional distribution $P(x|y)$ using network parameters $\bm{\theta}$, which are optimized through gradient descent. Simultaneously, we derive a closed-form solution for the class distribution $P(y)$, represented by $\bm{\pi}$.

It is worth noting that the treatment of $\bm{\pi}$ in our framework is not heuristic. It is firmly rooted in probabilistic modeling and the principles of the EM algorithm, providing theoretical soundness as substantiated in Props. 1 and 2.

Probabilistic model.

In addressing the ReaLTSSL challenge, we adopt an Expectation-Maximization (EM) approach, underpinned by a robust probabilistic model. The model is governed by the fundamental principles of conditional probability, as shown in:

$$P(\bm{y},\bm{x};\bm{\theta},\bm{\pi})=P(\bm{y}|\bm{x};\bm{\theta},\bm{\pi})P(\bm{x}).\qquad(3)$$

Here, we do not explicitly parameterize $P(x)$, as per the independence of parameters through conditional parameterization (Koller & Friedman, 2009). Thus, when $x$ is not a condition, the parameters of the relevant notions omit the parameters of $P(x)$, as in $P(x)$, $P(x|y)$, $P(x,y)$, etc. According to Sec. 3.1, this may lead to a potential misunderstanding, as the equation $P(x)=\sum_{y}P(x|y;\bm{\theta})P(y;\bm{\pi})$ seems to suggest that $P(x)$ is parameterized by $\bm{\theta}$ and $\bm{\pi}$, which is not the case. The detailed mathematical derivation is provided in App. A.

We focus on estimating the parameters $\bm{\theta}$ and $\bm{\pi}$, pivotal for learning a discriminative model. Consequently, we concentrate on the terms dependent on $\bm{\theta}$ and $\bm{\pi}$, sidelining those independent of these parameters.

The complete data log-likelihood is thus articulated as:

$$P(\bm{y}|\bm{x};\bm{\theta},\bm{\pi})=\prod_{i=1}^{N}P(y_{i}|x_{i};\bm{\theta},\bm{\pi}_{l})\prod_{j=1}^{M}P(y_{j}|x_{j};\bm{\theta},\bm{\pi}_{u}),\qquad(4)$$

where $\bm{\pi}=\{\bm{\pi}_{l},\bm{\pi}_{u}\}$ signifies the class distributions for labeled and unlabeled data, respectively, with $N$ and $M$ representing the numbers of labeled and unlabeled samples.

E-step (generating pseudo-labels).

By Eq. 4, the expected complete-data log-likelihood ($\mathcal{Q}$ function) is derived from the preceding iteration's parameters $\bm{\theta}^{\prime}$ and $\bm{\pi}^{\prime}$:

$$\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}^{\prime},\bm{\pi}^{\prime})=\mathbb{E}_{\bm{y}|\bm{x};\bm{\theta}^{\prime},\bm{\pi}^{\prime}}\left[\log P(\bm{y},\bm{x};\bm{\theta},\bm{\pi})\right]=\sum_{i}\log P(y_{i}|x_{i};\bm{\theta},\bm{\pi}_{l})+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\log P(y|x_{j};\bm{\theta},\bm{\pi}_{u}).\qquad(5)$$

The E-step involves generating soft pseudo-labels $P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ under the current $\bm{\theta}^{\prime}$ and $\bm{\pi}^{\prime}$. These soft pseudo-labels are specifically defined by Eq. 10, which is detailed in Prop. 2. In the subsequent M-step, these pseudo-labels are used alongside the one-hot labels of the labeled data to compute the cross-entropy loss.

M-step (optimizing $\bm{\theta}$ and $\bm{\pi}$).

The M-step focuses on maximizing the expected complete-data log-likelihood ($\mathcal{Q}$-function) with respect to the parameters $\bm{\theta}$ and $\bm{\pi}$.

(a) Optimization of $\bm{\pi}$: The closed-form solution for $\bm{\pi}$ can be derived directly from the $\mathcal{Q}$-function (Eq. 5). Specifically, the terms involving $\bm{\pi}$ in $\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ are given by

$$\sum_{i}\log P(y_{i};\bm{\pi}_{l})+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\log P(y;\bm{\pi}_{u}).\qquad(6)$$

Proposition 1 (Closed-form Solution for $\bm{\pi}$).

The optimal $\bm{\hat{\pi}}$ that maximizes $\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ is

$$\bm{\hat{\pi}}_{l}=\frac{1}{N}\sum_{i=1}^{N}y_{i},\qquad\bm{\hat{\pi}}_{u}=\frac{1}{M}\sum_{j=1}^{M}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime}).\qquad(7)$$
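In practice this update is just an average over labels and soft pseudo-labels. A sketch, assuming y_l stores one-hot labels of shape (N, K) and pseudo_u stores the soft pseudo-labels $P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ of shape (M, K):

import torch

def estimate_pi(y_l, pseudo_u):
    # Eq. (7): pi_l is the empirical class frequency of the labeled set;
    # pi_u averages the soft pseudo-labels over the unlabeled set.
    pi_l = y_l.float().mean(dim=0)   # (K,)
    pi_u = pseudo_u.mean(dim=0)      # (K,)
    return pi_l, pi_u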

(b) Optimization of $\bm{\theta}$: The network parameters $\bm{\theta}$, unlike $\bm{\pi}$ with its closed-form solution, are optimized using standard stochastic gradient descent (SGD). Combining with Sec. 3.1, the terms involving $\bm{\theta}$ in $\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ are

$$\Big(\sum_{i}+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\Big)\log\frac{P(x|y;\bm{\theta})}{P(x)}=\Big(\sum_{i}+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\Big)\log\frac{\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))},\qquad(8)$$

which simplifies to the supervised scenario of Sec. 3.1 by treating $P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ as soft labels. Maximizing Eq. 8 corresponds to minimizing the cross-entropy loss. Here, $\phi_{y^{\prime}}$ is interpreted as the estimated overall frequency of class $y^{\prime}$. The optimization of model parameters $\bm{\theta}$ using the overall frequency vector $\bm{\phi}$ is crucial for learning a Bayes classifier.

Proposition 2 (Bayes Classifier).

In conjunction with the high-confidence filtering (Eq. 1), the optimal $\bm{\hat{\phi}}$ for learning a Bayes classifier is mathematically derived as:

$$\bm{\hat{\phi}}=[\hat{\phi}_{1},\hat{\phi}_{2},\cdots,\hat{\phi}_{K}]=\frac{1}{N+M}\Big(\sum_{i}y_{i}+\sum_{j}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\Big).\qquad(9)$$

Subsequently, with the model parameter $\bm{\theta}$ optimized using $\bm{\hat{\phi}}$, the corresponding Bayes classifier, for an unlabeled or test dataset with an estimated or uniform class distribution respectively, is defined by:

$$P(y|x;\bm{\theta},\bm{\hat{\pi}})=\frac{P(y;\bm{\hat{\pi}})\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}P(y^{\prime};\bm{\hat{\pi}})\exp(f_{\bm{\theta}}(x,y^{\prime}))},\qquad(10)$$
$$\text{or}\quad P(y|x;\bm{\theta})=\frac{\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}.\qquad(11)$$
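Since both variants are softmaxes differing only in the prior, a single routine covers them; a sketch with illustrative names (pi=None recovers the uniform-prior case of Eq. 11):

import torch
import torch.nn.functional as F

def bayes_predict(logits, pi=None):
    # Eq. (10): posterior under an estimated class prior pi;
    # Eq. (11): with pi=None, the plain softmax for a uniform test prior.
    if pi is not None:
        logits = logits + torch.log(pi)
    return F.softmax(logits, dim=-1)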

Building upon Prop. 2, it is crucial to acknowledge that the parameter vector $\bm{\phi}$ is vital for learning Bayes classifiers. Consequently, to delve deeper into the theoretical foundations, we evaluate the impact of $\bm{\phi}$ on the model's performance. In line with the principles of online decision theory, we establish a regret bound for the decision error rate on the test set, denoted as $P(e;\bm{\phi})$. Our analysis is simplified by concentrating on a binary classification scenario, where the labels $y$ belong to $\{-1,+1\}$.

Proposition 3 (Regret Bound).

Let $\bm{\phi}^{*}$ denote the vector $\bm{\phi}$ obtained in Eq. 9 when pseudo-labels are replaced by ground-truth labels. For the decision error rate $P(e;\bm{\phi})$ on the test set, the regret bound is expressed as:

$$P(e;\bm{\hat{\phi}})-\inf_{\bm{\phi}}P(e;\bm{\phi})\leq\frac{1}{2\phi^{*}_{+1}\phi^{*}_{-1}}|\hat{\phi}-\phi^{*}|,\qquad(12)$$

where $|\hat{\phi}-\phi^{*}|=|\hat{\phi}_{+1}-\phi^{*}_{+1}|=|\hat{\phi}_{-1}-\phi^{*}_{-1}|$.

Prop. 3 illustrates that the regret bound is primarily governed by the first-order term of the estimation deviation. Additionally, it is inversely proportional to the ground truth $\bm{\phi}^{*}$, highlighting the learning challenges associated with imbalanced training data from a regret-bound perspective.
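As an illustrative (not paper-reported) instance of Eq. 12, take a ground truth $\bm{\phi}^{*}=(0.9,0.1)$ and an estimation deviation $|\hat{\phi}-\phi^{*}|=0.02$:
$$P(e;\bm{\hat{\phi}})-\inf_{\bm{\phi}}P(e;\bm{\phi})\leq\frac{0.02}{2\times 0.9\times 0.1}\approx 0.111,$$
whereas the same deviation under a balanced $\bm{\phi}^{*}=(0.5,0.5)$ yields only $0.02/0.5=0.04$, making the extra cost of imbalance explicit.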

Algorithm 1 Pseudocode of SimPro in a PyTorch-like style.

# N_e: (K,), where N_e[k] denotes the number of labeled data in class k in one epoch
# pi_u: (K,), the class distribution parameters of unlabeled samples
# phi: (K,), the overall class frequency
# f: deep network parameterized by theta
# alpha, tau, t, m: hyper-parameters
# CE: CrossEntropyLoss
# aug_w, aug_s: weak and strong augmentation

pi_u.init(uniform)
phi.init(consistent)
for epoch in range(epochs):
    pi_e = zeros(K)  # temporary estimation of pi_u
    # load labeled and unlabeled samples
    for (x_l, y_l), x_u in zip(loader_l, loader_u):
        # E-step: generating pseudo-labels
        lgt_l = f.forward(aug_w(x_l))
        lgt_w = f.forward(aug_w(x_u)).detach()
        lgt_s = f.forward(aug_s(x_u))
        # Bayes classifier (Eq. (10))
        psd_lbs = softmax(lgt_w + tau * log(pi_u), dim=-1)
        # keep only pseudo-labels with high confidence
        mask = max(psd_lbs, dim=-1)[0].ge(t)
        # M-step: solving pi and phi, optimizing theta
        # accumulate statistics for the closed-form solution of pi_u, Eq. (7)
        pi_e += sum(psd_lbs[mask], dim=0)
        # optimize f (theta) with the losses of Eq. (13)
        loss_l = CE(lgt_l + tau * log(phi), y_l)
        loss_u = mean(CE(lgt_s + tau * log(phi),
                         psd_lbs, reduction='none') * mask)
        loss = alpha * loss_l + loss_u
        loss.backward()
        update(theta)
    # update pi_u and phi
    phi_e = (pi_e + N_e) / sum(pi_e + N_e)  # Eq. (9)
    # moving average
    phi = m * phi + (1 - m) * phi_e
    pi_u = m * pi_u + (1 - m) * pi_e / sum(pi_e)
Table 1: Top-1 accuracy (%) on CIFAR10-LT ($N_1=500$, $M_1=4000$) with different class imbalance ratios $\gamma_l$ and $\gamma_u$ under five different unlabeled class distributions. † indicates that we reproduce ACR without anchor distributions for a fair comparison.
Method                         | consistent          | uniform             | reversed              | middle              | head-tail
                               | γl=150    γl=100    | γl=150    γl=100    | γl=150     γl=100     | γl=150    γl=100    | γl=150    γl=100
                               | γu=150    γu=100    | γu=1      γu=1      | γu=1/150   γu=1/100   | γu=150    γu=100    | γu=150    γu=100
FixMatch (Sohn et al., 2020)   | 62.9±0.36 67.8±1.13 | 67.6±2.56 73.0±3.81 | 59.9±0.82  62.5±0.94  | 64.3±0.63 71.7±0.46 | 58.3±1.46 66.6±0.87
w/ CReST+ (Wei et al., 2021)   | 67.5±0.45 76.3±0.86 | 74.9±0.80 82.2±1.53 | 62.0±1.18  62.9±1.39  | 58.5±0.68 71.4±0.60 | 59.3±0.72 67.2±0.48
w/ DASO (Oh et al., 2022)      | 70.1±1.81 76.0±0.37 | 83.1±0.47 86.6±0.84 | 64.0±0.11  71.0±0.95  | 69.0±0.31 73.1±0.68 | 70.5±0.59 71.1±0.32
w/ ACR† (Wei & Gan, 2023)      | 70.9±0.37 76.1±0.42 | 91.9±0.02 92.5±0.19 | 83.2±0.39  85.2±0.12  | 73.8±0.83 79.3±0.30 | 77.6±0.20 79.3±0.48
w/ SimPro                      | 74.2±0.90 80.7±0.30 | 93.6±0.08 93.8±0.10 | 83.5±0.95  85.8±0.48  | 82.6±0.38 84.8±0.54 | 81.0±0.27 83.0±0.36
Table 2: Top-1 accuracy (%) on CIFAR100-LT and STL10-LT with different class imbalance ratios $\gamma_l$ and $\gamma_u$. Since the ground-truth labels of the unlabeled data in STL10 are unknown, we conduct the experiments by controlling the imbalance ratio of the labeled data. † indicates that we reproduce ACR without anchor distributions for a fair comparison.
CIFAR100-LT (γl=20, N1=50, M1=400)
                               | consistent | uniform   | reversed  | middle    | head-tail
                               | γu=20      | γu=1      | γu=1/20   | γu=20     | γu=20
FixMatch (Sohn et al., 2020)   | 40.0±0.96  | 39.6±1.16 | 36.2±0.63 | 39.7±0.61 | 38.2±0.82
w/ CReST+ (Wei et al., 2021)   | 40.1±1.28  | 37.6±0.88 | 32.4±0.08 | 36.9±0.57 | 35.1±1.10
w/ DASO (Oh et al., 2022)      | 43.0±0.15  | 49.4±0.93 | 44.1±0.25 | 43.1±1.20 | 43.8±0.43
w/ ACR† (Wei & Gan, 2023)      | 40.7±0.57  | 50.2±0.82 | 44.1±0.14 | 42.4±0.47 | 41.1±0.09
w/ SimPro                      | 43.1±0.40  | 52.2±0.16 | 45.5±0.34 | 43.6±0.35 | 44.8±0.56

STL10-LT (γu=N/A, N1=450, M=1×10^5)
                               | γl=10     | γl=20
FixMatch (Sohn et al., 2020)   | 72.4±0.71 | 64.0±2.27
w/ CReST+ (Wei et al., 2021)   | 71.5±0.96 | 68.5±1.88
w/ DASO (Oh et al., 2022)      | 78.4±0.80 | 75.3±0.44
w/ ACR† (Wei & Gan, 2023)      | 83.0±0.32 | 81.5±0.25
w/ SimPro                      | 84.5±0.39 | 82.5±0.25

3.3 Implementation Details

Training objective for optimizing $\bm{\theta}$.

In SimPro, the E-step primarily involves generating pseudo-labels using the parameters $\bm{\theta}$ and $\bm{\pi}$. Consequently, in the M-step, we first focus on optimizing the network parameters $\bm{\theta}$ guided by Eq. 8 via stochastic gradient descent (SGD). Building on the FixMatch algorithm (Sohn et al., 2020), the overall training objective is formulated as:

$$\mathcal{L}=\alpha\mathcal{L}_{l}+\mathcal{L}_{u},\qquad(13)$$

where $\mathcal{L}_{l}$ and $\mathcal{L}_{u}$ represent the losses on labeled and unlabeled data, respectively. The hyper-parameter $\alpha$ acts as a scaling factor, the specifics of which are elucidated later.

For $\mathcal{L}_{l}$, we modify the standard cross-entropy loss following Eq. 8:

$$\mathcal{L}_{l}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(f_{\bm{\theta}}(x_{i},y_{i}))}{\sum_{y^{\prime}}\phi_{y^{\prime}}^{\tau}\exp(f_{\bm{\theta}}(x_{i},y^{\prime}))},\qquad(14)$$

where $\tau$ is a hyper-parameter enhancing adaptability to long-tailed distributions (Menon et al., 2021), and $B$ is the batch size.

For $\mathcal{L}_{u}$, we implement the standard SSL format (Eq. 1) and adapt it for ReaLTSSL:

$$\mathcal{L}_{u}=-\frac{1}{\mu B}\sum_{j=1}^{\mu B}\mathbb{I}\big(\max_{y}(q_{y})\geq t\big)\sum_{y}q_{y}\log p_{y},\qquad(15)$$

where $\mu$ controls the number of unlabeled samples and $t$ is the confidence threshold. The pseudo-label for the weak augmentation $\omega$ from the Bayes classifier (Eq. 10) is denoted by

$$q_{y}=P(y|\omega(x_{j});\bm{\theta},\bm{\hat{\pi}}),\qquad(16)$$

and the actual prediction $p_{y}$ is obtained using the strong augmentation $\Omega$ and calibrated with $\bm{\phi}$ as in Eq. 8:

$$p_{y}=\frac{\exp(f_{\bm{\theta}}(\Omega(x_{j}),y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}^{\tau}\exp(f_{\bm{\theta}}(\Omega(x_{j}),y^{\prime}))}.\qquad(17)$$
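A compact, runnable rendering of Eqs. 14-17 in the style of Alg. 1; tensor names are assumptions, and the $\tau\log\phi_{y}$ offsets introduced by folding $\bm{\phi}^{\tau}$ into the logits are constant with respect to $\bm{\theta}$, so the gradients coincide with Eqs. 14 and 15:

import torch
import torch.nn.functional as F

def simpro_losses(lgt_l, y_l, lgt_w, lgt_s, pi_u, phi, tau=2.0, t=0.95):
    log_phi = tau * torch.log(phi)
    # Eq. (14): labeled loss with the phi^tau-adjusted denominator.
    loss_l = F.cross_entropy(lgt_l + log_phi, y_l)
    # Eq. (16): soft pseudo-labels q from the Bayes classifier (Eq. 10).
    q = F.softmax(lgt_w.detach() + tau * torch.log(pi_u), dim=-1)
    mask = (q.max(dim=-1).values >= t).float()
    # Eqs. (15) and (17): masked soft cross-entropy on the strong view.
    log_p = F.log_softmax(lgt_s + log_phi, dim=-1)
    loss_u = (-(q * log_p).sum(dim=-1) * mask).mean()
    return loss_l, loss_u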

Moreover, in practical situations, the size $M$ of the unlabeled dataset is generally larger than the size $N$ of the labeled dataset. To ensure a balanced sample size in each iteration, we usually set $\mu=M/N$ in Eq. 15. In specific scenarios, we further adjust the balance factor in Eq. 13 to $\alpha=\mu\cdot N/M<1$. This methodology effectively mitigates overfitting to the labeled data (see Tabs. 6 and 7).
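As an illustrative instance with the App. B settings for STL10-LT, where $\mu=16$ while $M/N\approx 67$ (Sec. 4.2), the scaling factor becomes
$$\alpha=\mu\cdot N/M\approx 16/67\approx 0.24,$$
down-weighting the labeled loss whenever the memory-feasible $\mu$ falls short of $M/N$.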

Closed-form solution of $\bm{\pi}$.

As discussed in Sec. 3.2, the parameter $\bm{\pi}$ of the marginal distribution $P(y)$ has a closed-form solution in the M-step. Therefore, unlike $\bm{\theta}$, which requires SGD optimization, $\bm{\pi}$ (Eq. 7) and $\bm{\phi}$ (Eq. 9) are computed and updated via a moving average during training.

Extended EM algorithm and pseudo-code.

Based on the previous analysis, our SimPro framework can be summarized as an extended EM algorithm, which includes:

· E-step (Eq. 10): generating pseudo-labels using model parameters $\bm{\theta}$ and estimated distribution parameters $\bm{\pi}$;

· M-step: optimizing network parameters $\bm{\theta}$ via SGD using Eq. 8 (in practice, Eq. 13), and solving for the distribution parameters $\bm{\pi}$ and the overall frequency vector $\bm{\phi}$ by Eq. 7 and Eq. 9.

For further clarity, the pseudo-code of SimPro is provided in Alg. 1. The modifications we made to the core training code, in comparison to FixMatch, are highlighted. In SimPro, we incorporate just a single additional line of code in the M-step to compute the closed-form solution of $\bm{\pi}$ (Prop. 1). Furthermore, only four lines of code need to be modified to construct a Bayes classifier (Prop. 2) and to balance the loss between labeled and unlabeled data (via $\alpha$). These minor yet crucial adjustments demonstrate that our SimPro framework is not only grounded in rigorous theoretical derivation but is also straightforward to implement in practice, exemplifying both simplicity and elegance.

4 Experiments

In this section, we first present the main results on various ReaLTSSL benchmarks in Sec. 4.1. More analysis, including the ablation studies and the visualization results, is presented in Sec. 4.2 to further evaluate the effectiveness of our SimPro. For detailed information regarding the experimental setup, please refer to App. B.

Table 3: Top-1 accuracy (%) on ImageNet-127 ($\gamma_l=\gamma_u\approx 286$, $N_1\approx 28$k, $M_1\approx 250$k) and ImageNet-1k ($\gamma_l=\gamma_u=256$, $N_1=256$, $M_1=1024$) with different test class imbalance ratios $\gamma_t$ and image resolutions. † indicates that we reproduce ACR without anchor distributions for a fair comparison. The results for $\gamma_t\approx 286$ are sourced from ACR (Wei & Gan, 2023).
ImageNet-127 (γt≈286)              | 32×32 | 64×64
FixMatch (Sohn et al., 2020)       | 29.7  | 42.3
w/ DARP (Kim et al., 2020)         | 30.5  | 42.5
w/ CReST+ (Wei et al., 2021)       | 32.5  | 44.7
w/ CoSSL (Fan et al., 2022)        | 43.7  | 53.9
w/ ACR (Wei & Gan, 2023)           | 57.2  | 63.6
w/ SimPro                          | 59.1  | 67.0

ImageNet-127 (γt=1)                | 32×32 | 64×64
FixMatch (Sohn et al., 2020)       | 38.7  | 46.7
w/ ACR† (Wei & Gan, 2023)          | 49.5  | 56.1
w/ ACR (Wei & Gan, 2023)           | 50.6  | 57.3
w/ SimPro                          | 55.7  | 63.8

ImageNet-1k (γt=1)                 | 32×32 | 64×64
FixMatch (Sohn et al., 2020)       | –     | –
w/ ACR† (Wei & Gan, 2023)          | 13.2  | 23.4
w/ ACR (Wei & Gan, 2023)           | 13.8  | 23.3
w/ SimPro                          | 19.7  | 25.0

4.1 Results

We first conduct experiments on the four representative benchmark datasets with different class imbalance ratios. We denote the class imbalance ratios of labeled, unlabeled, and test data as $\gamma_l$, $\gamma_u$, and $\gamma_t$, respectively. Our method is compared with five competitive baseline approaches, i.e., FixMatch (Sohn et al., 2020), CReST+ (Wei et al., 2021), DASO (Oh et al., 2022), ACR (Wei & Gan, 2023), and CPE (Ma et al., 2024). Note that for a fair comparison, we first compare with ACR in the ReaLTSSL setting, where the unlabeled class distribution is unknown and inaccessible. Specifically, we compare our vanilla SimPro with ACR's variant that removes its pre-defined anchor distributions, denoted as ACR†. Then we implement SimPro⋆ by incorporating the anchor distributions into our SimPro framework, comparing SimPro⋆ with the original ACR and CPE.

Main results and comparison with SOTA baselines.

The results are presented in Tab. 1 (for CIFAR10-LT), Tab. 2 (for CIFAR100-LT and STL10-LT), and Tab. 3 (for ImageNet-127/1k). It can be concluded that our method consistently outperforms the competitors across all distributions of unlabeled data and achieves SOTA performance. Notably, SimPro exhibits significant performance improvements on our two newly introduced distributions of unlabeled data: middle and head-tail. This substantiates the robust generalization capabilities of SimPro across various distributions that could potentially appear in real-world scenarios.

It is worth noting that compared to CIFAR10/100-LT, STL10-LT is a more challenging dataset that mirrors the real-world data distribution scenarios: an unknown distribution for the unlabeled data. The results in Tab. 2 demonstrate the significant improvements of SimPro over baseline methods.

Moreover, we also conduct experiments on ImageNet-127, whose test set is imbalanced and consistent with the labeled and unlabeled data. However, this is not suitable as a benchmark for long-tailed learning, as biased classifiers tend to perform well in such scenarios, which is precisely what we aim to avoid. Therefore, we resample it to achieve a uniform test distribution ($\gamma_t=1$). The results highlight that our SimPro achieves substantial performance enhancements when evaluated against this balanced test set. Beyond this, we further conduct experiments on ImageNet-1k to validate the performance of our method across a broader range of classes. The results in Tab. 3 demonstrate that our SimPro achieves state-of-the-art performance on ImageNet-1k.

Table 4: The impact of the predefined anchor distributions in ACR and CPE (Ma et al., 2024) on CIFAR10-LT with $\gamma_l=150$, $N_1=500$, and $M_1=4000$. ⋆ denotes that we use the predefined anchor distributions to estimate $P(y|\bm{\pi})$ in our SimPro. See more analysis in the main text and more results in App. C.
          | γu=150     | γu=1    | γu=1/150 | γu=150 | γu=150
          | consistent | uniform | reversed | middle | head-tail
CPE       | 76.8       | 81.0    | 80.8     | –      | –
ACR       | 77.0       | 91.3    | 81.8     | 77.9   | 79.0
SimPro    | 74.2       | 93.6    | 83.5     | 82.6   | 81.0
SimPro⋆   | 80.0       | 94.1    | 85.0     | –      | –

The results of SimPro⋆ using anchor distributions.

To investigate the impact of the anchor distributions in ACR and CPE (Ma et al., 2024), we also incorporate them into our approach, referred to as SimPro⋆. Instead of calculating the distribution distance and adjusting the consistency regularization as in ACR, or employing multiple anchor experts as in CPE, our usage of these anchors is more straightforward: after training for five epochs, we calculate the distance between our estimated distribution $P(y|\bm{\pi})$ and the three anchor distributions (i.e., consistent, uniform, and reversed). This calculation helps us predict the actual distribution and construct the Bayes classifier. We then fix the marginal distribution parameters $\bm{\pi}$ for the remainder of the training.

The results in Tab. 4 indicate that (1) the usage of anchor distributions can significantly enhance the performance of SimPro⋆, which consistently outperforms the original ACR and CPE; (2) our estimation of $\bm{\pi}$ is accurate (Fig. 5 further validates this); (3) when the pre-defined anchors fail to cover the evaluated distributions (middle and head-tail), SimPro outperforms ACR by a large margin; (4) even compared to the original ACR, SimPro exhibits enhanced performance across all scenarios except the consistent distribution. This demonstrates the superior robustness and generalization ability of our method. We believe these advantages guarantee better adaptability of SimPro in real applications and are more valuable than the accuracy improvements obtained when using the anchor distributions.

Evaluation under more imbalance ratios.

Fig. 3 reports the performance of SimPro under more imbalance ratios of unlabeled data. The results indicate that our method consistently outperforms ACR across all imbalance ratios, further substantiating the robustness of our method.

Figure 3: Performance under more imbalance ratios of unlabeled data on CIFAR10-LT with $\gamma_l=150$, $N_1=500$, and $M_1=4000$.
Table 5: Ablation study of estimating the marginal distribution $P(y|\bm{\pi})$ in the M-step (Eq. 7) and using it to construct the Bayes classifier in the E-step (Eq. 10). We conduct the experiments on CIFAR10-LT with $\gamma_l=150$, $N_1=500$, and $M_1=4000$.
Distribution Estimation | γu=150     | γu=1    | γu=1/150 | γu=150 | γu=150
E-step | M-step          | consistent | uniform | reversed | middle | head-tail
✗      | ✗               | 40.7       | 35.3    | 43.2     | 27.1   | 47.7
✗      | ✓               | 64.1       | 92.6    | 78.6     | 64.9   | 74.8
✓      | ✓               | 74.2       | 93.6    | 83.5     | 82.6   | 81.0

4.2 Analysis

Ablation Study.

Table 6: Ablation study of $\mu=M/N$ (Eq. 15) on CIFAR10-LT with $\gamma_l=150$, $N_1=500$, and $M_1=4000$. For the baseline methods without our Bayes classifier, performance drops significantly when we set $\mu=M/N=8$. This large $\mu$ leads to an imbalance between labeled and unlabeled samples in each mini-batch. In contrast, our SimPro is not affected by such imbalance thanks to the Bayes classifier (Prop. 2). Moreover, we effectively leverage the large number of unlabeled data for a more accurate estimate of the marginal distribution parameters $\bm{\pi}$.
            | μ=M/N | γu=150     | γu=1/150 | γu=150 | γu=150
            |       | consistent | reversed | middle | head-tail
FixMatch    | ✗     | 62.9       | 59.9     | 64.3   | 58.3
            | ✓     | 40.7       | 43.2     | 27.1   | 47.7
w/ ACR      | ✗     | 70.9       | 83.2     | 73.8   | 77.6
            | ✓     | 68.7       | 58.9     | 69.7   | 72.4
w/ SimPro   | ✗     | 52.7       | 78.8     | 58.8   | 71.5
            | ✓     | 75.2       | 83.5     | 82.6   | 81.0
Table 7: Ablation study of $\alpha$ for balancing the loss (Eq. 13). The results indicate that $\alpha$ substantially improves the model's performance and prevents the model from overfitting to the labeled data.
   | CIFAR10-LT (γu=1)  | CIFAR100-LT (γu=1) | STL10-LT (γu=N/A)
   | N1=500, M1=4000    | N1=50, M1=400      | N1=450, M=1×10^5
α  | γl=150  | γl=100   | γl=20              | γl=20 | γl=10
✗  | 92.1    | 91.2     | 49.4               | 76.4  | 80.0
✓  | 93.6    | 93.8     | 52.2               | 83.0  | 85.2
Figure 4: Sensitivity analysis of the confidence threshold $t$ on CIFAR100-LT ($\gamma_l=20$, $N_1=50$, $M_1=400$) and CIFAR10-LT ($\gamma_l=150$, $N_1=500$, $M_1=4000$). The optimal performance is consistently achieved across different settings when the threshold is set at $t=0.2$ and $t=0.95$, respectively.
Figure 5: Visualization of the quality of the estimated distribution on CIFAR10-LT with $\gamma_l=150$, $N_1=500$, and $M_1=4000$. The KL distances reduce to near-zero values after very few epochs.

We conduct a series of ablation studies to validate the effectiveness of different components in SimPro.

(a) Marginal distribution estimation. We first investigate the impact of estimating $P(y|\bm{\pi})$ (M-step, Prop. 1) and its usage in building the Bayes classifier for pseudo-label generation (E-step, Prop. 2). The results in Tab. 5 substantiate the high effectiveness and the necessity of this estimation in driving the success of our approach, thereby validating our methodology both theoretically and practically.

(b) More unlabeled samples (larger $\mu$) in each iteration. As mentioned in the experimental setup, we manually set the ratio between unlabeled and labeled samples in each training iteration as $\mu=M/N$ (Eq. 15). In ACR and FixMatch, this ratio is set to 2. To investigate the impact of this adjustment, we adopt our setting of $\mu=M/N=8$ for ACR and FixMatch. The results in Tab. 6 demonstrate that our method can effectively leverage the unlabeled data for a more accurate estimation of the marginal distribution parameters $\bm{\pi}$. In contrast, the baseline methods suffer from an imbalanced number of labeled/unlabeled samples, owing to the absence of our Bayes classifier derived in Prop. 2.

(c) Scaling factor $\alpha$. As elucidated in Sec. 3.3, we introduce a scaling factor $\alpha=\mu\cdot N/M$ (Eq. 13) to mitigate the risk of overfitting. This measure is primarily due to memory and training environment limitations, which restrict the feasible batch size and $\mu$ when $M\gg N$. In the test configurations detailed in Tab. 7, the ratio $M/N$ is about 30 for CIFAR and 67 for STL10-LT. An insufficient $\mu$ results in a disproportionately high number of labeled data within a mini-batch, potentially leading to overfitting. Hence, we incorporate $\alpha$ to balance the losses between labeled and unlabeled data. Our empirical findings demonstrate that the use of this simple parameter $\alpha$ can significantly enhance model performance, particularly on STL10-LT, where there is a substantial disparity between the sizes of labeled and unlabeled data.

Hyperparameter Sensitivity.

As outlined in App. B, we discover that reducing the threshold value improves performance on the CIFAR100-LT dataset. The rationale behind adjusting the confidence threshold is that an increase in the number of classes typically results in a corresponding decrease in the confidence of the predictions. Consequently, it becomes necessary to lower the threshold to accommodate this change in confidence levels. A sensitivity analysis of the threshold value is presented in Fig. 4. It is consistently observed across different settings that the optimal performance is achieved when the threshold is set at $t=0.2$ and $t=0.95$ for CIFAR100-LT and CIFAR10-LT, respectively. Moreover, to compare with ACR, we also conduct a sensitivity analysis of the confidence threshold $t$ for ACR. The results in Fig. 6 of App. C demonstrate that lowering the threshold does not improve performance for ACR.

Visualization of Estimation Quality.

Our study includes a visualization of the estimated distribution quality in Fig. 5. The vertical axis quantifies the Kullback-Leibler (KL) divergence, which measures the deviation of the estimated distribution from the ground truth. The results indicate a significant improvement in estimation accuracy after only a few training epochs. This empirical evidence validates the effectiveness of the theoretically derived estimation method of distribution, as outlined in Prop. 1.
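A sketch of how such curves can be computed, assuming pi_gt and pi_est are the ground-truth and estimated class distributions recorded at each epoch (names are illustrative):

import torch

def kl_divergence(pi_gt, pi_est, eps=1e-12):
    # KL(pi_gt || pi_est): deviation of the estimate from the ground truth.
    return (pi_gt * ((pi_gt + eps).log() - (pi_est + eps).log())).sum()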

5 Conclusion

In this paper, we introduce SimPro, a novel probabilistic framework for realistic long-tailed semi-supervised learning (ReaLTSSL). This framework represents a pioneering advancement in the field by innovatively enhancing the Expectation-Maximization (EM) algorithm. The key innovation lies in the explicit separation of the estimation process for conditional and marginal distributions. In the M-step, this separation allows for the derivation of a closed-form solution for the marginal distribution parameters. Additionally, SimPro optimizes the parameters of the conditional distribution via gradient descent, facilitating the learning of a Bayes classifier. In the E-step, the Bayes classifier, in turn, generates high-quality pseudo-labels. SimPro is characterized by its elegant theoretical underpinnings and its ease of implementation, which requires only minimal modifications to existing codebases. Furthermore, we incorporate two innovative class distributions specifically for unlabeled data. These distributions provide a more comprehensive and realistic evaluation of ReaLTSSL algorithms. Empirical evidence from various benchmarks demonstrates that SimPro consistently delivers state-of-the-art performance, highlighting its robustness and superior generalization capabilities.

Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0140407, in part by the National Natural Science Foundation of China under Grants 62276150, 62321005 and 42327901.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Berthelot et al. (2019a) Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., and Raffel, C. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint, 2019a.
  • Berthelot et al. (2019b) Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019b.
  • Coates et al. (2011) Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, volume 15, pp.  215–223. PMLR, 2011.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22, 1977.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Du et al. (2024) Du, C., Wang, Y., Song, S., and Huang, G. Probabilistic contrastive learning for long-tailed visual recognition. TPAMI, 2024.
  • Fan et al. (2022) Fan, Y., Dai, D., Kukleva, A., and Schiele, B. Cossl: Co-learning of representation and classifier for imbalanced semi-supervised learning. In CVPR, 2022.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
  • Hong et al. (2021) Hong, Y., Han, S., Choi, K., Seo, S., Kim, B., and Chang, B. Disentangling label distribution for long-tailed visual recognition. In CVPR, 2021.
  • Huang & Du (2022) Huang, G. and Du, C. The high separation probability assumption for semi-supervised learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 52(12):7561–7573, 2022.
  • Kang et al. (2020) Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. In ICLR, 2020.
  • Kim et al. (2020) Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S. J., and Shin, J. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. In NeurIPS, 2020.
  • Koller & Friedman (2009) Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Lai et al. (2022) Lai, Z., Wang, C., Gunawan, H., Cheung, S.-C. S., and Chuah, C.-N. Smoothed adaptive weighting for imbalanced semi-supervised learning: Improve reliability against unknown distribution data. In ICML, 2022.
  • Lee et al. (2013) Lee, D.-H. et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, pp.  896. Atlanta, 2013.
  • Lee et al. (2021) Lee, H., Shin, S., and Kim, H. Abc: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. In NeurIPS, 2021.
  • Liu et al. (2019) Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. Large-scale long-tailed recognition in an open world. In CVPR, 2019.
  • Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • Ma et al. (2024) Ma, C., Elezi, I., Deng, J., Dong, W., and Xu, C. Three heads are better than one: Complementary experts for long-tailed semi-supervised learning. In AAAI, 2024.
  • Menon et al. (2021) Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit, A., and Kumar, S. Long-tail learning via logit adjustment. In ICLR, 2021.
  • Miyato et al. (2018) Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. TPAMI, 2018.
  • Oh et al. (2022) Oh, Y., Kim, D.-J., and Kweon, I. S. DASO: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning. In CVPR, 2022.
  • Ren et al. (2020) Ren, J., Yu, C., Ma, X., Zhao, H., and Yi, S. Balanced meta-softmax for long-tailed visual recognition. In NeurIPS, 2020.
  • Sohn et al. (2020) Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., Cubuk, E. D., Kurakin, A., and Li, C.-L. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
  • Tarvainen & Valpola (2017) Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
  • Wang et al. (2023) Wang, Y., Guo, J., Wang, J., Wu, C., Song, S., and Huang, G. Erratum to meta-semi: A meta-learning approach for semi-supervised learning. CAAI Artificial Intelligence Research, 2023.
  • Wei et al. (2021) Wei, C., Sohn, K., Mellina, C., Yuille, A., and Yang, F. CReST: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In CVPR, 2021.
  • Wei & Gan (2023) Wei, T. and Gan, K. Towards realistic long-tailed semi-supervised learning: Consistency is all you need. In CVPR, 2023.
  • Wei et al. (2022) Wei, T., Liu, Q.-Y., Shi, J.-X., Tu, W.-W., and Guo, L.-Z. Transfer and share: semi-supervised learning from long-tailed data. Machine Learning, pp.  1–18, 2022.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In Procedings of the British Machine Vision Conference 2016. British Machine Vision Association, 2016.

Appendix A Details of the Probabilistic Model

We provide a detailed derivation and analysis to demonstrate that the probabilistic model is correctly defined with an explicit parameterization of $p(x;\xi)$. Given the independence of parameters through conditional parameterization (Koller & Friedman, 2009), we can decompose the joint probability distribution as follows:

$$p(x,y;\theta,\pi,\xi)=p(x;\xi)p(y|x;\theta,\pi)=p(y;\pi)p(x|y;\theta,\xi).\qquad(18)$$

Applying Bayes’ rule, we obtain:

$$\frac{p(y|x;\theta,\pi)}{p(y;\pi)}=\frac{p(x|y;\theta,\xi)}{p(x;\xi)}.\qquad(19)$$

It is evident that $\pi$ and $\xi$ appear only on the left and right sides of the equation, respectively, indicating that the equation is a function of neither $\pi$ nor $\xi$ but is parameterized solely by $\theta$. We define the above quantity as $g(x,y;\theta)$, that is:

$$\frac{p(x|y;\theta,\xi)}{p(x;\xi)}=\frac{p(y|x;\theta,\pi)}{p(y;\pi)}=g(x,y;\theta).\qquad(20)$$

Returning to the equation in the main paper, we have:

$$p(y|x;\theta,\pi)=\frac{p(y;\pi)p(x|y;\theta,\xi)}{p(x;\xi)}=p(y;\pi)g(x,y;\theta).\qquad(21)$$

Although we explicitly parameterize $p(x;\xi)$ here, it is clear that $p(y|x;\theta,\pi)$ is parameterized solely by $\theta$ and $\pi$, and is independent of $\xi$. In fact, the fitting target of the network parameters $\theta$ is $g(x,y;\theta)=p(x|y;\theta,\xi)/p(x;\xi)$.

Since we did not explicitly parameterize $p(x)$ in our framework, when $x$ is not a condition, the parameters of the relevant notions omit the parameters of $p(x)$, as in $p(x)$, $p(x|y)$, $p(x,y)$, etc. This may lead to a potential misunderstanding, as:

$$p(x)=\sum_{y}p(x|y;\theta)p(y;\pi)\qquad(22)$$

suggests that $p(x)$ is parameterized by $\theta$ and $\pi$. However, if we recover the omitted parameter of $p(x)$, we have:

$$p(x;\xi)=\sum_{y}p(x|y;\theta,\xi)p(y;\pi)=p(x;\xi)\sum_{y}g(x,y;\theta)p(y;\pi)=p(x;\xi)\sum_{y}p(y|x;\theta,\pi)=p(x;\xi),\qquad(23)$$

which is consistent with the explicit parameterization of $p(x;\xi)$.

Therefore, the probabilistic model is correctly defined without the explicit parameterization of p(x)p(x).

Appendix B Experimental Setup

Datasets.

To validate the effectiveness of SimPro, we employ five commonly used SSL datasets: CIFAR10/100 (Krizhevsky et al., 2009), STL10 (Coates et al., 2011), ImageNet-127 (Fan et al., 2022), and the original ImageNet-1k (Deng et al., 2009). Following the methodology described in ACR (Wei & Gan, 2023), we denote the number of samples per class in the labeled and unlabeled datasets as $N_1\geq\cdots\geq N_K$ and $M_1\geq\cdots\geq M_K$, respectively, where $1,\cdots,K$ are the class indices. We define $\gamma_l$, $\gamma_u$, and $\gamma_t$ as the class imbalance ratios for labeled, unlabeled, and test data, respectively, and append 'LT' to the imbalanced variants. These ratios are calculated as $\gamma_l=N_1/N_K$ and $\gamma_u=M_1/M_K$. The sample number of the $k$-th class follows an exponential distribution, expressed as $N_k=N_1\cdot\gamma_l^{-\frac{k-1}{K-1}}$ for labeled and $M_k=M_1\cdot\gamma_u^{-\frac{k-1}{K-1}}$ for unlabeled data.
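A sketch of this exponential class-size schedule (the rounding convention is an assumption; actual splits may truncate differently):

def class_sizes(n1, gamma, num_classes):
    # N_k = N_1 * gamma^(-(k-1)/(K-1)), k = 1..K
    return [round(n1 * gamma ** (-(k - 1) / (num_classes - 1)))
            for k in range(1, num_classes + 1)]

# e.g., the CIFAR10-LT labeled split with N1=500, gamma_l=150:
# class_sizes(500, 150, 10) -> [500, 287, 164, 94, 54, 31, 18, 10, 6, 3]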

As illustrated in Fig. 1, for the CIFAR10/100 datasets, we constructed five class distributions to test the performance of different algorithms under more general settings. Regarding the STL10 dataset, due to the unknown ground-truth labels of the unlabeled data, we approach the experiments by controlling the imbalance ratio of the labeled data.

For ImageNet-127, we follow the original setting in Fan et al. (2022) ($\gamma_l=\gamma_u=\gamma_t\approx 286$). Nevertheless, this approach does not serve as an appropriate benchmark for long-tailed learning: in such scenarios, biased classifiers often exhibit high performance, which is exactly the outcome we seek to prevent. Consequently, we also resample the test dataset to ensure a uniform distribution ($\gamma_t=1$). Following Fan et al. (2022), we reduce the image resolution to $32\times 32$ and $64\times 64$ in response to resource constraints.

Training hyper-parameters.

Our experimental setup mainly follows FixMatch (Sohn et al., 2020) and ACR (Wei & Gan, 2023). For example, we employ Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) on CIFAR10/100 and STL10, and ResNet-50 (He et al., 2016) on ImageNet-127. All models are optimized with SGD. Several settings differ slightly from those in ACR: as outlined in Sec. 3.3, to achieve balanced training, we set the ratio between unlabeled and labeled samples in each batch as $\mu=M/N$ (8 on CIFAR, 16 on STL10-LT, and 2 on ImageNet-127), where $M$ and $N$ are the total sample numbers of unlabeled/labeled data. In contrast, $\mu$ is set to 2 for all datasets in ACR. The effectiveness of this adjustment is validated in Tab. 6.

The batch size for labeled data is $64$ on CIFAR10/100 and STL10-LT, and $32$ on ImageNet-127. To ensure a fair comparison, the training epochs are reduced to $86$ on CIFAR10/100 and STL10-LT, and $500$ on ImageNet-127. The initial learning rate $\eta$ is linearly scaled to $0.17$ on CIFAR10/100 and STL10-LT, and $0.01$ on ImageNet-127, and decays with a cosine schedule (Loshchilov & Hutter, 2017) as in ACR.

Regarding the hyperparameter $\tau$ used in Eq. 14, we follow the guidelines from Menon et al. (2021) and set $\tau=2.0$ for CIFAR10-LT/STL10-LT and $\tau=1.0$ for CIFAR100-LT/ImageNet-127.

For the confidence threshold $t$ in Eq. 1, we set $t=0.95$ on CIFAR10-LT/STL10-LT following Sohn et al. (2020). We adjust it to $0.2$ on CIFAR100-LT/ImageNet-127, as we observe that reducing the threshold enhances performance (Fig. 4).

Finally, the settings on ImageNet-1k are identical to those on ImageNet-127.
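For reference, the per-dataset settings above can be collected in one place (a plain summary; the dictionary layout and field names are ours):

```python
# Summary of the training settings described above.
# Fields: (mu, labeled batch size, epochs, initial lr, tau, confidence threshold t)
CONFIGS = {
    "CIFAR10-LT":   ( 8, 64,  86, 0.17, 2.0, 0.95),
    "STL10-LT":     (16, 64,  86, 0.17, 2.0, 0.95),
    "CIFAR100-LT":  ( 8, 64,  86, 0.17, 1.0, 0.20),
    "ImageNet-127": ( 2, 32, 500, 0.01, 1.0, 0.20),  # ImageNet-1k uses the same settings
}
```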

Appendix C More Experimental Results

Table 8: The impact of the predefined anchor distribution in ACR (Wei & Gan, 2023) on CIFAR10-LT with $\gamma_{l}=100$, $N_{1}=500$, and $M_{1}=4000$. $\star$ denotes that we use the predefined anchor distributions to estimate $P(y|\bm{\pi})$ in our SimPro.
Method | consistent ($\gamma_{u}=100$) | uniform ($\gamma_{u}=1$) | reversed ($\gamma_{u}=1/100$) | middle ($\gamma_{u}=100$) | head-tail ($\gamma_{u}=100$)
ACR | 81.6 | 92.1 | 85.0 | 73.6 | 79.8
SimPro | 80.7 | 93.8 | 85.8 | 84.8 | 83.0
SimPro$^{\star}$ | 82.7 | 94.3 | 86.0 | – | –
Table 9: The impact of the predefined anchor distribution in ACR (Wei & Gan, 2023) on CIFAR100-LT with $\gamma_{l}=20$, $N_{1}=50$, and $M_{1}=400$. $\star$ denotes that we use the predefined anchor distributions to estimate $P(y|\bm{\pi})$ in our SimPro.
Method | consistent ($\gamma_{u}=20$) | uniform ($\gamma_{u}=1$) | reversed ($\gamma_{u}=1/20$) | middle ($\gamma_{u}=20$) | head-tail ($\gamma_{u}=20$)
ACR | 44.9 | 52.2 | 42.3 | 42.6 | 42.6
SimPro | 43.1 | 52.3 | 45.5 | 43.6 | 44.8
SimPro$^{\star}$ | 45.9 | 53.8 | 46.0 | – | –
Figure 6: Sensitivity analysis of the confidence threshold $t$ for ACR (Wei & Gan, 2023) on CIFAR100-LT with $\gamma_{l}=20$, $N_{1}=50$, and $M_{1}=400$, and on CIFAR10-LT with $\gamma_{l}=150$, $N_{1}=500$, and $M_{1}=4000$. The optimal performance is achieved across different settings when the threshold is set at $0.95$.

Appendix D Proof of Proposition 1

Proof.

We employ the method of Lagrange multipliers to find the optimal values of $\bm{\hat{\pi}}_{l}$ and $\bm{\hat{\pi}}_{u}$ that maximize the $\mathcal{Q}$ function subject to the constraints of probability distributions (i.e., the elements of $\bm{\pi}_{l}$ and $\bm{\pi}_{u}$ must sum to 1). Let $\lambda_{l}$ and $\lambda_{u}$ be the Lagrange multipliers for these constraints. The Lagrangian $\mathcal{L}$ can be formulated as follows¹:

¹Here $y$ is the one-hot label; when $y$ appears as the subscript of a variable, it denotes the $y$-th category, i.e., $\phi_{y}=\phi_{\mathop{\mathrm{argmax}}y}$. We omit this distinction without affecting the understanding.

\mathcal{L}(\bm{\pi}_{l},\bm{\pi}_{u},\lambda_{l},\lambda_{u}) = \sum_{i}\log P(y_{i};\bm{\pi}_{l})+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\log P(y;\bm{\pi}_{u})
\qquad -\lambda_{l}\left(\sum_{y}\pi_{ly}-1\right)-\lambda_{u}\left(\sum_{y}\pi_{uy}-1\right). \quad (24)

The partial derivatives of $\mathcal{L}$ with respect to $\bm{\pi}_{l}$ and $\bm{\pi}_{u}$ are calculated as follows:

\frac{\partial\mathcal{L}}{\partial\bm{\pi}_{l}} = \frac{\partial}{\partial\bm{\pi}_{l}}\left(\sum_{i}\log P(y_{i};\bm{\pi}_{l})\right)-\lambda_{l}\frac{\partial}{\partial\bm{\pi}_{l}}\left(\sum_{y}\pi_{ly}-1\right), \quad (25)
\frac{\partial\mathcal{L}}{\partial\bm{\pi}_{u}} = \frac{\partial}{\partial\bm{\pi}_{u}}\left(\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\log P(y;\bm{\pi}_{u})\right)-\lambda_{u}\frac{\partial}{\partial\bm{\pi}_{u}}\left(\sum_{y}\pi_{uy}-1\right). \quad (26)

Since $P(y;\bm{\pi})$ is a categorical distribution, the derivatives are:

\frac{\partial\mathcal{L}}{\partial\pi_{ly}} = \sum_{i}\frac{\mathbf{1}_{\{y_{i}=y\}}}{\pi_{ly}}-\lambda_{l}, \quad (27)
\frac{\partial\mathcal{L}}{\partial\pi_{uy}} = \sum_{j}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\frac{1}{\pi_{uy}}-\lambda_{u}. \quad (28)

Setting these partial derivatives to zero and solving for $\pi_{ly}$ and $\pi_{uy}$:

\hat{\pi}_{ly} = \frac{\sum_{i}\mathbf{1}_{\{y_{i}=y\}}}{\lambda_{l}}, \quad (29)
\hat{\pi}_{uy} = \frac{\sum_{j}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})}{\lambda_{u}}. \quad (30)

Applying the constraint that the sum of probabilities equals 1, we get:

\sum_{y}\hat{\pi}_{ly}=1 \;\Rightarrow\; \lambda_{l}=\sum_{y}\sum_{i}\mathbf{1}_{\{y_{i}=y\}}=N, \quad (31)
\sum_{y}\hat{\pi}_{uy}=1 \;\Rightarrow\; \lambda_{u}=\sum_{y}\sum_{j}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})=M. \quad (32)

Therefore, the optimal solutions are:

\bm{\hat{\pi}}_{l}=\frac{1}{N}\sum_{i=1}^{N}y_{i},\quad\bm{\hat{\pi}}_{u}=\frac{1}{M}\sum_{j=1}^{M}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime}). \quad (33)
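In code, Eq. 33 amounts to two sample averages (a minimal sketch; the function and array names are ours):

```python
import numpy as np

def m_step_priors(labels_onehot: np.ndarray, posteriors: np.ndarray):
    """Closed-form M-step of Prop. 1.

    labels_onehot: (N, K) one-hot labels y_i of the labeled set.
    posteriors:    (M, K) posteriors P(y | x_j; theta', pi') on unlabeled data.
    """
    pi_l = labels_onehot.mean(axis=0)   # pi_l = (1/N) sum_i y_i
    pi_u = posteriors.mean(axis=0)      # pi_u = (1/M) sum_j P(y | x_j)
    return pi_l, pi_u
```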

Appendix E Proof of Proposition 2

Proof.

We structure our proof in two parts: first, we validate the proposition when training exclusively with labeled data; then, we extend the analysis to scenarios incorporating unlabeled data. This structure mirrors our threshold-based strategy for filtering low-confidence pseudo-labels: initially, only labeled data contribute to training, and unlabeled data are gradually integrated as training progresses.

Case 1: Labeled Data Only

Our proof begins by revisiting the definition of $\mathcal{Q}(\bm{\theta},\bm{\pi};\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ with respect to $\bm{\theta}$:

\mathcal{Q}(\bm{\theta}) = \sum_{i}\log\frac{\exp(f_{\bm{\theta}}(x_{i},y_{i}))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x_{i},y^{\prime}))}
= \sum_{i}\left(\log\frac{\phi_{y_{i}}\exp(f_{\bm{\theta}}(x_{i},y_{i}))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x_{i},y^{\prime}))}-\log\phi_{y_{i}}\right). \quad (34)

Ignoring the constant term, maximizing $\mathcal{Q}(\bm{\theta})$ is equivalent to minimizing the empirical risk:

R_{\text{emp}}(\bm{\theta})=-\frac{1}{N}\sum_{i}\log\frac{\phi_{y_{i}}\exp(f_{\bm{\theta}}(x_{i},y_{i}))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x_{i},y^{\prime}))}. \quad (35)

The empirical risk serves as an approximation of the expected risk, with $\mathcal{D}_{l}$ denoting the distribution of labeled data:

R_{\text{exp}}(\bm{\theta}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}_{l}}\log Q(y|x)
= -\int_{x}P_{l}(x)\sum_{y}P(y|x)\log Q(y|x)\,\mathrm{d}x
= \int_{x}P_{l}(x)\left[H(P(y|x))+D_{\text{KL}}(P(y|x)\,\|\,Q(y|x))\right]\mathrm{d}x, \quad (36)

where $Q(y|x)$ is defined as:

Q(y|x)=\frac{\phi_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}. \quad (37)
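Concretely, $Q(y|x)$ in Eq. 37 is a softmax over logits shifted by $\log\phi$ (a minimal sketch, assuming $f_{\bm{\theta}}$ outputs raw logits; the function and array names are ours):

```python
import numpy as np

def q_posterior(logits: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Eq. 37: Q(y|x) = phi_y exp(f(x,y)) / sum_y' phi_y' exp(f(x,y')).

    logits: (batch, K) values f_theta(x, y); phi: (K,) class prior.
    Equivalent to softmax(logits + log(phi)) along the class axis.
    """
    z = logits + np.log(phi)
    z -= z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```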

Since the KL divergence is non-negative and vanishes only when its two arguments coincide, minimizing the expected risk implies:

\frac{\phi_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}=P(y|x)=\frac{\pi_{ly}P(x|y)}{P_{l}(x)}. \quad (38)

We aim to validate the Bayes classifier:

P(y|x;\bm{\theta},\bm{\pi})=\frac{P(y;\bm{\pi})\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}P(y^{\prime};\bm{\pi})\exp(f_{\bm{\theta}}(x,y^{\prime}))}, \quad (39)

which leads to the formulation:

\frac{\pi_{y}P(x|y)}{\hat{P}(x)}=\frac{\pi_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\pi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}. \quad (40)

Combining Eq. 38 with Eq. 40 yields:

\frac{\phi_{y}}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}=\frac{\pi_{ly}\hat{P}(x)}{P_{l}(x)\sum_{y^{\prime}}\pi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}. \quad (41)

Summing over $y$ in Eq. 41 leads to:

\sum_{y}\phi_{y}=C=\frac{\hat{P}(x)\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}{P_{l}(x)\sum_{y^{\prime}}\pi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}. \quad (42)

Substituting Eq. 42 into Eq. 41 results in:

\phi_{y}=C\pi_{ly}. \quad (43)

As the constant $C$ is irrelevant in the logarithmic term of $\mathcal{Q}(\bm{\theta})$, in light of Prop. 1 we deduce the optimal $\bm{\hat{\phi}}$:

\bm{\hat{\phi}}=\frac{1}{N}\sum_{i=1}^{N}y_{i}. \quad (44)

Case 2: Labeled and Unlabeled Data

Expanding our proof to include both labeled and unlabeled data, the optimization objective remains the same: maximizing $\mathcal{Q}(\bm{\theta})$ by minimizing the empirical risk:

R_{\text{emp}}(\bm{\theta})=-\frac{1}{N+M}\left(\sum_{i}\log Q(y_{i}|x_{i})+\sum_{j,y}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\log Q(y|x_{j})\right). \quad (45)
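A direct transcription of Eq. 45 (a minimal sketch; $Q$ is the adjusted posterior of Eq. 37, array names are ours, and the threshold-based filtering of pseudo-labels is omitted for brevity):

```python
import numpy as np

def empirical_risk(q_l, y_onehot, q_u, pseudo):
    """Eq. 45: cross-entropy with one-hot labels on labeled data and
    soft pseudo-labels P(y | x_j; theta', pi') on unlabeled data.

    q_l: (N, K) Q(y|x_i); y_onehot: (N, K) one-hot labels y_i.
    q_u: (M, K) Q(y|x_j); pseudo:   (M, K) soft pseudo-labels.
    """
    n, m = len(q_l), len(q_u)
    ce_labeled = -(y_onehot * np.log(q_l)).sum()
    ce_unlabeled = -(pseudo * np.log(q_u)).sum()
    return (ce_labeled + ce_unlabeled) / (n + m)
```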

This risk approximates the expected risk over $\mathcal{D}$:

R_{\text{exp}}(\bm{\theta})=-\mathbb{E}_{(x,y)\sim\mathcal{D}}\log Q(y|x), \quad (46)

where $\mathcal{D}$ represents the mixture distribution of labeled and unlabeled data, whose density is:

P_{\mathcal{D}}(x,y)=\frac{N}{M+N}P_{l}(x,y)+\frac{M}{M+N}P_{u}(x)P(y|x;\bm{\theta}^{\prime},\bm{\pi}^{\prime}). \quad (47)

Since $P(y|x;\bm{\theta}^{\prime},\bm{\pi}^{\prime})$ is a Bayes classifier, we conclude:

P_{\mathcal{D}}(x,y)=\left(\frac{N}{M+N}\pi_{ly}+\frac{M}{M+N}\pi_{uy}\right)P(x|y), \quad (48)

which leads to the formulation:

P_{\mathcal{D}}(y)=\frac{N}{M+N}\pi_{ly}+\frac{M}{M+N}\pi_{uy},\quad P_{\mathcal{D}}(x|y)=P(x|y). \quad (49)

Following the same logic as in the labeled data case:

\frac{\phi_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\phi_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}=P_{\mathcal{D}}(y|x)=\frac{P_{\mathcal{D}}(y)P(x|y)}{P_{\mathcal{D}}(x)}. \quad (50)

This results in:

\phi_{y}=C\cdot P_{\mathcal{D}}(y). \quad (51)

In accordance with Prop. 1, we determine the optimal $\bm{\hat{\phi}}$ as:

\bm{\hat{\phi}} = \frac{N}{M+N}\bm{\hat{\pi}}_{l}+\frac{M}{M+N}\bm{\hat{\pi}}_{u}
= \frac{1}{N+M}\left(\sum_{i}y_{i}+\sum_{j}P(y|x_{j};\bm{\theta}^{\prime},\bm{\pi}^{\prime})\right). \quad (52)

Appendix F Proof of Proposition 3

Proof.

We begin by examining $P(e;\bm{\hat{\phi}})$. The decision error rate is defined as follows:

P(e)=\int_{x}P(e|x)P(x)\,\mathrm{d}x,\quad P(e|x)=\begin{cases}P(-1|x)&\text{if the decision is }+1;\\ P(+1|x)&\text{if the decision is }-1.\end{cases} \quad (53)

This formulation quantifies the error in decision-making by integrating the conditional error rate over all inputs.

Building upon the Bayes optimal classifier as outlined in Eq. 11, the posterior probability essential for decision-making on the test set is expressed as:

P_{d}(y|x)\propto\exp(f_{\bm{\theta}}(x,y)). \quad (54)

Here, $f_{\bm{\theta}}(x,y)$ represents the model's discriminative function, parameterized by $\bm{\theta}$, for decision outcome $y$ given an input $x$.

The ground-truth class prior and class-conditional distributions are denoted by $P(y;\bm{\phi}^{*})$ and $P(x|y)$, and are approximated by the parameters $\bm{\hat{\phi}}$ and $\bm{\theta}$, respectively. The relationship between the estimated and true distributions is expressed as follows:

\frac{\hat{\phi}_{y}\exp(f_{\bm{\theta}}(x,y))}{\sum_{y^{\prime}}\hat{\phi}_{y^{\prime}}\exp(f_{\bm{\theta}}(x,y^{\prime}))}=\frac{\phi^{*}_{y}P(x|y)}{\sum_{y^{\prime}}\phi^{*}_{y^{\prime}}P(x|y^{\prime})}, \quad (55)

indicating that the model's estimated posterior matches the posterior of the true data distribution.

Combining the posterior in Eq. 54 with the relationship in Eq. 55, we derive the following expression:

P_{d}(y|x) \propto \exp(f_{\bm{\theta}}(x,y)) \propto \frac{\phi^{*}_{y}}{\hat{\phi}_{y}}P(x|y) \propto \frac{\phi^{*}_{y}}{\hat{\phi}_{y}}P(y|x), \quad (56)

where the final step is justified by the fact that the class distribution in the test set is uniform.

The decision criterion is thus formulated as:

\text{decision}=\begin{cases}+1&\text{if }l(x)\geq\lambda;\\ -1&\text{if }l(x)\leq\lambda,\end{cases} \quad (57)

where

l(x)=\frac{P(+1|x)}{P(-1|x)},\quad\lambda=\frac{\hat{\phi}_{+1}\phi^{*}_{-1}}{\hat{\phi}_{-1}\phi^{*}_{+1}}. \quad (58)
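For intuition (toy numbers of our own choosing), suppose the test prior is uniform, $\bm{\phi}^{*}=(0.5,0.5)$, while the estimate is $\bm{\hat{\phi}}=(0.4,0.6)$. Then

\lambda=\frac{0.4\times 0.5}{0.6\times 0.5}=\frac{2}{3}\leq 1,

so the classifier declares $+1$ on the region $2/3\leq l(x)<1$ where the Bayes-optimal rule would declare $-1$, and the bound derived below in Eq. 61 evaluates to $(1-\lambda)/2=1/6$.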

We assume $\lambda\leq 1$ without loss of generality, due to the symmetric nature of the decision problem.

The decision error rate is then expressed as:

P(e;\bm{\hat{\phi}})=\int_{x}P(e|x;\bm{\hat{\phi}})P(x)\,\mathrm{d}x,\quad P(e|x;\bm{\hat{\phi}})=\begin{cases}P(-1|x)&\text{if }l(x)\geq\lambda;\\ P(+1|x)&\text{if }l(x)\leq\lambda.\end{cases} \quad (59)

The decision error rate is minimized when $\bm{\hat{\phi}}=\bm{\phi}^{*}$, yielding the Bayes decision error rate, the theoretical lower bound over all possible estimates:

\inf_{\bm{\phi}}P(e;\bm{\phi})=P(e;\bm{\phi}^{*})=\int_{x}P(e|x;\bm{\phi}^{*})P(x)\,\mathrm{d}x,\quad P(e|x;\bm{\phi}^{*})=\begin{cases}P(-1|x)&\text{if }l(x)\geq 1;\\ P(+1|x)&\text{if }l(x)\leq 1.\end{cases} \quad (60)

Finally, comparing $P(e;\bm{\hat{\phi}})$ with the optimal decision error rate, we bound the difference:

P(e;\bm{\hat{\phi}})-\inf_{\bm{\phi}}P(e;\bm{\phi})
= \int_{\lambda\leq l(x)\leq 1}|P(-1|x)-P(+1|x)|\,P(x)\,\mathrm{d}x
= \int_{\lambda\leq l(x)\leq 1}|1-l(x)|\,P(-1|x)\,P(x)\,\mathrm{d}x
\leq (1-\lambda)\int_{\lambda\leq l(x)\leq 1}P(-1,x)\,\mathrm{d}x
\leq (1-\lambda)\int_{x}P(-1,x)\,\mathrm{d}x
= (1-\lambda)\,P(-1)
= \frac{1-\lambda}{2}
= \frac{\hat{\phi}_{-1}-\phi^{*}_{-1}}{2\,\phi^{*}_{+1}\hat{\phi}_{-1}}
\leq \frac{1}{2\,\phi^{*}_{+1}\phi^{*}_{-1}}(\hat{\phi}_{-1}-\phi^{*}_{-1})
= \frac{1}{2\,\phi^{*}_{+1}\phi^{*}_{-1}}\,|\bm{\hat{\phi}}-\bm{\phi}^{*}|, \quad (61)

where the final inequality is justified by $\hat{\phi}_{-1}\geq\phi^{*}_{-1}$, which follows from $\lambda\leq 1$.