
Uncertainty Herding: One Active Learning Method for All Label Budgets

Wonho Bae
University of British Columbia & Borealis AI
whbae@cs.ubc.ca

Gabriel L. Oliveira
Borealis AI
gabriel.oliveira@borealisai.com

Danica J. Sutherland
University of British Columbia & Amii
dsuth@cs.ubc.ca

These authors contributed equally.
Abstract

Most active learning research has focused on methods which perform well when many labels are available, but can be dramatically worse than random selection when label budgets are small. Other methods have focused on the low-budget regime, but do poorly as label budgets increase. As the line between “low” and “high” budgets varies by problem, this is a serious issue in practice. We propose uncertainty coverage, an objective which generalizes a variety of low- and high-budget objectives, as well as natural, hyperparameter-light methods to smoothly interpolate between low- and high-budget regimes. We call greedy optimization of the estimate Uncertainty Herding; this simple method is computationally fast, and we prove that it nearly optimizes the distribution-level coverage. In experimental validation across a variety of active learning tasks, our proposal matches or beats state-of-the-art performance in essentially all cases; it is the only method of which we are aware that reliably works well in both low- and high-budget settings.

1 Introduction

In active learning, rather than being provided a dataset of input-output pairs as in passive learning, a model strategically requests annotations for specific unlabeled inputs. The aim is to learn a good model while minimizing the number of required output annotations. This procedure is generally iterative: a model is initially trained on a small, labeled dataset, then selects the most “informative” data points from an unlabeled pool to annotate. This is particularly useful when labeling is expensive or time-consuming. For example, manual annotations of medical imaging by radiologists or pathologists may be especially time-consuming and costly. Measuring whether a compound interacts with a certain biological compound may require slow, high-accuracy chemical simulations or even lab experiments. Discovering a customer’s product preferences may require giving them many offers, which is slow, potentially expensive, and may produce a poor customer experience.

The most popular line of work in active learning has used notions of uncertainty to measure how informative each candidate data point is expected to be, and selects data points for labeling to maximize that measure. Although these uncertainty-based approaches often work well in the experimental settings where they are evaluated, Hacohen et al. (2022) and Yehuda et al. (2022) have shown that when there are few total labeled data points (the low-budget setting) they can be substantially worse than random selection, presumably because the model’s estimate of uncertainty is not yet reliable. To address this, they (and some others) have proposed methods that prioritize “representative” data points, often built on clustering methods such as $k$-means. These methods can work substantially better in low-budget regimes, but themselves often saturate performance and do worse than uncertainty-based selection once budgets are large enough.

In practice it is difficult to know whether a given budget is “high” or “low” for a particular problem; it greatly depends on the particular dataset and model architecture. Hacohen & Weinshall (2024) proposed an algorithm, SelectAL, to select whether to use a high- or low-budget method. This approach, however, assumes discrete budget regimes when there is often not a clear boundary, and, because of the form of the algorithm, is also unable to consider uncertainty-based active learning measures directly. SelectAL also requires re-training models many times, which may be computationally infeasible, and requires a nontrivial amount of data holdout for validation, an issue when budgets are low. Perhaps most importantly, the algorithm appears quite sensitive to small, subtle decisions; our attempts at replication (the authors have not publicly released code, though they indicated in private communication that they plan to) gave extremely inconsistent and unreliable estimates of the regime, overall yielding much worse performance than reported in the paper.

This motivates the aim of finding a single active learning algorithm which can seamlessly adapt from low- to high-budget regimes. While there have been various hybrid methods combining representation and uncertainty, we find in practice that none of these methods work well in low-budget settings. We therefore propose an objective called uncertainty coverage, adding a notion of uncertainty to the “generalized coverage” of Bae et al. (2024) and its greedy optimizer MaxHerding. We call greedy optimization of the empirical estimate of the uncertainty coverage Uncertainty Herding (UHerding); we prove UHerding nearly maximizes the true uncertainty coverage.

The uncertainty coverage agrees with the generalized coverage in one extreme setting of parameters, while agreeing with uncertainty measures in another. To naturally interpolate between those settings, we propose a simple method to adaptively and automatically adjust these parameters such that the objective moves itself from mostly representation-based to mostly uncertainty-based behavior. With this parameter adaptation scheme, we demonstrate that UHerding outperforms MaxHerding (and all other methods) in low-budget regimes, while also outperforming uncertainty sampling (and all other methods) in high-budget regimes, across several benchmark datasets (CIFAR-10 and -100, TinyImageNet, DomainNet, and ImageNet) in both standard supervised and transfer learning settings. Furthermore, we describe how several existing hybrid active learning methods are closely related to UHerding, and confirm that our parameter adaptation schemes also benefit existing hybrid methods.

2 Related Work and Background

The most common framework in active learning is pool-based active learning. At each step $t\in\{1,2,\cdots,T\}$, a labeled set $\mathcal{L}_{t}\subseteq\mathcal{X}$ is iteratively expanded by querying a set of new data points $\mathcal{S}_{t}=\{\mathbf{x}_{b}\}_{b=1}^{B_{t}}$ from an unlabeled pool of data points $\mathcal{U}_{t}\subseteq\mathcal{X}$, where $\mathcal{X}$ denotes the support set of the data distribution of interest. A model is then trained on the new $\mathcal{L}_{t+1}$. Usually, the most important component is determining which points to annotate.
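The pool-based loop described above can be sketched as follows. This is only an illustrative skeleton: the names `active_learning_loop`, `select`, and `train` are hypothetical, the labeling oracle is omitted, and random selection stands in for any real query strategy.

```python
import random

def active_learning_loop(pool, budgets, select, train, seed=0):
    """Generic pool-based loop: pick B_t points, move them to the labeled set, retrain."""
    rng = random.Random(seed)
    labeled, unlabeled = [], list(pool)
    model = None
    for budget in budgets:                      # one round per step t
        batch = select(model, labeled, unlabeled, budget, rng)
        labeled.extend(batch)                   # L_{t+1} = L_t ∪ S_t
        unlabeled = [x for x in unlabeled if x not in batch]
        model = train(labeled)                  # retrain on the enlarged labeled set
    return model, labeled, unlabeled

def random_select(model, labeled, unlabeled, budget, rng):
    """Placeholder query strategy: uniform sampling without replacement."""
    return rng.sample(unlabeled, budget)

# Toy run: the "model" is just the size of the labeled set (train=len).
model, labeled, unlabeled = active_learning_loop(
    pool=range(20), budgets=[3, 3, 4], select=random_select, train=len)
```

Any of the strategies discussed below would replace `random_select`; the rest of the loop stays the same.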

Uncertainty-based Methods

“Myopic” methods that rely only on a current model’s predictions include entropy (Wang & Shang, 2014), margin (Scheffer et al., 2001), confidence, and posterior probability (Lewis & Catlett, 1994; Lewis & Gale, 1994). In the Bayesian setting, BALD (Gal et al., 2017; Kirsch et al., 2019) uses mutual information between labels and model parameters. BAIT (Ash et al., 2021) tries to select data points that minimize Bayes risk. Instead of using a snapshot of a trained model, Kye et al. (2023) exploit uncertainties computed in the process of model training.

Some models focus on “looking ahead” to predict how a data point will change the model. These include methods based on expected changes in model parameters (Settles, 2009; Settles et al., 2007; Ash et al., 2020), expected changes in model predictions (Freytag et al., 2014; Käding et al., 2016; 2018), and expected error reduction (Roy & McCallum, 2001; Zhu et al., 2003; Guo & Greiner, 2007). These approaches are primarily used with simple models like linear and Naïve Bayes models. For deep models, a neural tangent kernel-based linearization (Mohamadi et al., 2022) can be used, though it offers limited improvement over uncertainty sampling relative to its computational cost.

Representation-based Methods

These methods select data points that represent (or cover) the data distribution. Traditional methods include $k$-means (Xu et al., 2003), medoids (Aghaee et al., 2016) and medians (Voevodski et al., 2012). Hacohen et al. (2022) show that selecting representative points is particularly helpful in low-budget regimes, and propose Typiclust, which selects the “most typical” points in each cluster. Bıyık et al. (2019) utilize determinantal point processes instead.

Another approach is to minimize distance between the labeled and unlabeled data distributions, whether kernel MMD (Chen et al., 2010), Wasserstein distance (Mahmood et al., 2022), or an estimate of the KL divergence tailored to transfer learning as in ActiveFT (Xie et al., 2023).

Sener & Savarese (2018) convert the objective of active learning into maximum coverage, proposing greedy $k$-center. Instead of finding the minimum radius to cover all data points, ProbCover (Yehuda et al., 2022) greedily selects data points to cover the most data points with a fixed radius. Its performance is sensitive to the choice of radius, which is also difficult to set. MaxHerding (Bae et al., 2024) generalizes this to a continuous notion of coverage, generalized coverage (or GCoverage), which is less sensitive to parameter choice. As we build directly on this method, we describe it in more detail.

GCoverage is defined in terms of a function $k:\mathcal{X}\times\mathcal{X}\times(\mathcal{X}\to\mathcal{V})\to\mathbb{R}_{\geq 0}$, which computes a similarity between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ based on a feature mapping $g:\mathcal{X}\to\mathcal{V}$. Bae et al. (2024) mostly use the Gaussian kernel $k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)=\exp\left(-\left\lVert g(\mathbf{x})-g(\mathbf{x}^{\prime})\right\rVert^{2}/\sigma^{2}\right)$ with $g$ based on self-supervised feature extractors such as SimCLR (Chen et al., 2020). (Although we call the function $k_{\sigma}$ and use the term “kernel,” it is not generally necessary that the function be positive semi-definite as in kernel methods, nor that it integrate to 1 as in kernel density estimation.) The GCoverage and its estimator are

$$\mathrm{C}_{k_{\sigma}}(\mathcal{S}):=\mathbb{E}_{\mathbf{x}}\left[\max_{\mathbf{x}^{\prime}\in\mathcal{S}}k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)\right]\approx\frac{1}{N}\sum_{n=1}^{N}\left(\max_{\mathbf{x}^{\prime}\in\mathcal{S}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime};g)\right)=:\widehat{\mathrm{C}}_{k_{\sigma}}(\mathcal{S}),\quad\text{with }\mathbf{x}_{n}\in\mathcal{U}. \tag{1}$$

MaxHerding greedily maximizes the estimated GCoverage: $\mathbf{x}^{*}\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\widehat{\mathrm{C}}_{k_{\sigma}}(\mathcal{L}\cup\{\tilde{\mathbf{x}}\})$. This is a $\left(1-\frac{1}{e}\right)$ approximation algorithm for optimizing the monotone submodular function $\widehat{\mathrm{C}}_{k_{\sigma}}$.
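As a rough illustration of the greedy rule above, the following sketch evaluates the estimated GCoverage with a Gaussian kernel and adds one point at a time. It is a naive, quadratic-time toy and not the authors' implementation; the function names and the two-cluster data are assumptions made for the example.

```python
import math

def gaussian_kernel(u, v, sigma):
    """k_sigma(u, v) = exp(-||u - v||^2 / sigma^2) on feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / sigma ** 2)

def estimated_gcoverage(features, selected, sigma):
    """Mean over the pool of the best similarity to any selected point."""
    if not selected:
        return 0.0
    return sum(
        max(gaussian_kernel(x, features[s], sigma) for s in selected)
        for x in features
    ) / len(features)

def max_herding(features, labeled, budget, sigma):
    """Greedily add `budget` pool indices that most increase the estimate."""
    chosen = list(labeled)
    for _ in range(budget):
        candidates = [i for i in range(len(features)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: estimated_gcoverage(features, chosen + [i], sigma))
        chosen.append(best)
    return chosen[len(labeled):]

# Two tight clusters in a toy 2-D feature space (indices 0-2 and 3-5).
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
picked = max_herding(features, labeled=[], budget=2, sigma=1.0)
```

With a budget of two, the greedy rule takes one point from each cluster, since a second point in an already-covered cluster adds almost nothing to the coverage.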

Hybrid Methods

These methods aim to select informative yet representative data points. Nguyen & Smeulders (2004); Donmez et al. (2007) use margin-based selection weighted by clustering scores. Settles & Craven (2008) weight uncertainty measures like entropy by cosine similarity. For neural networks, BADGE (Ash et al., 2020) uses $k$-means++ on loss gradient space, while ALFA-Mix (Parvaneh et al., 2022) applies $k$-means to uncertain points based on feature interpolation.

Since uncertainty and representation-based active learning approaches behave differently in different budget regimes, SelectAL (Hacohen & Weinshall, 2024) and TCM (Doucet et al., 2024) provide methods to decide when to switch from low-budget to high-budget methods. TCM provides some insights for a transition point, but their insights are based on extensive experimentation, and hard to generalize to different settings. SelectAL was discussed in the previous section. In this work, we propose a more robust approach with minimal re-training, covering continuous budget regimes.

Clustering in active learning

$k$-means is widely used in active learning to promote diversity in selection. BADGE, ALFA-Mix, and Typiclust, for example, use $k$-means or variants.

$k$-means centroids, however, do not in general correspond to any available point, and thus other methods (such as Typiclust’s density criterion) must be used to choose a point from a cluster. It would be natural to instead require centroids to be data points, yielding $k$-medoids. The common alternating update scheme similar to $k$-means often leads to poor local optima (Schubert & Rousseeuw, 2021). The Partitioning Around Medoids (PAM) algorithm (Kaufman & Rousseeuw, 2009; Schubert & Rousseeuw, 2019; Schubert & Lenssen, 2022) gives better clusters, but is much slower. For GCoverage, MaxHerding selects points whose downstream performance is essentially equivalent to that obtained with the far more expensive Faster PAM algorithm (Schubert & Rousseeuw, 2021).

As we demonstrate empirically in Figure 8(b), MaxHerding achieves significantly better performance than both $k$-means and $k$-means++ for active learning. Thus, in Section 3.4, we replace $k$-means with MaxHerding for the theoretical analysis of some clustering-based active learning methods.

Figure 1: (a) Comparison of ProbCover, GCoverage, and UCoverage. (b) Temperature $\tau$ and lengthscale $\sigma$. Left: illustration of coverages (Section 3.1). Right: parameter adaptation (Section 3.2).

3 Method

We introduce a novel approach called Uncertainty Herding (UHerding), designed to “interpolate” between the state-of-the-art representation-based method MaxHerding and any choice of uncertainty-based method (e.g., Margin), providing effectiveness across different budget regimes.

3.1 Uncertainty Coverage

We first define a measure of how much uncertainty a set of data points covers.

Definition 1.

For any subset $\mathcal{S}\subset\mathcal{X}$, a nonnegative-valued function $k_{\sigma}$ (we typically use $k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)=\psi((g(\mathbf{x})-g(\mathbf{x}^{\prime}))/\sigma)\in[0,1]$ for some function $\psi$ and a feature map $g$), and a nonnegative-valued uncertainty function $U$, uncertainty coverage (UCoverage) is defined and empirically estimated as

$$\mathrm{UC}_{k_{\sigma}}(\mathcal{S})=\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x};f)\max_{\mathbf{x}^{\prime}\in\mathcal{S}}k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)\right]\approx\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n};f)\max_{\mathbf{x}^{\prime}\in\mathcal{S}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime};g)=\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S}).$$

Here, the uncertainty function $U$ is based on a model $f$ which is updated as the model trains, while $k_{\sigma}$ uses a fixed feature extractor $g$. UCoverage weights the GCoverage of (1) with a choice of uncertainty measure $U(\mathbf{x};f)$; choosing $U(\mathbf{x};f)=1$ immediately recovers GCoverage.
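A minimal sketch of the empirical estimate, assuming a Gaussian kernel on precomputed features and a list of precomputed uncertainty values (both the function name and the toy numbers are illustrative only):

```python
import math

def gaussian_kernel(u, v, sigma):
    """k_sigma(u, v) = exp(-||u - v||^2 / sigma^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / sigma ** 2)

def estimated_ucoverage(features, uncertainty, selected, sigma):
    """(1/N) * sum_n U(x_n) * max_{x' in S} k_sigma(x_n, x')."""
    if not selected:
        return 0.0
    total = 0.0
    for x, u in zip(features, uncertainty):
        total += u * max(gaussian_kernel(x, features[s], sigma) for s in selected)
    return total / len(features)

features = [(0.0,), (1.0,), (4.0,)]
selected = [0]  # the selected set S contains only the first point

# U = 1 everywhere: the estimate reduces to the estimated GCoverage.
uniform = estimated_ucoverage(features, [1.0, 1.0, 1.0], selected, sigma=2.0)
# Most uncertainty sits on the far point, which S covers poorly.
weighted = estimated_ucoverage(features, [0.1, 0.1, 0.9], selected, sigma=2.0)
```

In the second call the estimate drops sharply: the selected point covers its neighborhood well, but that neighborhood carries little uncertainty.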

In Figure 1(a), we visualize differences in assessing coverage. ProbCover (left) treats all data points within a $\sigma$-radius ball around selected points equally, assigning uniform weight to each. GCoverage (middle) introduces a smooth proximity measure by applying a kernel function, such as the RBF kernel, so that points closer to a selected point count as better covered. UCoverage (right) additionally incorporates uncertainty. The green circles represent uncertainty, with their size proportional to each point’s uncertainty. Ultimately, UCoverage evaluates how well the selected data points account for uncertainty across the space.

We now show $\widehat{\mathrm{UC}}_{k_{\sigma}}$ is a good estimator of $\mathrm{UC}_{k_{\sigma}}$. All proofs are in Appendix A.

Theorem. Let $U(\mathbf{x};f)\in[0,U_{\max}]$, $k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)=\tilde{k}_{\sigma}(g(\mathbf{x}),g(\mathbf{x}^{\prime}))\in[0,1]$, $\{g(\mathbf{x}):\mathbf{x}\in\mathcal{U}\}\subseteq\{\mathbf{t}\in\mathbb{R}^{d}:\left\lVert\mathbf{t}\right\rVert\leq R\}$, and $\left\lvert\tilde{k}_{\sigma}(\mathbf{t},\mathbf{t}^{\prime})-\tilde{k}_{\sigma}(\mathbf{t},\mathbf{t}^{\prime\prime})\right\rvert\leq L_{\sigma}\left\lVert\mathbf{t}^{\prime}-\mathbf{t}^{\prime\prime}\right\rVert$. Let $\mathcal{L}\subseteq\mathcal{X}$ be arbitrary and fixed. Assume $B/N<16R^{2}$. Then, with probability at least $1-\delta$ over the choice of the $N$ iid data points in $\mathcal{U}\subseteq\mathcal{X}$ used to estimate $\widehat{\mathrm{UC}}_{k_{\sigma}}$, all size-$B$ sets $\mathcal{S}$ (not only subsets of $\mathcal{U}$) have low error:

$$\sup_{\substack{\mathcal{S}\subseteq\mathcal{X}\\ \left\lvert\mathcal{S}\right\rvert=B}}\left\lvert\mathrm{UC}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})-\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})\right\rvert\leq U_{\max}\sqrt{\frac{B}{N}}\left[8L_{\sigma}+\frac{1}{2}\sqrt{d\log\left(R^{2}\frac{N}{B}\right)+\frac{2}{B}\log\frac{2}{\delta}}\,\right].$$

Since typically $B\ll N$ (we query up to perhaps a few hundred points at a time, out of a dataset of at least tens of thousands but perhaps millions), even optimizing our noisy estimate of the uncertainty coverage does not introduce substantial error. (It is worth emphasizing that, although Bae et al. (2024) mentioned a simple Hoeffding bound for $\widehat{\mathrm{C}}_{k_{\sigma}}$, this bound only applies to fixed $\mathcal{S}$ independent of $\mathcal{U}$, while in reality $\mathcal{S}\subseteq\mathcal{U}$. Ignoring this problem and taking a union bound over all $\binom{N}{B}$ subsets of $\mathcal{U}$ would yield a uniform convergence bound of $U_{\max}\sqrt{\frac{1}{2N}\log\left(\binom{N}{B}\frac{2}{\delta}\right)}\leq U_{\max}\sqrt{\frac{B}{2N}\left[\log\left(e\frac{N}{B}\right)+\frac{1}{B}\log\frac{2}{\delta}\right]}$. The rate of our theorem is very similar, with the advantage of being correct.)

If $U$ is given by the margin between probabilistic predictions, then $U_{\max}\leq 1$; other notions can also be easily bounded. Assuming $k_{\sigma}\leq 1$ is for convenience, but a different upper bound can simply be absorbed into $U_{\max}$. The bound is indeed 1 for the Gaussian kernel on $g$ used by Bae et al. (2024) (as well as by us); this kernel also has $L_{\sigma}=\sqrt{\frac{2}{e}}\cdot\frac{1}{\sigma}$. The kernel which corresponds to the probability coverage of Yehuda et al. (2022), however, is not Lipschitz, suggesting why it is so sensitive to $\sigma$. Self-supervised representations in $g$ are often normalized to $R=1$, and usually have $d$ at most a few hundred.
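To get a feel for how the right-hand side of the bound scales, one can simply evaluate it. The sketch below does so for the Gaussian kernel, using the Lipschitz constant $L_{\sigma}=\sqrt{2/e}/\sigma$ quoted above, and checks that constant numerically on a grid. The function names, default arguments, and example values are assumptions for illustration, not settings from the experiments.

```python
import math

def gaussian_lipschitz(sigma):
    """L_sigma = sqrt(2/e) / sigma for k(t, t') = exp(-||t - t'||^2 / sigma^2)."""
    return math.sqrt(2.0 / math.e) / sigma

def uniform_convergence_bound(n_pool, batch, dim, sigma,
                              radius=1.0, u_max=1.0, delta=0.05):
    """Right-hand side of the uniform convergence bound, Gaussian kernel."""
    if not batch / n_pool < 16 * radius ** 2:
        raise ValueError("the theorem assumes B / N < 16 R^2")
    inner = (dim * math.log(radius ** 2 * n_pool / batch)
             + (2.0 / batch) * math.log(2.0 / delta))
    return u_max * math.sqrt(batch / n_pool) * (
        8.0 * gaussian_lipschitz(sigma) + 0.5 * math.sqrt(inner))

# Numerical check of the Lipschitz constant: the steepest finite-difference
# slope of r -> exp(-r^2 / sigma^2) on a fine grid approaches sqrt(2/e) / sigma.
sigma = 0.5
step = 1e-4
steepest = max(
    abs(math.exp(-((r + step) / sigma) ** 2) - math.exp(-(r / sigma) ** 2)) / step
    for r in (i * step for i in range(20000)))

# Same batch size and feature dimension, two hypothetical pool sizes.
small_pool = uniform_convergence_bound(n_pool=10_000, batch=100, dim=128, sigma=1.0)
large_pool = uniform_convergence_bound(n_pool=1_000_000, batch=100, dim=128, sigma=1.0)
```

The bound shrinks roughly like $\sqrt{B/N}$ (up to the logarithmic term), so it is most informative when the pool is much larger than the batch.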

Input: initial labeled set $\mathcal{L}_{0}$, initial unlabeled set $\mathcal{U}_{0}$, a set of temperatures $\mathcal{T}$, the number of iterations $T$, a set of query budgets $\{B_{t}\}_{t=0}^{T-1}$, a classifier $f$, and a feature extractor $g$
for $t\in[0,1,\cdots,T-1]$ do
    // Parameter adaptation
    Compute $\tau^{*}=\operatorname*{arg\,min}_{\tau\in\mathcal{T}}\text{ECE}(f_{\tau}^{\mathcal{L}_{t}^{\text{train}}},\mathcal{L}_{t}^{\text{val}})$, where $\mathcal{L}_{t}^{\text{train}}$ and $\mathcal{L}_{t}^{\text{val}}$ are a random split of $\mathcal{L}_{t}$, and $f_{\tau}^{\mathcal{L}_{t}^{\text{train}}}$ refers to a classifier $f$ trained on $\mathcal{L}_{t}^{\text{train}}$ with its logits scaled by $1/\tau$
    Compute $\mathbf{k}\in\mathbb{R}^{\lvert\mathcal{U}_{t}\rvert}$ with $\mathbf{k}_{n}=\max_{\mathbf{x}^{\prime}\in\mathcal{L}_{t}}k_{\sigma^{*}}(\mathbf{x}_{n},\mathbf{x}^{\prime})$ for $\sigma^{*}=\min_{\mathbf{u},\mathbf{v}\in\mathcal{L}_{t},\mathbf{u}\neq\mathbf{v}}D(\mathbf{u},\mathbf{v};g)$
    // Greedy selection based on the uncertainty coverage
    for $b\in[1,2,\cdots,B_{t}]$ do
        Select $\mathbf{x}_{b}^{*}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n};f_{\tau^{*}})\cdot\max(k_{\sigma^{*}}(\mathbf{x}_{n},\tilde{\mathbf{x}})-\mathbf{k}_{n},0)$
        Update $\mathbf{k}_{n}\leftarrow\max(k_{\sigma^{*}}(\mathbf{x}_{n},\mathbf{x}_{b}^{*}),\mathbf{k}_{n})$ for all $n\in\{1,\dots,\lvert\mathcal{U}_{t}\rvert\}$
    Update $\mathcal{L}_{t+1}\leftarrow\mathcal{L}_{t}\cup\{\mathbf{x}_{b}^{*}\}_{b=1}^{B_{t}}$ and $\mathcal{U}_{t+1}\leftarrow\mathcal{U}_{t}\,\backslash\,\{\mathbf{x}_{b}^{*}\}_{b=1}^{B_{t}}$
Algorithm 1: Uncertainty herding with parameter adaptation
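The greedy selection loop of Algorithm 1 can be sketched in plain Python as below. This is a hedged toy version: it assumes a Gaussian kernel, takes the (already temperature-scaled) uncertainties and the lengthscale as inputs, and, for simplicity, indexes one pool that contains the labeled points as well. The name `uherding_batch` is hypothetical.

```python
import math

def kernel(u, v, sigma):
    """Gaussian similarity exp(-||u - v||^2 / sigma^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / sigma ** 2)

def uherding_batch(features, uncertainty, labeled, budget, sigma):
    """Greedily pick `budget` indices by uncertainty-weighted marginal coverage."""
    n = len(features)
    # cover[i] plays the role of k_n: best similarity of x_i to any covered point.
    cover = [max((kernel(features[i], features[j], sigma) for j in labeled),
                 default=0.0) for i in range(n)]
    taken, batch = set(labeled), []
    for _ in range(budget):
        best, best_gain = None, -1.0
        for c in range(n):
            if c in taken:
                continue
            # Marginal gain: only improvements over the current cover count.
            gain = sum(
                uncertainty[i]
                * max(kernel(features[i], features[c], sigma) - cover[i], 0.0)
                for i in range(n)) / n
            if gain > best_gain:
                best, best_gain = c, gain
        batch.append(best)
        taken.add(best)
        cover = [max(cover[i], kernel(features[i], features[best], sigma))
                 for i in range(n)]
    return batch

# Two clusters; the model is far more uncertain about the first one.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
uncertainty = [0.9, 0.9, 0.9, 0.01, 0.01, 0.01]
batch = uherding_batch(features, uncertainty, labeled=[], budget=2, sigma=1.0)
```

Keeping the running maximum in `cover` means each candidate is scored in one pass over the pool, instead of recomputing the maximum over all previously selected points.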

3.2 Parameter Adaptation

We wish to choose parameters of UCoverage such that it smoothly changes from behaving like generalized coverage in low-budget regimes to behaving like uncertainty in high-budget regimes.

Handling the low-budget case: calibration

When $\left\lvert\mathcal{S}\right\rvert$ is small, we would like to roughly replicate GCoverage. We can do this by making our uncertainty function constant:

Proposition. If $U(\mathbf{x};f)\rightarrow c$ for all $\mathbf{x}\in\mathcal{U}$, where $c\geq 0$, the estimated UCoverage $\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S})$ approaches the estimated GCoverage $\widehat{\mathrm{C}}_{k_{\sigma}}(\mathcal{S})$, up to a constant.

When $\left\lvert\mathcal{L}\right\rvert$ is very small, our model $f$ is bad. Models with poor predictive power will generally have near-constant uncertainty if they are well-calibrated. We thus encourage calibration primarily through temperature scaling, a simple but effective post-hoc calibration method (Guo et al., 2017):

  1. Split $\mathcal{L}_{t}$ into $\mathcal{L}_{t}^{\text{train}}$ and $\mathcal{L}_{t}^{\text{val}}$, and train a model $f$ on $\mathcal{L}_{t}^{\text{train}}$, obtaining $f^{\mathcal{L}_{t}^{\text{train}}}$.

  2. Choose $\tau^{*}$ among some candidate set $\mathcal{T}$ to minimize the expected calibration error (ECE) (Naeini et al., 2015) of the temperature-scaled predictions $f(\mathbf{x})/\tau$ on $\mathcal{L}_{t}^{\text{val}}$. Here $f(\mathbf{x})\in\mathbb{R}^{K}$ denotes a logit vector, with $K$ the number of classes.

  3. Compute uncertainties with $\tau^{*}$-scaled softmax: $\log\hat{p}_{\tau^{*}}(\mathbf{x})\propto f_{\tau^{*}}(\mathbf{x}):=f(\mathbf{x})/\tau^{*}$.

The selected temperatures $\tau^{*}$ are generally large when $\lvert\mathcal{L}\rvert$ is small and so $f$ has poor predictions, which makes $U(\mathbf{x};f)$ close to constant. As $\lvert\mathcal{L}\rvert$ increases and $f$’s predictions improve, $\tau^{*}$ decreases (see Figure 1(b), left panel), making uncertainty values more distinct.
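The temperature search in steps 2 and 3 can be sketched as follows. The ECE here is the usual equal-width binned estimate with ten bins; the bin count, the candidate temperatures, and the toy logits are assumptions for the example rather than the paper's settings.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax of a logit vector."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin by confidence; average |accuracy - confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(acc - avg_conf)
    return ece

def select_temperature(val_logits, val_labels, temperatures):
    """Return the candidate temperature with the lowest validation ECE."""
    def ece_at(tau):
        return expected_calibration_error(
            [softmax(z, tau) for z in val_logits], val_labels)
    return min(temperatures, key=ece_at)

# Overconfident toy classifier: large logits but only 50% validation accuracy.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
val_labels = [0, 1, 1, 0]
tau_star = select_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 8.0, 32.0])
```

For this badly calibrated toy model the largest candidate temperature wins, which flattens the predicted probabilities towards uniform and so makes any uncertainty computed from them nearly constant.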

Handling the high-budget case: decreasing $\sigma$

As $\left\lvert\mathcal{L}\right\rvert$ increases, the effect of a single new data point on the trained model tends to become more “semantically local,” implying it is reasonable to treat points as “covering” only closer and closer points in $g$ space by decreasing the radius $\sigma$. As a heuristic, we choose the radius to be the minimum distance between data points in the labeled set $\mathcal{L}$: $\sigma^{*}=\min_{\mathbf{u},\mathbf{v}\in\mathcal{L}_{t},\mathbf{u}\neq\mathbf{v}}\left\lVert g(\mathbf{u})-g(\mathbf{v})\right\rVert$. Since $\mathcal{U}$ is bounded, as $\left\lvert\mathcal{L}\right\rvert$ grows we have that $\sigma^{*}\to 0$ (see Figure 1(b), right panel); thus the proposition below eventually applies, and selection becomes uncertainty-based.

Proposition. Suppose $k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)=\psi\bigl((g(\mathbf{x})-g(\mathbf{x}^{\prime}))/\sigma\bigr)$ for a fixed $g:\mathcal{X}\to\mathbb{R}^{d}$ which is injective on $\mathcal{U}$, and a function $\psi:\mathbb{R}^{d}\to[0,1]$ with $\psi(0)=1$ and, for all $t\in\mathbb{R}^{d}$ with $\left\lVert t\right\rVert=1$, $\lim_{a\to\infty}\psi(at)=0$. If $\sigma\to 0$, the estimated uncertainty coverage $\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S})$ approaches $\sum_{s=1}^{\lvert\mathcal{S}\rvert}U(\mathbf{x}_{s};f)$, up to a constant.
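Both the lengthscale heuristic and the small-$\sigma$ limit are easy to check numerically. The sketch below assumes Euclidean distance in feature space and a Gaussian kernel; the function names and the four toy points are illustrative only.

```python
import math
from itertools import combinations

def adaptive_lengthscale(labeled_features):
    """sigma* = smallest distance between two distinct labeled points in g-space."""
    return min(math.dist(u, v) for u, v in combinations(labeled_features, 2))

def estimated_ucoverage(features, uncertainty, selected, sigma):
    """(1/N) * sum_n U(x_n) * max_{x' in S} exp(-||x_n - x'||^2 / sigma^2)."""
    total = 0.0
    for x, u in zip(features, uncertainty):
        total += u * max(
            math.exp(-math.dist(x, features[s]) ** 2 / sigma ** 2)
            for s in selected)
    return total / len(features)

features = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]
uncertainty = [0.2, 0.7, 0.4, 0.9]

# The radius shrinks as the labeled set grows.
sigma_two = adaptive_lengthscale(features[:2])    # (0,0) and (3,4) labeled
sigma_three = adaptive_lengthscale(features[:3])  # (0,1) added as well

# With a tiny sigma each selected point covers only itself, so the estimate
# is the sum of the selected points' uncertainties, divided by N.
selected = [1, 3]
tiny_sigma = estimated_ucoverage(features, uncertainty, selected, sigma=1e-3)
limit = (uncertainty[1] + uncertainty[3]) / len(features)
```

Note that duplicated feature vectors in the labeled set would drive this heuristic to exactly zero, so a practical implementation would need to guard against that case.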

As we shall see in Section 4.4, UCoverage with fixed $\tau$ and $\sigma$ is not robust across budget levels, performing worse than MaxHerding in low-budget and worse than Margin in high-budget regimes. With our adaptation techniques, however, UCoverage outperforms competitors across label budgets.

Figure 2: Comparison of (a) Margin, (b) MaxHerding, and (c) the proposed UHerding on half-moon toy data.

3.3 Uncertainty Herding

To obtain an actual active learning method, we still need an algorithm to maximize $\widehat{\mathrm{UC}}_{k_{\sigma}}$. We could select a batch by finding $\bar{\mathcal{S}}\in\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{U},\lvert\mathcal{S}\rvert=B}\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$. This is equivalent to the weighted kernel $k$-medoids objective, with weights determined by uncertainty $U(\mathbf{x};f)$; thus, we could use Partitioning Around Medoids (PAM) (Kaufman & Rousseeuw, 2009) to try to find $\bar{\mathcal{S}}$.

Bae et al. (2024), however, observed that even with a highly optimized implementation, this algorithm is much slower to optimize GCoverage than greedy methods, with little improvement in active learning performance. We thus focus on greedy selection, which we call Uncertainty Herding (UHerding) by analogy to MaxHerding (itself an analogy to kernel herding, Chen et al., 2010).

Definition 2 (Uncertainty Herding).

To greedily add a single data point to a set $\mathcal{L}^{\prime}=\mathcal{L}\cup\mathcal{S}$, select

$$\mathbf{x}^{*}\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{L}^{\prime}\cup\{\tilde{\mathbf{x}}\})=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\left(\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n};f_{\tau})\cdot\max_{\mathbf{x}^{\prime}\in\mathcal{L}^{\prime}\cup\{\tilde{\mathbf{x}}\}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime};g)\right). \tag{2}$$

UHerding selects a new batch $\mathcal{S}$ of size $\left\lvert\mathcal{S}\right\rvert$ by picking one point at a time to add to $\mathcal{L}^{\prime}$.

UHerding improves uncertainty measures by accounting for both the uncertainty of a selected point and its influence on reducing nearby uncertainty. It improves on MaxHerding by incorporating uncertainty, putting less weight on covering already-certain points.

Corollary. In the setting of the uniform convergence theorem of Section 3.1, let $\hat{\mathcal{S}}\subseteq\mathcal{U}$ be the result of UHerding for $B$ steps to add to $\mathcal{L}$, and $\mathrm{UC}^{*}=\max_{\mathcal{S}\subseteq\mathcal{U},\left\lvert\mathcal{S}\right\rvert=B}\mathrm{UC}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$ the optimal coverage obtainable among $\mathcal{U}$. Then

$$\mathrm{UC}_{k_{\sigma}}(\mathcal{L}\cup\hat{\mathcal{S}})\geq\left(1-\frac{1}{e}\right)\mathrm{UC}^{*}-\left(2-\frac{1}{e}\right)U_{\max}\sqrt{\frac{B}{N}}\left[8L_{\sigma}+\frac{1}{2}\sqrt{d\log\left(R^{2}\frac{N}{B}\right)+\frac{2}{B}\log\frac{2}{\delta}}\,\right].$$

Figure 2 visually compares the next selected data points (in red) by Margin, MaxHerding, and UHerding (using Margin uncertainty) on a two-class half-moon dataset represented by $\circ$ and $\vartriangle$. A logistic regression model with fifth-order polynomial features is trained at each iteration, with the decision boundary after training on 12 previously selected points (in green) shown as dashed lines. As expected, Margin selects points near the decision boundary, which can lead to suboptimal models if the predicted boundary deviates from the true one. MaxHerding selects the most representative points based on prior selections but ignores model predictions; this gives quick generalization with few labeled points, but performance saturates over time as it neglects points near the boundary. UHerding balances these approaches by initially selecting representative points and gradually focusing on uncertain points near the boundary, improving performance over time.

3.4 Connection to Hybrid Methods

UHerding is closely connected to existing hybrid active learning methods. As mentioned in Section 2, we replace $k$-means and $k$-means++ in various algorithms with greedy kernel $k$-medoids (MaxHerding) to simplify our arguments; this does not harm their effectiveness. We also apply a self-supervised feature extractor $g$ instead of a feature extractor from a classifier $f$, since feature embeddings from $g$ are more informative when $f$ is trained on a small labeled set. Finally, we often assume that $k_{\sigma}$ is in fact a positive-definite (RKHS) kernel; this is true for the kernels we use.

Proposition (Weighted $k$-means of Zhdanov 2019). Define an uncertainty measure $U(\mathbf{x};f)$ from another uncertainty measure $U^{\prime}(\mathbf{x};f)$ as $U(\mathbf{x};f):=U^{\prime}(\mathbf{x};f)\cdot\mathbbm{1}[U^{\prime}(\mathbf{x};f)\geq\nu]$, where $\nu\geq 0$ satisfies $\sum_{n=1}^{N}\mathbbm{1}[U^{\prime}(\mathbf{x}_{n};f)\geq\nu]=M$, a pre-defined number. Then weighted $k$-means with uncertainty $U^{\prime}$, changed to use greedy kernel $k$-medoids, is UHerding with uncertainty $U$ and the same kernel.

As we shall see in Section 4, UHerding significantly outperforms weighted $k$-means, indicating that it is crucial to (a) convert $k$-means into MaxHerding and (b) apply parameter adaptation.
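The thresholded uncertainty used in the weighted $k$-means connection amounts to keeping the $M$ largest raw uncertainty values and zeroing the rest. A minimal sketch, with a hypothetical function name and made-up values:

```python
def top_m_uncertainty(raw_uncertainty, m):
    """U(x) = U'(x) * 1[U'(x) >= nu], with nu set to the M-th largest U' value."""
    nu = sorted(raw_uncertainty, reverse=True)[m - 1]
    return [u if u >= nu else 0.0 for u in raw_uncertainty]

raw = [0.1, 0.8, 0.3, 0.6, 0.05]
thresholded = top_m_uncertainty(raw, m=2)
```

If several points tie at the threshold, more than $M$ of them survive in this sketch; the proposition implicitly assumes a $\nu$ for which exactly $M$ do.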

Proposition (ALFA-Mix of Parvaneh et al. 2022). Let $\hat{y}(\cdot;f)$ be the predicted label of an input under $f$. Define an uncertainty measure

$$U(\mathbf{x};f):=\mathbbm{1}\left[\exists\,\text{ class }j\text{ s.t. }\;\hat{y}\bigl(\alpha_{j}(\mathbf{x})\,g(\mathbf{x})+(1-\alpha_{j}(\mathbf{x}))\,\bar{g}_{j};f\bigr)\neq\hat{y}(g(\mathbf{x});f)\right] \tag{3}$$

where $\bar{g}_{j}$ is the mean of feature representations belonging to class $j$ and $\alpha_{j}(\mathbf{x})\in[0,1)$ is the same parameter as determined by ALFA-Mix. Then ALFA-Mix, with clustering replaced by greedy kernel $k$-medoids, is UHerding with uncertainty $U$ and the same kernel.

The equivalence of weighted $k$-means and ALFA-Mix to UHerding with the right choice of uncertainty measure implies that Sections 3.2 and 3.2 also apply to them. There is a weaker connection to BADGE; UHerding and BADGE are not equivalent with any choice of uncertainty measure. However, a variant of BADGE with greedy kernel $k$-medoids uses the kernel

$$h(\mathbf{x}_{n},\mathbf{x}')=2\langle q(\mathbf{x}_{n}),q(\mathbf{x}')\rangle\,k_{\sigma}(\mathbf{x}_{n},\mathbf{x}';g)-\lVert q(\mathbf{x}_{n})\rVert_{2}^{2}\,k_{\sigma}(\mathbf{x}_{n},\mathbf{x}_{n};g)-\lVert q(\mathbf{x}')\rVert_{2}^{2}\,k_{\sigma}(\mathbf{x}',\mathbf{x}';g) \tag{4}$$

where $q(\mathbf{x})=\hat{y}(\mathbf{x};f)-\hat{p}(\mathbf{x};f)$ with $f$ being a classifier. This variant satisfies the following statement, which shows that properties similar to Sections 3.2 and 3.2 hold.

Proposition (BADGE, Ash et al., 2020). If $\hat{p}(\mathbf{x};f)\rightarrow\frac{1}{K}\vec{1}$ for all $\mathbf{x}\in\mathcal{U}$, then this BADGE variant approaches a slightly modified MaxHerding, which uses $\left(\mathbbm{1}\left[\hat{y}(\mathbf{x}_{n};f)=\hat{y}(\mathbf{x}';f)\right]-\frac{1}{K}\right)k_{\sigma}(\mathbf{x}_{n},\mathbf{x}';g)$ instead of $k_{\sigma}(\mathbf{x}_{n},\mathbf{x}';g)$. If $\sigma\rightarrow 0$, it approaches the uncertainty-based method where uncertainty is defined as $U''(\tilde{\mathbf{x}}) := \min_{\mathbf{x}'\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\lVert\hat{y}(\mathbf{x}';f)-\hat{p}(\mathbf{x}';f)\rVert_{2}^{2}$.

With our parameter adaptation heuristics, hybrid methods can also smoothly interpolate between MaxHerding and uncertainty; Figure 6(b) shows this improves BADGE. Although selecting data points maximizing $U''(\tilde{\mathbf{x}})$ is counter-intuitive, lowering $\sigma$ still helps, because $\sigma$ stays away from zero.
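The kernel in Equation 4 can be evaluated directly from features and predicted probabilities. The sketch below is illustrative only: it assumes a Gaussian form for $k_{\sigma}$ (the exact parameterization is not restated in this section), and the function names `gaussian_kernel`, `q`, and `badge_kernel` are hypothetical.

```python
import math


def gaussian_kernel(a, b, sigma):
    # Assumed form of k_sigma; the paper's kernel may be parameterized differently.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma ** 2)


def q(probs):
    # One-hot vector of the predicted label minus the predicted probabilities.
    top = max(range(len(probs)), key=probs.__getitem__)
    return [(1.0 if j == top else 0.0) - p for j, p in enumerate(probs)]


def badge_kernel(feat_a, probs_a, feat_b, probs_b, sigma):
    # h(x, x') = 2 <q, q'> k(x, x') - ||q||^2 k(x, x) - ||q'||^2 k(x', x')
    qa, qb = q(probs_a), q(probs_b)
    dot = sum(x * y for x, y in zip(qa, qb))
    norm_a = sum(x * x for x in qa)
    norm_b = sum(y * y for y in qb)
    return (2 * dot * gaussian_kernel(feat_a, feat_b, sigma)
            - norm_a * gaussian_kernel(feat_a, feat_a, sigma)
            - norm_b * gaussian_kernel(feat_b, feat_b, sigma))


h_same = badge_kernel([0.0, 0.0], [0.7, 0.2, 0.1], [0.0, 0.0], [0.7, 0.2, 0.1], 1.0)
h_diff = badge_kernel([0.0, 0.0], [0.7, 0.2, 0.1], [1.0, 1.0], [0.1, 0.8, 0.1], 1.0)
```

Under these assumptions $h$ is a negated squared distance in a product feature space, so it is zero for identical inputs and never positive.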

4 Experiments

In this section, we assess the robustness of the proposed UHerding against existing active learning methods for standard supervised learning (Sections 4.1 and 4.2) and transfer learning (Section 4.3) across several benchmark datasets: CIFAR10 (Krizhevsky, 2009), CIFAR100 (Krizhevsky et al., ), TinyImageNet (mnmoustafa, 2017), ImageNet (Deng et al., 2009), and DomainNet (Peng et al., 2019). Section 4.4 gives ablation studies to see how each component of UHerding contributes.

Active learning methods

We compare with several active learning methods listed below. We exclude ProbCover; its generalization MaxHerding reliably outperforms it (Bae et al., 2024).

Random

Select $B$ data points uniformly at random from $\mathcal{U}$.

Confidence

Iteratively select $\mathbf{x}^{*}\in\operatorname*{arg\,min}_{\tilde{\mathbf{x}}\in\mathcal{U}}p_{1}(y\mid\tilde{\mathbf{x}})$, with $p_{1}$ the highest predicted probability.

Margin

Iteratively select $\mathbf{x}^{*}=\operatorname*{arg\,min}_{\tilde{\mathbf{x}}\in\mathcal{U}}p_{1}(y\mid\tilde{\mathbf{x}})-p_{2}(y\mid\tilde{\mathbf{x}})$, where $p_{2}$ is the second-highest predicted probability (Scheffer et al., 2001).

Entropy

Iteratively select $\mathbf{x}^{*}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\operatorname{H}(\hat{y}(\tilde{\mathbf{x}})\mid\tilde{\mathbf{x}})$, where $\operatorname{H}(\cdot)$ is the Shannon entropy (Wang & Shang, 2014).

Weighted $k$-means

Select points closest to the centroids of weighted $k$-means using unlabeled data points with high enough margins (as in Margin selection above) (Zhdanov, 2019).

BADGE

Select points with the $k$-means++ initialization algorithm using gradient embeddings w.r.t. the weights of the last layer (Ash et al., 2020).

ALFA-Mix

Select data points closest to the centroids of $k$-means fitted only on uncertain unlabeled data points, where uncertainty is determined by feature interpolation (Parvaneh et al., 2022).

ActiveFT

Select data points close to parameterized “centroids,” learned by minimizing the KL divergence between the unlabeled set and the selected set, along with diversity regularization (Xie et al., 2023).

Typiclust

Run a clustering algorithm, e.g., $k$-means. For each cluster, select the point with the highest “typicality,” computed using $m$-nearest neighbors (Hacohen et al., 2022).

Coreset

Iteratively select points with the $k$-Center-Greedy algorithm (Sener & Savarese, 2018): $\mathbf{x}^{*}=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\min_{\mathbf{x}'\in\mathcal{L}}\lVert\tilde{\mathbf{x}}-\mathbf{x}'\rVert_{2}$.

MaxHerding

Iteratively select points to maximize the generalized coverage (1) (Bae et al., 2024).
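The three pure uncertainty baselines above (Confidence, Margin, Entropy) differ only in how they score a vector of predicted class probabilities. A minimal sketch, assuming softmax outputs are already available as plain lists; the function names and the toy pool are hypothetical:

```python
import math


def confidence_score(probs):
    # Highest predicted probability p_1; lower means more uncertain.
    return max(probs)


def margin_score(probs):
    # Gap p_1 - p_2 between the two largest probabilities; lower means more uncertain.
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def entropy_score(probs):
    # Shannon entropy of the predictive distribution; higher means more uncertain.
    return -sum(p * math.log(p) for p in probs if p > 0)


def select(pool_probs, method):
    """Return the index of the single pool point the given rule would pick."""
    idx = range(len(pool_probs))
    if method == "confidence":
        return min(idx, key=lambda i: confidence_score(pool_probs[i]))
    if method == "margin":
        return min(idx, key=lambda i: margin_score(pool_probs[i]))
    if method == "entropy":
        return max(idx, key=lambda i: entropy_score(pool_probs[i]))
    raise ValueError(f"unknown method: {method}")


pool = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.5, 0.5, 0.0]]
picks = {m: select(pool, m) for m in ("confidence", "margin", "entropy")}
```

On this toy pool the rules disagree: Margin prefers the point with a tie between its top two classes, while Confidence and Entropy prefer the point whose mass is spread over all three classes.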

Implementation

We always re-initialize models (cold-start), randomly or from pre-trained parameters, after each round of acquisition, rather than warm-starting from the previous model (Ash & Adams, 2020). We manually categorize the budget regimes into low, mid, and high for supervised learning, and low and high for transfer learning tasks. Representation-based methods win in low-budget regimes, while uncertainty-based methods begin to catch up in the mid-budget regime and win in the high-budget regime.

Figure 3: UHerding versus MaxHerding and uncertainty, with different uncertainty measures. Mean and standard deviation of 5 runs of the difference between a method and Random selection.

4.1 UHerding Interpolates, and the Uncertainty Measure Doesn’t Matter

We first aim to verify that UHerding effectively interpolates between MaxHerding and uncertainty measures across budget regimes. We train a randomly-initialized ResNet18 (He et al., 2016) on CIFAR10 using 5 random seeds, gradually increasing the size of the labeled set, as shown in Figure 3. We consider UHerding based on Margin (Mar.), Entropy (Ent.), and Confidence (Conf.). The y-axis of Figure 3 represents $\Delta$Acc, indicating the performance difference from Random selection.

UHerding performs slightly better or comparably to MaxHerding in the low-budget regime, with a growing performance gap in the mid-budget regime, and substantial improvements in the high-budget regime. The opposite is true with uncertainty measures: UHerding is substantially better with low budgets, while they eventually catch up and tie UHerding with high budgets.

The choice of uncertainty measure has little impact on performance. We thus only use Margin as the UHerding uncertainty measure in the future unless otherwise specified.

Figure 4: Comparison on (a) CIFAR100 and (b) TinyImageNet for supervised-learning tasks.
Figure 5: Comparison on (a) CIFAR100 and (b) DomainNet for transfer learning tasks.

4.2 Comparison with State of the Art

We now compare UHerding with state-of-the-art active learning methods, particularly hybrids, to assess their robustness across different budget regimes. Figure 4 compares the methods on CIFAR100 and TinyImageNet, using 3 runs of a randomly initialized ResNet18; CIFAR10 results are in Appendix B.

Overall trends are similar to before: representation-based methods are good with low budgets but lose as the budget increases; uncertainty-based and hybrid methods are largely the opposite. UHerding with Margin uncertainty (UHerding Mar.) wins convincingly: no competitor ever outperforms UHerding, and for each competitor there is some budget where UHerding wins substantially.

4.3 Comparison for Transfer Learning Tasks

Fine-tuning foundation models to new tasks or datasets is of increasing importance. Inspired by ActiveFT (Xie et al., 2023), we compare UHerding to leading approaches for active transfer learning.

We use DeiT (Touvron et al., 2021) pre-trained on ImageNet (Deng et al., 2009), following Parvaneh et al. (2022); Xie et al. (2023). We fine-tune the entire model, using DeiT Small for CIFAR100 and DeiT Base for DomainNet. Figure 5 compares UHerding with various active learning methods, including ActiveFT, which consistently underperforms the other methods, likely because its design is optimized for single-iteration data selection. On CIFAR100, UHerding is comparable to representation-based methods in the low-budget regime but surpasses them by about 2% in the high-budget regime; it substantially outperforms uncertainty and hybrid methods in the low-budget regime and ties or wins with high budgets. On DomainNet, UHerding outperforms MaxHerding by 1.5 to 2% even in the low-budget regime. Compared to the best-performing hybrid method, ALFA-Mix, UHerding wins by up to 13% with low budgets and performs similarly with high budgets.

It is also common to fine-tune only the last few layers, especially in meta-learning (Wang et al., 2019; Chen et al., 2019; Goldblum et al., 2020) and self-supervised learning (Chen et al., 2020; Caron et al., 2021). Similarly to Bae et al. (2024), we use DINO (Caron et al., 2021) features fixed through fine-tuning; we train a head of three fully connected ReLU layers on ImageNet. Figure 6(a) shows similar results, with UHerding consistently outperforming other methods across various budget regimes.

Table 1 summarizes the results of Sections 4.2 and 4.3 and Appendix B, reporting the mean improvement/degradation over Random for each budget regime. (Footnote 5: ActiveFT is reported only for DomainNet (Dom.) and ImageNet (ImN.), as it is designed for fine-tuning.) UHerding wins across all budget regimes, while other methods have a significant range of budgets where they are worse than Random.

Figure 6: (a) Results on ImageNet for transfer learning (fine-tuning); (b) application of parameter adaptation to BADGE.

4.4 Ablation Study

Figure 6(b) gradually modifies BADGE to be similar to UHerding with Margin uncertainty on CIFAR100, replacing $k$-means++ with MaxHerding, then adding parameter adaptation. Each step improves; so does the final step to UHerding, which changes the choice of uncertainty.

In Appendix C, we analyze the contributions of each parameter adaptation component in UHerding. The baseline with fixed parameters performs slightly worse than MaxHerding at low- and mid-budgets, and worse than Margin at high-budgets. Adding temperature scaling results in notable improvements across all budget regimes, with further gains from incorporating radius adaptation.

Method     |  Low                          |  Middle           |  High
           |   C10  C100 Tiny.  Dom.  ImN. |   C10  C100 Tiny. |   C10  C100 Tiny.  Dom.  ImN.
Entropy    |  -1.8  -1.7  -0.6  -0.2   1.7 |  -1.6  -2.9  -1.7 |   2.2  -0.6  -0.7   0.7   1.2
Margin     |  -0.4  -0.3  -0.2   1.0   1.8 |  -0.1  -0.4  -0.3 |   2.5   1.1   0.0   1.5   0.9
BADGE      |  -0.5  -0.1  -0.2   1.4   2.0 |   0.6  -0.7   0.0 |   2.2   0.9   0.4   1.8   1.0
ALFA-M     |   0.1   0.9  -0.3   2.8   5.1 |   1.1   0.6   0.1 |   2.3   1.3   0.2   1.9   1.0
Weight. k  |  -0.5  -0.1  -0.3   2.1   3.8 |   0.9   0.0  -0.2 |   1.8   0.8   0.3   1.7   0.8
Coreset    |  -2.7  -4.5  -1.4  -3.5  -6.6 |   -13   -11  -5.4 |   -10  -9.6  -5.5  -2.7   -12
ActiveFT   |   n/a   n/a   n/a   4.4   6.6 |   n/a   n/a   n/a |   n/a   n/a   n/a   0.0  -0.1
Typiclust  |   3.7   3.3   1.6   3.1   4.9 |   4.9   1.8   2.1 |  -0.8  -0.1   0.3  -3.2  -9.9
MaxHerd.   |   5.0   4.1   2.1   6.2  10.6 |   6.2   2.8   1.9 |   0.1  -2.2  -1.5   1.0  -1.2
UHerding   |   5.5   5.5   3.1   7.4  11.2 |   7.8   4.3   3.7 |   3.0   2.1   0.8   2.3   2.0

Table 1: Comparison of the mean improvement/degradation over Random selection on each budget regime and dataset. The first, second, third best results for each setting are marked.

5 Conclusion

In this work, we introduced uncertainty coverage, an objective that unifies low- and high-budget active learning objectives through a smooth interpolation with adaptive parameter adjustments. We showed generalization guarantees for the optimization of this coverage. By identifying conditions under which uncertainty coverage approaches generalized coverage and uncertainty measures, we made UHerding robust across various budget regimes. This adaptation also enhances an existing hybrid active learning method when similar parameter adjustments are applied. UHerding achieves state-of-the-art performance across various active learning and transfer learning tasks and, to our knowledge, is the only method that consistently performs well in both low- and high-budget settings.

Acknowledgments

This work was supported in part by Mitacs through the Mitacs Accelerate program, the Natural Sciences and Engineering Research Council of Canada, the Canada CIFAR AI chairs program, and Advanced Research Computing at the University of British Columbia.

References

  • Aghaee et al. (2016) Amin Aghaee, Mehrdad Ghadiri, and Mahdieh Soleymani Baghshah. Active distance-based clustering using k-medoids. In PAKDD, 2016.
  • Ash & Adams (2020) Jordan Ash and Ryan P Adams. On warm-starting neural network training. In NeurIPS, 2020.
  • Ash et al. (2021) Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Gone fishing: Neural active learning with fisher embeddings. 2021.
  • Ash et al. (2020) Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In ICLR, 2020.
  • Bae et al. (2024) Wonho Bae, Junhyug Noh, and Danica J Sutherland. Generalized coverage for more robust low-budget active learning. In ECCV, 2024.
  • Bıyık et al. (2019) Erdem Bıyık, Kenneth Wang, Nima Anari, and Dorsa Sadigh. Batch active learning using determinantal point processes. In NeurIPS, 2019.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  • Chen et al. (2019) Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.
  • Chen et al. (2010) Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In UAI, 2010.
  • Cucker & Smale (2001) Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 2001.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Donmez et al. (2007) Pinar Donmez, Jaime G Carbonell, and Paul N Bennett. Dual strategy active learning. In ECML, 2007.
  • Doucet et al. (2024) Paul Doucet, Benjamin Estermann, Till Aczel, and Roger Wattenhofer. Bridging diversity and uncertainty in active learning with self-supervised pre-training. In ICLR Workshop, 2024.
  • Freytag et al. (2014) Alexander Freytag, Erik Rodner, and Joachim Denzler. Selecting influential examples: Active learning with expected model output changes. In ECCV, 2014.
  • Gal et al. (2017) Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In ICML, 2017.
  • Goldblum et al. (2020) Micah Goldblum, Steven Reich, Liam Fowl, Renkun Ni, Valeriia Cherepanova, and Tom Goldstein. Unraveling meta-learning: Understanding feature representations for few-shot tasks. In ICML, 2020.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In ICML, 2017.
  • Guo & Greiner (2007) Yuhong Guo and Russell Greiner. Optimistic active-learning using mutual information. In IJCAI, 2007.
  • Hacohen & Weinshall (2024) Guy Hacohen and Daphna Weinshall. How to select which active learning strategy is best suited for your specific problem and budget. In NeurIPS, 2024.
  • Hacohen et al. (2022) Guy Hacohen, Avihu Dekel, and Daphna Weinshall. Active learning on a budget: Opposite strategies suit high and low budgets. In ICML, 2022.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Käding et al. (2016) Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. Active and continuous exploration with deep neural networks and expected model output changes. In NIPSW, 2016.
  • Käding et al. (2018) Christoph Käding, Erik Rodner, Alexander Freytag, Oliver Mothes, Björn Barz, Joachim Denzler, and Carl Zeiss AG. Active learning for regression tasks with expected model output changes. In BMVC, 2018.
  • Kaufman & Rousseeuw (2009) Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
  • Kirsch et al. (2019) Andreas Kirsch, Joost Van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In NeurIPS, 2019.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.
  • (28) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-100 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Kye et al. (2023) Seong Min Kye, Kwanghee Choi, Hyeongmin Byun, and Buru Chang. Tidal: Learning training dynamics for active learning. In ICCV, 2023.
  • Lewis & Catlett (1994) David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings, 1994.
  • Lewis & Gale (1994) David D Lewis and William A Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994.
  • Mahmood et al. (2022) Rafid Mahmood, Sanja Fidler, and Marc T Law. Low budget active learning via wasserstein distance: An integer programming approach. In ICLR, 2022.
  • mnmoustafa (2017) Mohammed Ali mnmoustafa. Tiny imagenet, 2017. URL https://kaggle.com/competitions/tiny-imagenet.
  • Mohamadi et al. (2022) Mohamad Amin Mohamadi, Wonho Bae, and Danica J Sutherland. Making look-ahead active learning strategies feasible with neural tangent kernels. In NeurIPS, 2022.
  • Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015.
  • Nemhauser et al. (1978) G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming, 1978.
  • Nguyen & Smeulders (2004) Hieu T Nguyen and Arnold Smeulders. Active learning using pre-clustering. In ICML, 2004.
  • Parvaneh et al. (2022) Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Gholamreza Reza Haffari, Anton Van Den Hengel, and Javen Qinfeng Shi. Active learning by feature mixing. In CVPR, 2022.
  • Peng et al. (2019) Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
  • Roy & McCallum (2001) Nicholas Roy and Andrew McCallum. Toward optimal active learning through monte carlo estimation of error reduction. In ICML, 2001.
  • Scheffer et al. (2001) Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for information extraction. In ISIDA, 2001.
  • Schubert & Lenssen (2022) Erich Schubert and Lars Lenssen. Fast k-medoids clustering in Rust and Python. Journal of Open Source Software, 2022.
  • Schubert & Rousseeuw (2019) Erich Schubert and Peter J Rousseeuw. Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In Similarity Search and Applications: 12th International Conference, SISAP 2019, Newark, NJ, USA, October 2–4, 2019, Proceedings 12, 2019.
  • Schubert & Rousseeuw (2021) Erich Schubert and Peter J Rousseeuw. Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms. Information Systems, 2021.
  • Sener & Savarese (2018) Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In ICLR, 2018.
  • Settles (2009) Burr Settles. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
  • Settles & Craven (2008) Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 2008.
  • Settles et al. (2007) Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. 2007.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  • Voevodski et al. (2012) Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Active clustering of biological sequences. In JMLR, 2012.
  • Wang & Shang (2014) Dan Wang and Yi Shang. A new active labeling method for deep learning. In IJCNN, 2014.
  • Wang et al. (2019) Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. arXiv:1911.04623, 2019.
  • Xie et al. (2023) Yichen Xie, Han Lu, Junchi Yan, Xiaokang Yang, Masayoshi Tomizuka, and Wei Zhan. Active finetuning: Exploiting annotation budget in the pretraining-finetuning paradigm. In CVPR, 2023.
  • Xu et al. (2003) Zhao Xu, Kai Yu, Volker Tresp, Xiaowei Xu, and Jizhi Wang. Representative sampling for text classification using support vector machines. In ECIR, 2003.
  • Yehuda et al. (2022) Ofer Yehuda, Avihu Dekel, Guy Hacohen, and Daphna Weinshall. Active learning through a covering lens. In NeurIPS, 2022.
  • Zhdanov (2019) Fedor Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
  • Zhu et al. (2003) Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In ICML Workshop, 2003.

Appendix A Proofs

A.1 Estimation Quality


Proof.

Rather than operating directly on sets $\mathcal{S}\subseteq\mathcal{X}$, we will operate on $B$-tuples $\mathcal{T}\in(\mathbb{R}^{d})^{B}$ corresponding to $(g(\mathbf{s}_{1}),\dots,g(\mathbf{s}_{B}))$. As these are ordered tuples, the mapping from $\mathcal{T}$ to $\mathcal{S}$ is many-to-one even if $g$ is injective; if we prove convergence for all $\mathcal{T}$, it will necessarily prove convergence for all $\mathcal{S}$. Define $F(\mathcal{T})=\mathrm{UC}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$ and $\hat{F}(\mathcal{T})=\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$ for any $\mathcal{S}$ corresponding to that $\mathcal{T}$; this is well-defined, since $\mathrm{UC}_{k_{\sigma}}$ depends on $\mathcal{S}$ only through $\{g(\mathbf{s}):\mathbf{s}\in\mathcal{S}\}$.

Now, we will construct a vector space for elements $\mathcal{T}=(\mathbf{t}_{1},\dots,\mathbf{t}_{B})$. Vector addition and scalar multiplication are defined elementwise, and a norm is defined as $\lVert\mathcal{T}\rVert=\max(\lVert\mathbf{t}_{1}\rVert,\dots,\lVert\mathbf{t}_{B}\rVert)$, where $\lVert\mathbf{t}_{i}\rVert$ is the standard Euclidean norm. This space is complete, and hence a Banach space of dimension $Bd$.

The uncertainty coverage $F(\mathcal{T})$ is Lipschitz with respect to the Banach space norm:

$$\begin{aligned}
\left\lvert F(\mathcal{T})-F(\mathcal{T}')\right\rvert
&\leq\mathbb{E}_{\mathbf{x}}U(\mathbf{x};f)\left\lvert\max_{\mathbf{t}\in\{g(\tilde{\mathbf{x}}):\tilde{\mathbf{x}}\in\mathcal{L}\}\cup\mathcal{T}}\tilde{k}_{\sigma}(g(\mathbf{x}),\mathbf{t})-\max_{\mathbf{t}'\in\{g(\tilde{\mathbf{x}}):\tilde{\mathbf{x}}\in\mathcal{L}\}\cup\mathcal{T}'}\tilde{k}_{\sigma}(g(\mathbf{x}),\mathbf{t}')\right\rvert \\
&\leq\mathbb{E}_{\mathbf{x}}U(\mathbf{x};f)\max_{i\in[B]}\left\lvert\tilde{k}_{\sigma}(g(\mathbf{x}),\mathbf{t}_{i})-\tilde{k}_{\sigma}(g(\mathbf{x}),\mathbf{t}'_{i})\right\rvert \\
&\leq\mathbb{E}_{\mathbf{x}}U(\mathbf{x};f)\max_{i\in[B]}L_{\sigma}\left\lVert\mathbf{t}_{i}-\mathbf{t}'_{i}\right\rVert \\
&=L_{\sigma}\left(\mathbb{E}_{\mathbf{x}}U(\mathbf{x};f)\right)\left\lVert\mathcal{T}-\mathcal{T}'\right\rVert.
\end{aligned}$$

The second inequality holds because the maximum function is Lipschitz on $\mathbb{R}^{N}$. Specifically, let $a_{1},\dots,a_{N},a'_{1},\dots,a'_{N}\in\mathbb{R}$, and let $\hat{n}\in\operatorname*{arg\,max}_{n\in[N]}a_{n}$. Then

$$\max_{n\in[N]}a_{n}-\max_{n\in[N]}a'_{n}=a_{\hat{n}}-\max_{n\in[N]}a'_{n}\leq a_{\hat{n}}-a'_{\hat{n}}\leq\max_{n\in[N]}\left(a_{n}-a'_{n}\right)\leq\max_{n\in[N]}\left\lvert a_{n}-a'_{n}\right\rvert,$$

and, symmetrically, it is at least $-\max_{n}\left\lvert a_{n}-a'_{n}\right\rvert$, so $\left\lvert\max_{n}a_{n}-\max_{n}a'_{n}\right\rvert\leq\max_{n}\left\lvert a_{n}-a'_{n}\right\rvert$.
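As a quick numerical sanity check of the inequality just shown, the following sketch draws random vectors and counts violations of $\lvert\max_{n}a_{n}-\max_{n}a'_{n}\rvert\leq\max_{n}\lvert a_{n}-a'_{n}\rvert$. The helper names, the seed, and the sampling range are arbitrary choices made for illustration.

```python
import random


def max_gap(a, b):
    # Left-hand side: difference between the two maxima.
    return abs(max(a) - max(b))


def sup_gap(a, b):
    # Right-hand side: largest coordinate-wise difference.
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
violations = 0
for _ in range(1000):
    a = [rng.uniform(-1, 1) for _ in range(8)]
    b = [rng.uniform(-1, 1) for _ in range(8)]
    # Small tolerance guards against floating-point rounding only.
    if max_gap(a, b) > sup_gap(a, b) + 1e-12:
        violations += 1
```

A random check of this kind does not replace the argument above; it only illustrates that the maximum is 1-Lipschitz in the sup norm.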

As $\hat{F}$ is exactly $F$ with an empirical distribution for $\mathbf{x}$, it is $L_{\sigma}\left(\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\right)$-Lipschitz. We thus have

$$\begin{aligned}
\left\lvert\bigl(F(\mathcal{T})-\hat{F}(\mathcal{T})\bigr)-\bigl(F(\mathcal{T}')-\hat{F}(\mathcal{T}')\bigr)\right\rvert
&\leq L_{\sigma}\left(\mathbb{E}_{\mathbf{x}}U(\mathbf{x})+\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\right)\left\lVert\mathcal{T}-\mathcal{T}'\right\rVert \\
&\leq 2L_{\sigma}U_{\max}\left\lVert\mathcal{T}-\mathcal{T}'\right\rVert.
\end{aligned}$$

By Proposition 5 of Cucker & Smale (2001), we can cover the ball $\{\mathcal{T}:\lVert\mathcal{T}\rVert\leq R\}$ with at most $\left(4R/\eta\right)^{Bd}$ balls of radius $\eta$ with respect to the metric $\lVert\mathcal{T}-\mathcal{T}'\rVert$. So, to construct our covering argument, we apply the (bidirectional) Hoeffding inequality to the center of each of these balls, with failure probability $\delta/\left(4R/\eta\right)^{Bd}$ for each. Combining this with how much $F-\hat{F}$ can change between an arbitrary point in $\{\mathcal{T}:\lVert\mathcal{T}\rVert\leq R\}$ and the nearest center, we have that for all $\eta\in(0,R)$, it holds with probability at least $1-\delta$ that

$$\sup_{\mathcal{T}}\left\lvert F(\mathcal{T})-\hat{F}(\mathcal{T})\right\rvert\leq 2L_{\sigma}U_{\max}\eta+U_{\max}\sqrt{\frac{Bd}{2N}\log\frac{4R}{\eta}+\frac{1}{2N}\log\frac{2}{\delta}}.$$

The result follows by picking $\eta=4\sqrt{B/N}$ and using $\log(a)=\frac{1}{2}\log(a^{2})$. ∎

A.2 Parameter Adaptation


Proof.

As $U(\mathbf{x})\rightarrow c$ for all $\mathbf{x}\in\mathcal{X}$, the following equality holds:

$$\begin{aligned}
\lim_{U(\mathbf{x})\rightarrow c}\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S})
&=\lim_{U(\mathbf{x})\rightarrow c}\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\cdot\max_{\mathbf{x}'\in\mathcal{S}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}') \\
&=\frac{1}{N}\sum_{n=1}^{N}\lim_{U(\mathbf{x}_{n})\rightarrow c}U(\mathbf{x}_{n})\cdot\max_{\mathbf{x}'\in\mathcal{S}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}')=c\cdot\widehat{\mathrm{C}}_{k_{\sigma}}(\mathcal{S}).
\end{aligned}$$

The limit can be moved inside the finite sum because each summand converges to a specific value. ∎


Proof.

As $\sigma\rightarrow 0$, the function $k_{\sigma}$ approaches the following form:

$$k_{\sigma}(\mathbf{x},\mathbf{x}')=\begin{cases}1&\text{if }\mathbf{x}=\mathbf{x}',\\ 0&\text{otherwise}.\end{cases}$$

Then, we have

$$\begin{aligned}
\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S})
&=\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\cdot\max_{\mathbf{x}'\in\mathcal{S}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}') \\
&=\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\cdot\mathbbm{1}[\,\exists\,\mathbf{x}'\in\mathcal{S}\text{ s.t. }\mathbf{x}_{n}=\mathbf{x}']\propto\sum_{s=1}^{\lvert\mathcal{S}\rvert}U(\mathbf{x}_{s}).
\end{aligned}$$

Note that the indicator function is 1 only if $\mathbf{x}_{n}$ is equal to one of the data points in $\mathcal{S}$. ∎
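Both limiting cases in this subsection can be illustrated numerically. The sketch below is a toy check rather than part of the proofs: it assumes a Gaussian form for $k_{\sigma}$, treats the selected set as a subset of the pool, and uses hypothetical names (`kernel`, `uncertainty_coverage`) and made-up features and uncertainties.

```python
import math


def kernel(a, b, sigma):
    # Assumed Gaussian form for k_sigma.
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / sigma ** 2)


def uncertainty_coverage(feats, unc, selected, sigma):
    """Empirical estimate: mean over the pool of U(x_n) * max_{x' in S} k(x_n, x')."""
    total = 0.0
    for x, u in zip(feats, unc):
        total += u * max(kernel(x, feats[s], sigma) for s in selected)
    return total / len(feats)


feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
unc = [0.2, 0.9, 0.5, 0.7]
S = [1, 3]  # indices of the selected points

# Constant uncertainty c: the estimate equals c times the plain coverage (U = 1).
c = 0.3
uc_const = uncertainty_coverage(feats, [c] * len(feats), S, sigma=1.0)
cov = uncertainty_coverage(feats, [1.0] * len(feats), S, sigma=1.0)

# Very small sigma: each selected point covers only itself, so the estimate
# reduces to the sum of the selected uncertainties divided by N.
uc_small = uncertainty_coverage(feats, unc, S, sigma=1e-3)
expected_small = (unc[1] + unc[3]) / len(feats)
```

With constant uncertainty the objective ranks sets exactly as the generalized coverage does, and with a vanishing bandwidth it ranks them by total uncertainty, matching the two statements above.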

A.3 Greedy algorithm


Proof.

First, $\widehat{\mathrm{UC}}_{k_{\sigma}}$ is submodular by Lemma 2. Thus, if we let $\bar{\mathcal{S}}\in\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{U}:\lvert\mathcal{S}\rvert=B}\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S})$, the classical result of Nemhauser et al. (1978) implies that $\widehat{\mathrm{UC}}_{k_{\sigma}}(\hat{\mathcal{S}})\geq\left(1-\frac{1}{e}\right)\widehat{\mathrm{UC}}_{k_{\sigma}}(\bar{\mathcal{S}})$.

Let $\mathcal{S}^{*}\in\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{U}:\lvert\mathcal{S}\rvert=B}\mathrm{UC}_{k_{\sigma}}(\mathcal{S})$ be the optimal size-$B$ subset of $\mathcal{U}$ for $\mathrm{UC}_{k_{\sigma}}$. By definition, $\widehat{\mathrm{UC}}_{k_{\sigma}}(\bar{\mathcal{S}})\geq\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S}^{*})$. Thus, calling the bound on the worst-case absolute error of the coverage estimate $\varepsilon$, it holds with probability at least $1-\delta$ that

$$\mathrm{UC}_{k_{\sigma}}(\hat{\mathcal{S}})\geq\widehat{\mathrm{UC}}_{k_{\sigma}}(\hat{\mathcal{S}})-\varepsilon\geq\left(1-\frac{1}{e}\right)\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{S}^{*})-\varepsilon\geq\left(1-\frac{1}{e}\right)\mathrm{UC}_{k_{\sigma}}(\mathcal{S}^{*})-\left(2-\frac{1}{e}\right)\varepsilon.$$ ∎

The following result guarantees that greedy optimization of $\mathrm{UC}_{k_{\sigma}}$ or $\widehat{\mathrm{UC}}_{k_{\sigma}}$ achieves a $\left(1-1/e\right)$ approximation, by the result of Nemhauser et al. (1978).

Lemma 2.

The functions $\mathrm{UC}_{k_{\sigma}}$ and $\widehat{\mathrm{UC}}_{k_{\sigma}}$ are nonnegative, submodular, monotone functions; thus so are the functions $\mathcal{S}\mapsto\mathrm{UC}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$ and $\mathcal{S}\mapsto\widehat{\mathrm{UC}}_{k_{\sigma}}(\mathcal{L}\cup\mathcal{S})$ for any fixed $\mathcal{L}$.

Proof.

We prove that the uncertainty coverage $\mathrm{UC}_{k_{\sigma}}$ is nonnegative, monotone, and submodular; this implies that $\widehat{\mathrm{UC}}_{k_{\sigma}}$ is as well, as it is an instance of $\mathrm{UC}_{k_{\sigma}}$ with an empirical distribution for $\mathbf{x}$.

For $\mathbf{x},\mathbf{x}'\in\mathcal{X}$, we assumed that $U(\mathbf{x})\geq 0$ and $k_{\sigma}(\mathbf{x},\mathbf{x}')\geq 0$. Thus, for any subset $\mathcal{A}\subseteq\mathcal{X}$, $\mathrm{UC}_{k_{\sigma}}(\mathcal{A})\geq 0$.

Next, we show monotonicity: for all $\mathcal{A}\subseteq\mathcal{B}\subseteq\mathcal{X}$, $\mathrm{UC}_{k_{\sigma}}(\mathcal{A})\leq\mathrm{UC}_{k_{\sigma}}(\mathcal{B})$.

$$\begin{aligned}
\mathrm{UC}_{k_{\sigma}}(\mathcal{B})
&=\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x})\cdot\max_{\mathbf{x}'\in\mathcal{B}}k_{\sigma}(\mathbf{x},\mathbf{x}')\right] \\
&=\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x})\cdot\max\left(\max_{\mathbf{x}'\in\mathcal{B}\setminus\mathcal{A}}k_{\sigma}(\mathbf{x},\mathbf{x}'),\max_{\mathbf{x}'\in\mathcal{A}}k_{\sigma}(\mathbf{x},\mathbf{x}')\right)\right] \\
&\geq\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x})\cdot\max_{\mathbf{x}'\in\mathcal{A}}k_{\sigma}(\mathbf{x},\mathbf{x}')\right]=\mathrm{UC}_{k_{\sigma}}(\mathcal{A}).
\end{aligned}$$

Lastly, we show submodularity: for all $\mathcal{A}\subseteq\mathcal{B}\subseteq\mathcal{X}$,

$$\begin{aligned}
\mathrm{UC}_{k_{\sigma}}(\mathcal{A}\cup\{\tilde{\mathbf{x}}\})-\mathrm{UC}_{k_{\sigma}}(\mathcal{A})
&=\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x})\cdot\max\left(k_{\sigma}(\mathbf{x},\tilde{\mathbf{x}})-\max_{\mathbf{x}'\in\mathcal{A}}k_{\sigma}(\mathbf{x},\mathbf{x}'),0\right)\right] \\
&\geq\mathbb{E}_{\mathbf{x}}\left[U(\mathbf{x})\cdot\max\left(k_{\sigma}(\mathbf{x},\tilde{\mathbf{x}})-\max_{\mathbf{x}'\in\mathcal{B}}k_{\sigma}(\mathbf{x},\mathbf{x}'),0\right)\right] \\
&=\mathrm{UC}_{k_{\sigma}}(\mathcal{B}\cup\{\tilde{\mathbf{x}}\})-\mathrm{UC}_{k_{\sigma}}(\mathcal{B}).
\end{aligned}$$ ∎
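The greedy procedure whose guarantee is proved above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel with a fixed bandwidth, defines the coverage of an empty selection as zero, omits the parameter adaptation discussed in the main text, and uses hypothetical names and toy data.

```python
import math


def kernel(a, b, sigma=1.0):
    # Assumed Gaussian form for k_sigma.
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / sigma ** 2)


def coverage(feats, unc, selected, sigma=1.0):
    """Empirical uncertainty coverage; an empty selection is taken to cover nothing."""
    if not selected:
        return 0.0
    return sum(
        u * max(kernel(x, feats[s], sigma) for s in selected)
        for x, u in zip(feats, unc)
    ) / len(feats)


def greedy_select(feats, unc, budget, labeled=(), sigma=1.0):
    """Repeatedly add the pool point that maximizes the resulting coverage."""
    chosen = list(labeled)
    picks = []
    for _ in range(budget):
        candidates = [i for i in range(len(feats)) if i not in chosen]
        best = max(candidates,
                   key=lambda i: coverage(feats, unc, chosen + [i], sigma))
        chosen.append(best)
        picks.append(best)
    return picks


# Two tight clusters plus an isolated, low-uncertainty point (toy data).
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
unc = [0.5, 0.6, 0.9, 0.8, 0.1]
picks = greedy_select(feats, unc, budget=2)

# Marginal gain of adding point 0 to a smaller set A and to a superset B.
gain_small = coverage(feats, unc, [2, 0]) - coverage(feats, unc, [2])
gain_large = coverage(feats, unc, [2, 1, 0]) - coverage(feats, unc, [2, 1])
```

In this toy pool the greedy rule takes one point from each cluster before considering the isolated point, and the marginal gain of a candidate shrinks once a near-duplicate is already selected, which is the diminishing-returns behavior that Lemma 2 formalizes.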

A.4 Connections to Hybrid Methods

\wkmeans

*

Proof.

With the modification of the $k$-means objective into a greedy kernel $k$-medoids objective, the objective of weighted $k$-means with $U^{\prime}(\mathbf{x})$ as weights can be converted into:

\begin{align}
\mathbf{x}^{*}
&\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{X}}\mathbbm{1}[U^{\prime}(\tilde{\mathbf{x}})\geq\nu]\cdot\frac{1}{N}\sum_{n=1}^{N}U^{\prime}(\mathbf{x}_{n})\cdot\min_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\lVert\phi(\mathbf{x}_{n})-\phi(\mathbf{x}^{\prime})\rVert_{\mathcal{H}}^{2} \tag{5}\\
&=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{X}}\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\cdot\min_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\lVert\phi(\mathbf{x}_{n})-\phi(\mathbf{x}^{\prime})\rVert_{\mathcal{H}}^{2} \tag{6}\\
&=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{X}}\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n})\cdot\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime}). \tag{7}
\end{align}

This is equivalent to the objective of UHerding with $U(\mathbf{x})$ as the choice of uncertainty. ∎
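To make the final objective concrete, the greedy step in Equation 7 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the Gaussian kernel form, the function name `greedy_select`, the synthetic pool, and the random uncertainties are all hypothetical choices for the example.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Assumed kernel: exp(-||x - y||^2 / sigma^2), so k(x, x) = 1."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)

def greedy_select(pool, uncertainty, labeled, budget, sigma=1.0):
    """Greedily pick `budget` pool indices, each maximizing
    (1/N) * sum_n U(x_n) * max_{x' in L ∪ {candidate}} k(x_n, x')."""
    n = len(pool)
    selected = list(labeled)
    # best_cover[i] = largest kernel value between x_i and any selected point
    best_cover = [
        max((gaussian_kernel(pool[i], pool[j], sigma) for j in selected), default=0.0)
        for i in range(n)
    ]
    chosen = []
    for _ in range(budget):
        best_idx, best_val = None, -1.0
        for c in range(n):
            if c in selected:
                continue
            # Objective value if candidate c were added to the selected set
            val = sum(
                uncertainty[i] * max(best_cover[i], gaussian_kernel(pool[i], pool[c], sigma))
                for i in range(n)
            ) / n
            if val > best_val:
                best_idx, best_val = c, val
        selected.append(best_idx)
        chosen.append(best_idx)
        # Update coverage with the newly selected point
        best_cover = [
            max(best_cover[i], gaussian_kernel(pool[i], pool[best_idx], sigma))
            for i in range(n)
        ]
    return chosen

rng = random.Random(0)
pool = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
uncertainty = [rng.random() for _ in pool]
picked = greedy_select(pool, uncertainty, labeled=[0], budget=3)
print(picked)
```

Caching `best_cover` avoids recomputing the inner maximum over the whole selected set for every candidate; the sketch still recomputes kernel values rather than storing a kernel matrix, which keeps it short at the cost of speed.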

\alfamix

*

Proof.

ALFA-Mix selects the data points closest to the centers of $k$-means clusters, where $k$-means is fitted on filtered data points. It keeps a data point $\mathbf{x}$ if $\exists\,\text{class } j \text{ s.t. } \hat{y}\bigl(\alpha_{j}(\mathbf{x})\,g(\mathbf{x})+(1-\alpha_{j}(\mathbf{x}))\,\bar{g}_{j};f\bigr)\neq\hat{y}(g(\mathbf{x});f)$, which we can express as the indicator function in Equation 3.

With the replacement of $k$-means with a greedy kernel $k$-medoids objective, the objective of ALFA-Mix can be converted into:

\begin{align*}
\mathbf{x}^{*}\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{X}}\frac{1}{N}\sum_{n=1}^{N}U(\mathbf{x}_{n};f)\cdot\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime}).\qed
\end{align*}
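The filtering condition used in this proof, a prediction flip under feature interpolation, can be illustrated with a toy example. The sketch below is a simplification and not ALFA-Mix itself: it assumes a linear classifier as a stand-in for $f$, takes the class anchors $\bar{g}_{j}$ as given, and replaces the per-point mixing weight $\alpha_{j}(\mathbf{x})$ with a single constant `alpha`. All function names are hypothetical.

```python
def predict(features, weights):
    """Predicted class of an assumed linear classifier: argmax_j <w_j, z>."""
    scores = [sum(w * z for w, z in zip(row, features)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def flip_indicator(features, anchors, weights, alpha=0.8):
    """Return 1 if mixing g(x) with some class anchor changes the prediction, else 0.

    `alpha` is a constant stand-in for alpha_j(x); this is a simplifying assumption.
    """
    original = predict(features, weights)
    for anchor in anchors:
        mixed = [alpha * z + (1 - alpha) * a for z, a in zip(features, anchor)]
        if predict(mixed, weights) != original:
            return 1
    return 0

# Two classes in a 2-D feature space; class j scores coordinate j.
weights = [[1.0, 0.0], [0.0, 1.0]]
anchors = [(2.0, 0.0), (0.0, 2.0)]  # assumed per-class anchor features

near_boundary = (1.0, 0.9)  # close to the decision boundary z_0 = z_1
far_from_boundary = (5.0, 0.0)

print(flip_indicator(near_boundary, anchors, weights))
print(flip_indicator(far_from_boundary, anchors, weights))
```

Used as $U(\mathbf{x}_{n};f)$ in the objective above, such an indicator gives weight only to points whose prediction is unstable under interpolation.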
\badge

*

Proof.

Recall $q(\mathbf{x})=\hat{y}(\mathbf{x};f)-\hat{p}(\mathbf{x};f)$. As $\hat{p}(\mathbf{x};f)\rightarrow\frac{1}{K}\vec{1}$ for all $\mathbf{x}\in\mathcal{U}$, we have $\left\lVert q(\mathbf{x})\right\rVert_{2}^{2}\rightarrow 1-\frac{1}{K}$ and $\langle q(\mathbf{x}),q(\mathbf{x}^{\prime})\rangle\rightarrow\mathbbm{1}[\hat{y}(\mathbf{x};f)=\hat{y}(\mathbf{x}^{\prime};f)]-\frac{1}{K}$. With the assumption that $k_{\sigma}(\mathbf{x},\mathbf{x};g)=c$ for all $\mathbf{x}\in\mathcal{X}$,

\begin{align}
h(\mathbf{x}_{n},\mathbf{x}^{\prime})\rightarrow 2\left(\mathbbm{1}[\hat{y}(\mathbf{x}_{n};f)=\hat{y}(\mathbf{x}^{\prime};f)]-\frac{1}{K}\right)k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime};g)-2c\left(1-\frac{1}{K}\right). \tag{8}
\end{align}

Therefore, the following is true:

\begin{align}
\mathbf{x}^{*}
&\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\frac{1}{N}\sum_{n=1}^{N}\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}h(\mathbf{x}_{n},\mathbf{x}^{\prime}) \tag{9}\\
&=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\frac{1}{N}\sum_{n=1}^{N}\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\left(\mathbbm{1}[\hat{y}(\mathbf{x}_{n};f)=\hat{y}(\mathbf{x}^{\prime};f)]-\frac{1}{K}\right)k_{\sigma}(\mathbf{x}_{n},\mathbf{x}^{\prime};g). \tag{10}
\end{align}

If $\sigma\rightarrow 0$, then $k_{\sigma}(\mathbf{x},\mathbf{x}^{\prime};g)\rightarrow\mathbbm{1}[\mathbf{x}=\mathbf{x}^{\prime}]$ and $\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}h(\mathbf{x}_{n},\mathbf{x}^{\prime})\rightarrow\min_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\left\lVert q(\mathbf{x}_{n})\right\rVert_{2}^{2}+\left\lVert q(\mathbf{x}^{\prime})\right\rVert_{2}^{2}=\min_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\left\lVert\hat{y}(\mathbf{x}^{\prime};f)-\hat{p}(\mathbf{x}^{\prime};f)\right\rVert_{2}^{2}$. Then,

\begin{align}
\mathbf{x}^{*}
&\in\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\frac{1}{N}\sum_{n=1}^{N}\max_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}h(\mathbf{x}_{n},\mathbf{x}^{\prime}) \tag{11}\\
&=\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}\min_{\mathbf{x}^{\prime}\in\mathcal{L}\cup\{\tilde{\mathbf{x}}\}}\left\lVert\hat{y}(\mathbf{x}^{\prime};f)-\hat{p}(\mathbf{x}^{\prime};f)\right\rVert_{2}^{2} \tag{12}
\end{align}

Therefore, BADGE approaches $\operatorname*{arg\,max}_{\tilde{\mathbf{x}}\in\mathcal{U}}U^{\prime\prime}(\tilde{\mathbf{x}})$ as $\sigma\rightarrow 0$. ∎
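The two limits used at the start of this proof can be verified directly at the limit point, where $\hat{p}$ is exactly uniform. The sketch below is a small check with hypothetical helper names; it assumes $\hat{y}$ is a one-hot vector for the predicted class, and the choice $K=5$ is arbitrary.

```python
def q_vector(pred_class, probs):
    """q = one-hot(predicted class) - predicted probabilities."""
    return [(1.0 if j == pred_class else 0.0) - p for j, p in enumerate(probs)]

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

K = 5
uniform = [1.0 / K] * K  # the limit p_hat -> (1/K) * 1

max_error = 0.0
for a in range(K):
    for b in range(K):
        inner = dot(q_vector(a, uniform), q_vector(b, uniform))
        # Claimed limit: 1[y_hat(x) = y_hat(x')] - 1/K
        expected = (1.0 if a == b else 0.0) - 1.0 / K
        max_error = max(max_error, abs(inner - expected))

print("max deviation from the claimed limits:", max_error)
```

The case `a == b` covers the squared norm, since $\lVert q(\mathbf{x})\rVert_{2}^{2}=\langle q(\mathbf{x}),q(\mathbf{x})\rangle$, which gives $1-\frac{1}{K}$.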

Figure 7: Comparison on CIFAR10 for supervised-learning tasks.

Appendix B Additional Comparison with State of the Art

In addition to Figure 4, which compares state-of-the-art active learning methods on the CIFAR100 and TinyImageNet datasets for supervised learning tasks, Figure 7 provides additional results on the CIFAR10 dataset. Again, we employ a ResNet18 that is randomly initialized at each iteration.

As on CIFAR100 and TinyImageNet, UHerding significantly outperforms representation-based methods in high-budget regimes and uncertainty-based methods in low- and mid-budget regimes, confirming the robustness of UHerding relative to other active learning methods.

Appendix C Component analysis of UHerding

In this ablation study, we examine how each component of UHerding contributes to its overall generalization performance. Specifically, we add temperature scaling (Temp. Scaling) and adaptive radius (Radius Adap.) individually to the UHerding baseline, which selects data points based solely on the UHerding acquisition defined in Equation 2. We also evaluate the combined effect of both components. As in Section 4.1, we use the CIFAR-10 dataset and report $\Delta$Acc relative to Random selection to clearly highlight the differences.

Although temperature scaling enhances performance over the UHerding baseline, its performance is still comparable to MaxHerding in the low-budget regime, and worse than Margin in the high-budget regime. Radius adaptation generally matches or exceeds MaxHerding across budget regimes but quickly converges to Margin’s performance in the high-budget regime. When both temperature scaling and radius adaptation are applied, performance surpasses both MaxHerding and Margin across all budget regimes, with saturation in the high-budget regime occurring significantly more gradually than with radius adaptation alone. Please note that Margin’s performance in the low- and mid-budget regimes is not visible due to its exceptionally poor results relative to other methods.

(a) Comparison with and without components of parameter adaptation.
(b) Clustering methods
Figure 8: Comparison of some components in UHerding (left) and clustering methods (right).

Appendix D Comparison of Clustering Methods

As noted in Section 2, we compare several clustering methods applied to BADGE (Ash et al., 2020) to justify replacing the $k$-means and $k$-means++ steps of existing active learning methods with MaxHerding. Figure 8(b) compares $k$-means, $k$-means++, $k$-medoids (iterative optimization), partitioning around medoids (PAM), and MaxHerding, a greedy kernel $k$-medoids method. We train a ResNet18, randomly initialized at each iteration, on CIFAR100.

Surprisingly, $k$-medoids performs slightly worse than $k$-means, showing that searching for medoids is not necessarily better than searching for means. However, the gap between PAM and $k$-medoids shows that the choice of optimization method makes a significant difference. Although PAM works best overall, MaxHerding is comparable with much less computation (Bae et al., 2024). This justifies replacing $k$-means and $k$-means++ with MaxHerding in existing active learning methods.