
Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Qian Shao^{1,3,*}, Jiangrui Kang^{2,*}, Qiyuan Chen^{1,3,*}, Zepeng Li^{4}, Hongxia Xu^{1,3},
Yiwen Cao^{2}, Jiajuan Liang^{2,†}, and Jian Wu^{1,†}
^{1}College of Computer Science & Technology and Liangzhu Laboratory, Zhejiang University
^{2}BNU-HKBU United International College  ^{3}WeDoctor Cloud
^{4}The State Key Laboratory of Blockchain and Data Security, Zhejiang University
{qianshao, qiyuanchen, lizepeng, einstein, wujian2000}@zju.edu.cn
{kangjiangrui, yiwencao, jiajuanliang}@uic.edu.cn
^{*}These authors contributed equally to this work. ^{†}Corresponding authors.
Abstract

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, as it reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how samples are selected for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has long been under-explored. To fill this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion, α-Maximum Mean Discrepancy (α-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing α-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets. Our code is available at RDSS.

1 Introduction

Semi-Supervised Learning (SSL) is a popular paradigm which reduces reliance on large amounts of labeled data in many deep learning tasks [40, 35, 59]. Previous SSL research mainly focuses on effectively utilising labelled and unlabeled data. Specifically, labelled data directly supervise model learning, while unlabeled data help learn a desirable model that makes consistent and unambiguous predictions [53]. We further find that how samples are selected for annotation also greatly affects model performance, particularly under extremely low-budget settings (see Section 7.2).

The prevailing sample selection methods in SSL have many shortcomings. For example, random sampling may introduce imbalanced class distributions and inadequate coverage of the overall data distribution, resulting in poor performance. Stratified sampling randomly selects samples within each class, which is impractical in real-world scenarios where the label for each sample is unknown. Existing researchers also employ representativeness and diversity strategies to select appropriate samples for annotation. Representativeness [13] ensures that the selected subset is distributed similarly to the entire dataset, and diversity [54] is designed to select informative samples by pushing them apart in feature space. Focusing on only one aspect, however, presents significant limitations (Figure 1a and 1b). To address these issues, Xie et al. [57] and Wang et al. [50] employ a combination of the two strategies for sample selection. These methods fix the ratio between representativeness and diversity, which our empirical evidence shows limits the ultimate performance (see Section 7.4). Fundamentally, they lack a theoretical basis to substantiate their effectiveness.

Figure 1: Visualization of selected samples from a dog dataset. The red and grey circles symbolize the selected and unselected samples, respectively. a) The selected samples often contain an excessive number of highly similar instances, leading to redundancy; b) the selected samples contain too many edge points and fail to cover the entire dataset; c) the selected samples represent the entire dataset comprehensively and accurately.

We observe that Active Learning (AL) primarily focuses on selecting the right samples for annotation, and numerous studies transfer the sample selection methods of AL into SSL, giving rise to Semi-Supervised Active Learning (SSAL) [51]. However, most of these approaches exhibit several limitations: (1) They require randomly selected samples to begin with, which expends a portion of the labelling budget, making it difficult to work effectively with a very limited budget (e.g., 1% or even lower) [6]; (2) They involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead [57]; (3) They are coupled with the model training so that samples for annotation need to be re-selected every time a model is trained [50]. In summary, selecting the appropriate samples for annotation is challenging in SSL.

To address these challenges, we propose a Representative and Diverse Sample Selection approach (RDSS) that requests annotations only once and operates independently of the downstream tasks. Specifically, inspired by the concept of Maximum Mean Discrepancy (MMD) [14], we design a novel criterion named α-MMD. It aims to strike a balance between representativeness and diversity via a trade-off parameter α (Figure 1c), for which we find an optimal interval adapted to different budgets. By using a modified Frank-Wolfe algorithm called Generalized Kernel Herding without Replacement (GKHR), we can obtain an efficient approximate solution to this minimization problem.

We prove that under certain Reproducing Kernel Hilbert Space (RKHS) assumptions, α-MMD effectively bounds the difference between training with a constrained versus an unlimited labelling budget. This implies that our proposed method could significantly enhance the generalization ability of learning with limited labels. We also give a theoretical assessment of GKHR with supplementary numerical experiments, showing that GKHR performs well in learning with limited labels.

Furthermore, we evaluate our proposed RDSS across several popular SSL frameworks on the datasets CIFAR-10/100 [19], SVHN [30], STL-10 [9] and ImageNet [10]. Extensive experiments show that RDSS outperforms other sample selection methods widely used in SSL, AL or SSAL, especially with a constrained annotation budget. Besides, ablation results demonstrate that RDSS outperforms methods using a fixed ratio.

The main contributions of this article are as follows:

  • We propose RDSS, which selects representative and diverse samples for annotation to enhance SSL by minimizing a novel criterion, α-MMD. Under low-budget settings, we develop a fast and efficient algorithm, GKHR, for optimization.

  • We prove that our method benefits the generalizability of the trained model under certain assumptions and rigorously establish an optimal interval for the trade-off parameter α adapted to different budgets.

  • We compare RDSS with sample selection strategies widely used in SSL, AL or SSAL, and the results demonstrate superior sample efficiency over these strategies. In addition, we conduct ablation experiments to verify our method’s superiority over the fixed-ratio approach.

2 Related Work

Semi-Supervised Learning

Semi-Supervised Learning (SSL) effectively utilizes sparse labeled data and abundant unlabeled data for model training. Consistency Regularization [34, 20, 45], Pseudo-Labeling [21, 56] and their hybrid strategies [40, 63, 35] are commonly used in SSL. Consistency Regularization ensures the model’s output stays stable under noise or small perturbations of the input, usually induced by data augmentation [55]. Pseudo-Labeling integrates high-confidence pseudo-labels directly into training, adhering to entropy minimization [23]. Moreover, an integrative approach that combines the aforementioned strategies can also achieve substantial results [53, 59]. Even though these approaches have been proven effective, they usually assume that labelled samples are randomly selected from each class (i.e., stratified sampling), which is not practical in real-world scenarios where the label for each sample is unknown.

Active Learning

Active Learning (AL) aims to optimize the learning process by selecting the appropriate samples for labelling, reducing reliance on large labelled datasets. There are two different criteria for sample selection: uncertainty and representativeness. Uncertainty sampling selects samples about which the current model is most uncertain. Earlier studies utilized posterior probability [22, 49], entropy [18, 26], and classification margin [47] to estimate uncertainty. Recent research regards uncertainty as training loss [17, 60], influence on model performance [11, 24] or prediction discrepancies between multiple classifiers [8]. However, uncertainty sampling methods may exhibit performance disparities across different models, leading researchers to focus on representativeness sampling, which aims to align the distribution of the selected subset with that of the entire dataset [36, 39, 27]. Most AL approaches struggle to perform well under extremely low-label settings. This may be because they usually require randomly selected samples to begin with and involve human annotators in iterative cycles of labelling and training, leading to substantial labelling overhead.

Model-Free Subsampling

Subsampling is a statistical approach which selects a subset of size $m$ as a surrogate for the full dataset of size $n\gg m$. Model-based subsampling methods depend heavily on model assumptions [1, 61], and an improper choice of model can lead to poor estimation and prediction. In that case, model-free subsampling is preferred in data-driven modelling tasks, as it does not depend on model assumptions. There are two main kinds of popular model-free subsampling methods. One minimizes statistical discrepancies, forcing the distribution of the subset to be similar to that of the full data (in other words, selecting representative subsamples), such as the Wasserstein distance [13], energy distance [28], uniform design [65], maximum mean discrepancy [7] and generalized empirical $F$-discrepancy [66]. The other tends to select a diverse subset containing as many informative samples as possible [54]. The above-mentioned methodologies focus exclusively on either representativeness or diversity, making them difficult to apply effectively to SSL.

3 Problem Setup

Let $\mathcal{X}$ be the unlabeled data space, $\mathcal{Y}$ be the label space, $\mathbf{X}_{n}=\{\mathbf{x}_{i}\}_{i\in[n]}\subset\mathcal{X}$ be the full unlabeled dataset and $\mathcal{I}_{m}=\{i_{1},i_{2},\cdots,i_{m}\}\subset[n]$ ($m<n$) be an index set contained in $[n]$. Our goal is to find an index set $\mathcal{I}^{*}_{m}=\{i^{*}_{1},i^{*}_{2},\cdots,i^{*}_{m}\}\subset[n]$ such that the selected set of samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\mathbf{x}_{i^{*}_{2}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$ is the most informative. After that, we can access the true labels of the selected samples and use the set of labelled data $S=\{(\mathbf{x}_{i},y_{i})\}_{i\in\mathcal{I}^{*}_{m}}$, together with the rest of the unlabeled data, to train a deep learning model.

Following the methodology of previous works, we use representativeness and diversity as criteria for evaluating the informativeness of selected samples. Representativeness ensures the selected samples are distributed similarly to the full unlabeled dataset. Diversity is proposed to prevent an excessive concentration of selected samples in high-density areas of the full unlabeled dataset. Furthermore, the cluster assumption in SSL suggests that the data tend to form discrete clusters, in which boundary points are likely to be located in low-density areas. Therefore, under this assumption, a diverse selection contains more boundary points than a non-diversified one, which is desirable for training classifiers.

As a result, our goal can be formulated by solving the following problem:

\max_{\mathcal{I}_{m}\subset[n]}\ \text{Rep}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\lambda\,\text{Div}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n}), (1)

where $\text{Rep}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ and $\text{Div}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ quantify the representativeness and diversity of the selected samples, respectively, and $\lambda$ is a hyperparameter balancing the trade-off between representativeness and diversity.

Besides, we propose two further settings which benefit the implementation of the framework: (1) Low-budget learning. For many real-world tasks requiring sample selection, the budget is relatively low compared to the size of the unlabeled data. Therefore, we set $m/n\leq 0.2$ by default in the following context, including the analysis of the sampling algorithm and the experiments; (2) Sampling without replacement. Compared with sampling with replacement, sampling without replacement offers several benefits that better match our tasks, including bias and variance reduction, increased precision and enhanced representativeness [25, 46].

4 Representative and Diverse Sample Selection

The Representative and Diverse Sample Selection (RDSS) framework consists of two steps: (1) Quantification. We quantify the representativeness and diversity of selected samples by a novel concept called α-MMD (6), where $\lambda$ is replaced by $\alpha$ as the trade-off hyperparameter; (2) Optimization. We optimize α-MMD with the GKHR algorithm to obtain the optimally selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}$.

4.1 Quantification of Diversity and Representativeness

In classical statistics and machine learning problems, the inner product of data points $\mathbf{x},\mathbf{y}\in\mathcal{X}$, defined by $\langle\mathbf{x},\mathbf{y}\rangle$, is employed as a similarity measure between $\mathbf{x}$ and $\mathbf{y}$. However, such linear functions can be very restrictive in real-world problems. In contrast, kernel methods use kernel functions $k(\mathbf{x},\mathbf{y})$, including Gaussian (RBF), Laplacian and polynomial kernels, as non-linear similarity measures between $\mathbf{x}$ and $\mathbf{y}$; these are in fact inner products of the projections of $\mathbf{x}$ and $\mathbf{y}$ into some high-dimensional feature space [29].

Let $k(\cdot,\cdot)$ be a kernel function on $\mathcal{X}\times\mathcal{X}$. We employ $k(\cdot,\cdot)$ to measure the similarity between any two points, and the average similarity, denoted by

S_{k}(\mathbf{X}_{\mathcal{I}_{m}})=\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right), (2)

to measure the similarity between the selected samples. Obviously, $S_{k}(\mathbf{X}_{\mathcal{I}_{m}})$ can evaluate the diversity of $\mathbf{X}_{\mathcal{I}_{m}}$, since larger similarity implies smaller diversity.

As a statistical discrepancy which measures the distance between distributions, the maximum mean discrepancy (MMD) is introduced here to quantify the representativeness of $\mathbf{X}_{\mathcal{I}_{m}}$ with respect to $\mathbf{X}_{n}$. Proposed by Gretton et al. [14], MMD is formally defined below:

Definition 4.1 (Maximum Mean Discrepancy).

Let $P,Q$ be two Borel probability measures on $\mathcal{X}$. Suppose $f$ is taken from the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with its reproducing kernel $k(\cdot,\cdot)$, i.e., $\|f\|_{\mathcal{H}}\leq 1$; then the MMD between $P$ and $Q$ is defined by

\operatorname{MMD}_{k}^{2}(P,Q):=\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\int f\,dP-\int f\,dQ\right)^{2}=\mathbb{E}\left[k\left(X,X^{\prime}\right)+k\left(Y,Y^{\prime}\right)-2k(X,Y)\right], (3)

where $X,X^{\prime}\sim P$ and $Y,Y^{\prime}\sim Q$ are independent copies.

We can then derive the empirical version of MMD, which measures the representativeness of $\mathbf{X}_{\mathcal{I}_{m}}=\{\mathbf{x}_{i}\}_{i\in\mathcal{I}_{m}}$ relative to $\mathbf{X}_{n}=\{\mathbf{x}_{i}\}_{i=1}^{n}$, by replacing $P,Q$ in (3) with the empirical distributions constructed from $\mathbf{X}_{\mathcal{I}_{m}}$ and $\mathbf{X}_{n}$:

\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n}):=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{2}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right). (4)
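As a concrete illustration (ours, not the authors' released code), the empirical squared MMD of (4) can be computed directly in NumPy, here assuming a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix with entries k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def mmd2(X_sub, X_full, sigma=1.0):
    """Empirical squared MMD between a selected subset and the full dataset, eq. (4)."""
    m, n = len(X_sub), len(X_full)
    term_full = gaussian_kernel(X_full, X_full, sigma).sum() / n**2
    term_sub = gaussian_kernel(X_sub, X_sub, sigma).sum() / m**2
    term_cross = gaussian_kernel(X_full, X_sub, sigma).sum() / (m * n)
    return term_full + term_sub - 2.0 * term_cross
```

Taking the subset equal to the full dataset gives a discrepancy of zero, and for a positive definite kernel the quantity is non-negative; for large $n$ one would compute the Gram matrix blockwise rather than materializing all pairwise distances at once.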

Optimization objective. Set $\text{Rep}(\cdot,\cdot)=-\operatorname{MMD}_{k}^{2}(\cdot,\cdot)$ and $\text{Div}(\cdot)=-S_{k}(\cdot)$ in (1), where $k$ is a proper kernel function; our optimization objective then becomes

\min_{\mathcal{I}_{m}\subset[n]}\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\lambda S_{k}(\mathbf{X}_{\mathcal{I}_{m}}). (5)

Set $\lambda=\frac{1-\alpha}{\alpha m}$. Since $\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)$ is a constant, the objective function in (5) can be rewritten as

\alpha\operatorname{MMD}_{k}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})+\frac{1-\alpha}{m}S_{k}(\mathbf{X}_{\mathcal{I}_{m}})+\frac{\alpha(\alpha-1)}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right) (6)
= \frac{\alpha^{2}}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{m^{2}}\sum_{i\in\mathcal{I}_{m}}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{2\alpha}{mn}\sum_{i=1}^{n}\sum_{j\in\mathcal{I}_{m}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)
= \sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}f(\mathbf{x}_{i})-\frac{\alpha}{n}\sum_{j=1}^{n}f(\mathbf{x}_{j})\right)^{2},

which defines a new concept called α-MMD, denoted by $\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$. This concept distinguishes our method from existing ones and is essential for developing the sampling algorithm and the theoretical analysis. Note that α-MMD degenerates to the classical MMD when $\alpha=1$ and to the average similarity when $\alpha=0$. As $\alpha$ decreases, $\lambda$ increases, thereby encouraging diversity in sample selection.
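Following the expanded form in (6), the squared α-MMD is a one-line modification of the empirical MMD. The sketch below (our illustration, assuming a Gaussian kernel) makes the two degenerate cases explicit:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix with entries k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def alpha_mmd2(X_sub, X_full, alpha, sigma=1.0):
    """Squared alpha-MMD in the expanded form of eq. (6):
    alpha = 1 recovers the classical empirical MMD^2 of eq. (4),
    alpha = 0 leaves only the average-similarity term S_k of eq. (2)."""
    m, n = len(X_sub), len(X_full)
    term_full = gaussian_kernel(X_full, X_full, sigma).sum() / n**2
    term_sub = gaussian_kernel(X_sub, X_sub, sigma).sum() / m**2
    term_cross = gaussian_kernel(X_full, X_sub, sigma).sum() / (m * n)
    return alpha**2 * term_full + term_sub - 2.0 * alpha * term_cross
```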

Remark 1. In the following context, all kernels are assumed to be characteristic and positive definite unless otherwise specified. The following illustrates the advantages of these two properties.

Characteristic kernels. The MMD is in general only a pseudo-metric on the space of all Borel probability distributions, meaning the MMD between two different distributions can be zero. Nevertheless, MMD becomes a proper metric when $k$ is a characteristic kernel, i.e., when the mean embedding $P\rightarrow\int_{\mathcal{X}}k(\cdot,\mathbf{x})\,dP$ is injective over Borel probability distributions $P$ on $\mathcal{X}$ [29]. Therefore, MMD induced by characteristic kernels is more appropriate for measuring representativeness.

Positive definite kernels. Aronszajn [2] showed that every positive definite kernel $k(\cdot,\cdot)$, i.e., one whose Gram matrix is always positive definite and symmetric, uniquely determines an RKHS $\mathcal{H}$, and vice versa. This property is not only important for evaluating the properties of MMD [43] but also required for optimizing MMD [32] by the Frank-Wolfe algorithm.

4.2 Sampling Algorithm

In previous research [36, 27, 50, 38, 58], sample selection is usually modelled as a non-convex combinatorial optimization problem. In contrast, following the idea of [4], we regard $\min_{\mathcal{I}_{m}\subset[n]}\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})$ as a convex optimization problem by exploiting the convexity of α-MMD, and then solve it by a fast iterative minimization procedure derived from the Frank-Wolfe algorithm (see Appendix A for derivation details):

\mathbf{x}_{i^{*}_{p+1}}\in\mathop{\arg\min}_{i\in[n]}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}),\quad\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p+1}\},\quad\mathcal{I}^{*}_{0}=\emptyset, (7)

where $f_{\mathcal{I}_{p}}(\mathbf{x}_{i})=\sum_{j\in\mathcal{I}_{p}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\alpha p\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l})/n$. As an extension of kernel herding [7], the corresponding algorithm (see Algorithm 2) is called Generalized Kernel Herding (GKH). Note that $f_{\mathcal{I}_{p}}(\mathbf{x}_{i})$ is updated iteratively in Algorithm 2, which saves substantial running time. However, GKH can select repeated samples, contradicting the setting of sampling without replacement. To address this issue, we propose a modified iterating formula based on (7):

\mathbf{x}_{i^{*}_{p+1}}\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}),\quad\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p+1}\},\quad\mathcal{I}^{*}_{0}=\emptyset, (8)

which admits no repetition among the selected samples. The corresponding algorithm (see Algorithm 1) is thereby named Generalized Kernel Herding without Replacement (GKHR) and is employed as the sampling algorithm for RDSS.

Algorithm 1 Generalized Kernel Herding without Replacement
Input: dataset $\mathbf{X}_{n}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\}\subset\mathcal{X}$; number of selected samples $m<n$; a positive definite, characteristic and radial kernel $k(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$; trade-off parameter $\alpha\leq 1$.
Output: selected samples $\mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\cdots,\mathbf{x}_{i^{*}_{m}}\}$.
1:  For each $\mathbf{x}_{i}\in\mathbf{X}_{n}$, calculate $\mu(\mathbf{x}_{i}):=\sum_{j=1}^{n}k(\mathbf{x}_{j},\mathbf{x}_{i})/n$.
2:  Set $\beta_{1}=1$, $S_{0}=0$, $\mathcal{I}^{*}_{1}=\emptyset$.
3:  for $p\in\{1,\cdots,m\}$ do
4:     $i^{*}_{p}\in\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}S_{p-1}(\mathbf{x}_{i})-\alpha\mu(\mathbf{x}_{i})$
5:     For all $i\in[n]\backslash\mathcal{I}^{*}_{p}$, update $S_{p}(\mathbf{x}_{i})=(1-\beta_{p})S_{p-1}(\mathbf{x}_{i})+\beta_{p}k(\mathbf{x}_{i^{*}_{p}},\mathbf{x}_{i})$
6:     $\mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p}\}$, $p\leftarrow p+1$, set $\beta_{p}=1/p$.
7:  end for

Computational complexity. Aside from the cost of evaluating kernel functions, the computational complexity of GKHR is $O(mn)$: in each iteration, the steps in lines 4 and 5 of Algorithm 1 each require $O(n)$ computations. Note that GKH has the same order of computational complexity as GKHR.
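A minimal NumPy sketch of Algorithm 1 follows (our rendering; the Gaussian kernel and variable names are illustrative). The array `S` holds the running average of $k(\mathbf{x}_{i^{*}_{j}},\cdot)$ over the samples selected so far, so line 4's criterion is simply `S - alpha * mu`:

```python
import numpy as np

def gkhr(X, m, alpha, sigma=1.0):
    """Generalized Kernel Herding without Replacement (Algorithm 1 sketch)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / sigma**2)      # Gram matrix of the Gaussian kernel
    mu = K.mean(axis=0)             # mu(x_i) = (1/n) * sum_j k(x_j, x_i)
    S = np.zeros(n)                 # running average of k(x_{i*}, x_i), S_0 = 0
    selected = []
    for p in range(1, m + 1):
        score = S - alpha * mu
        score[selected] = np.inf    # exclude already-selected indices (no replacement)
        i_star = int(np.argmin(score))
        selected.append(i_star)
        S = (1 - 1.0 / p) * S + (1.0 / p) * K[i_star]   # beta_p = 1/p update
    return selected
```

Each iteration costs $O(n)$ once the Gram matrix is available, matching the $O(mn)$ complexity stated above; for large $n$, rows of `K` can be computed on the fly instead of storing the full matrix.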

5 Theoretical Analysis

5.1 Generalization Bounds

Recall the core-set approach in [36], i.e., for any $h\in\mathcal{H}$,

R(h)\leq\widehat{R}_{S}(h)+|R(h)-\widehat{R}_{T}(h)|+|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|,

where $T$ is the full labelled dataset, $S\subset T$ is the core set, $R(h)$ is the expected risk of $h$, and $\widehat{R}_{T}(h),\widehat{R}_{S}(h)$ are the empirical risks of $h$ on $T,S$. The first term $\widehat{R}_{S}(h)$ is unknown before we label the selected samples, and the second term $|R(h)-\widehat{R}_{T}(h)|$ can be upper bounded by so-called generalization bounds [3, 64], which do not depend on the choice of core set. Therefore, to control the upper bound of $R(h)$, we only need to analyse the upper bound of the third term $|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|$, called the core-set loss, which requires several mild assumptions. Shalit et al. [37] derived an MMD-type upper bound for $|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|$ to estimate individual treatment effects, while our bound generalizes to a wider range of tasks.

Let $\mathcal{H}_{1}=\{h\,|\,h:\mathcal{X}\rightarrow\mathcal{Y}\}$ be a hypothesis set from which we select a predictor, and suppose the labelled data $T=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples of a random vector $(X,Y)$ defined on $\mathcal{X}\times\mathcal{Y}$. We first assume that $\mathcal{H}_{1}$ is an RKHS, which is a mild assumption in machine learning theory [3, 5].

Assumption 5.1.

$\mathcal{H}_{1}$ is an RKHS associated with a bounded positive definite kernel $k_{1}$, where the norm of any $h\in\mathcal{H}_{1}$ is bounded by $K_{h}$.

We further make RKHS assumptions on the functional spaces of $\mathbb{E}(Y|X)$ and $\operatorname{Var}(Y|X)$, which are fundamental in the field of conditional distribution embedding [41, 43].

Assumption 5.2.

There is an RKHS $\mathcal{H}_{2}$ associated with a bounded positive definite kernel $k_{2}$ such that $\mathbb{E}(Y|X)\in\mathcal{H}_{2}$, and the norm of any $\mathbb{E}(Y|X)$ is bounded by $K_{m}$.

Assumption 5.3.

There is an RKHS $\mathcal{H}_{3}$ associated with a bounded positive definite kernel $k_{3}$ such that $\operatorname{Var}(Y|X)\in\mathcal{H}_{3}$, and the norm of any $\operatorname{Var}(Y|X)$ is bounded by $K_{s}$.

We next give an α-MMD-type upper bound for the core-set loss in the following theorem:

Theorem 5.4.

Take $k=k_{1}^{2}+k_{1}k_{2}+k_{3}$; then under Assumptions 5.1–5.3, for any selected samples $S\subset T$, there exists a positive constant $K_{c}$ such that the following inequality holds:

|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)|\leq K_{c}\left(\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{S},\mathbf{X}_{T})+(1-\alpha)\sqrt{K}\right)^{2},

where $0\leq\alpha\leq 1$, $K=\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\geq 0$, and $\mathbf{X}_{S},\mathbf{X}_{T}$ are the projections of $S,T$ onto $\mathcal{X}$.

Therefore, minimizing α-MMD tightens the generalization bound on $R(h)$ and benefits the generalizability of the trained model (predictor).

5.2 Finite-Sample-Error-Bound for GKHR

The concept of convergence does not apply to the analysis of GKHR: with $n$ fixed, GKHR iterates at most $n$ times and then returns $\mathbf{X}_{\mathcal{I}^{*}_{n}}=\mathbf{X}_{n}$. Consequently, we analyze the performance of GKHR through its finite-sample error bound. Before that, we make an assumption on the mean of $f_{\mathcal{I}^{*}_{p}}$ over the full unlabeled dataset.

Assumption 5.5.

For any $\mathcal{I}^{*}_{p}$ returned by GKHR, $1\leq p\leq m-1$, there exist $p+1$ elements $\{\mathbf{x}_{j_{l}}\}_{l=1}^{p+1}$ in $\mathbf{X}_{n}$ such that

f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{1}})\leq\cdots\leq f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{p+1}})\leq\frac{1}{n}\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}).

When $m$ is not relatively small, this assumption is rather unrealistic. Nevertheless, under our low-budget setting, especially when $m\ll n$, the assumption becomes an extension of the principle that "the minimum is never larger than the mean", and thus remains plausible. We can then show that the optimization error of GKHR decays at a rate upper bounded by $O(\log m/m)$:

Theorem 5.6.

Let $\mathbf{X}_{\mathcal{I}^{*}_{m}}$ be the samples selected by GKHR. Under Assumption 5.5, it holds that

\operatorname{MMD}^{2}_{k,\alpha}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq C^{2}_{\alpha}+B\,\frac{2+\log m}{m+1}, (9)

where $B=2K$, $K=\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\geq 0$, and $C^{2}_{\alpha}=(1-\alpha)^{2}\overline{K}$, with $\overline{K}$ defined in Lemma B.6.

6 Choice of Kernel and Hyperparameter Tuning

In this section, we make some suggestions for choosing the kernel and tuning the hyperparameter $\alpha$.

Choice of kernel. Recall Remark 1 in Section 4.1: we only consider characteristic and positive definite kernels in RDSS. Since Gaussian kernels are the most commonly used kernels in machine learning and statistics [3, 15], we adopt the Gaussian kernel, defined by $k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/\sigma^{2})$. The bandwidth parameter $\sigma$ is set to the median distance between samples in the aggregate dataset [15], i.e., $\sigma=\operatorname{Median}(\{\|\mathbf{x}-\mathbf{y}\|_{2}\,|\,\mathbf{x},\mathbf{y}\in\mathbf{X}_{n}\})$, since the median is robust and compromises between extreme cases.
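The median heuristic above can be sketched as follows (our illustration; for large $n$ one would estimate the median from a random subsample of pairs rather than all of them):

```python
import numpy as np

def median_bandwidth(X):
    """Median pairwise Euclidean distance over the dataset (median heuristic)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(X), k=1)      # distinct pairs only
    return float(np.median(np.sqrt(d2[iu])))
```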

Tuning the trade-off hyperparameter $\alpha$. According to Theorem 5.6 and Lemma B.3, straightforward deduction gives

\operatorname{MMD}_{k}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq C_{\alpha}+\mathcal{O}\left(\sqrt{\frac{\log m}{m}}\right)+(1-\alpha)\sqrt{K},

which upper bounds the MMD between the selected samples and the full dataset under a low-budget setting. Setting $\alpha\in[1-\frac{1}{\sqrt{m}},1)$ ensures that the upper bound on the MMD is no larger, in order of magnitude, than that of the α-MMD.
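As a quick numerical check (ours, not from the paper), the lower end of this interval for a given budget is:

```python
import math

def alpha_lower(m):
    """Lower end of the suggested interval [1 - 1/sqrt(m), 1) for alpha,
    given a labelling budget of m samples."""
    return 1.0 - 1.0 / math.sqrt(m)
```

Budgets of 40, 250 and 4000 give lower ends of roughly 0.84, 0.94 and 0.98, so larger budgets push $\alpha$ toward 1 and weight representativeness more heavily.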

7 Experiments

In this section, we first explain the implementation details of our method RDSS in Section 7.1. Next, we compare RDSS with other sampling methods by integrating them into two state-of-the-art (SOTA) SSL approaches (FlexMatch [63] and FreeMatch [53]) on five datasets (CIFAR-10/100, SVHN, STL-10 and ImageNet-1k) in Section 7.2. The details of the datasets, the visualization results and the computational complexity of different sampling methods are given in Appendices D.2, D.3 and D.4, respectively. We also compare against various AL/SSAL approaches in Section 7.3. Lastly, we quantitatively analyse the trade-off parameter $\alpha$ in Section 7.4.

7.1 Implementation Details of Our Method

First, we leverage the pre-trained image feature extraction capabilities of CLIP [33], a vision transformer architecture, to extract features. The [CLS] token features produced by the model’s final output are then used for sample selection. During the sample selection phase, the Gaussian kernel is chosen to compute the similarity of samples in an infinite-dimensional feature space; its bandwidth $\sigma$ is set as explained in Section 6. To ensure diversity in the sampled data, we set the trade-off parameter $\alpha=1-\frac{1}{\sqrt{m}}$, where $m$ denotes the number of selected samples. Concretely, we set $m\in\{40,250,4000\}$ for CIFAR-10, $m\in\{400,2500,10000\}$ for CIFAR-100, $m\in\{250,1000\}$ for SVHN, $m\in\{40,250\}$ for STL-10 and $m=100000$ for ImageNet. Next, the selected samples are used in two SSL approaches, which are trained and evaluated on the datasets using the Unified SSL Benchmark (USB) codebase [52]. The optimizer for all experiments is standard stochastic gradient descent (SGD) with a momentum of 0.9 [44]. The initial learning rate is 0.03 with a learning rate decay of 0.0005. We use ResNet-50 [16] for the ImageNet experiment and Wide ResNet-28-2 [62] for the other datasets. Finally, we evaluate performance with the Top-1 classification accuracy on the test set. Experiments are run on 8 NVIDIA Tesla A100 (40 GB) GPUs and 2 Intel 6248R 24-core processors. We average our results over five independent runs.

7.2 Comparison with Other Sampling Methods

Main results We apply RDSS to FlexMatch and FreeMatch and compare it with three baselines and two SOTA sample selection methods in SSL under different annotation budget settings. The baselines include Stratified, Random and k-Means sampling, while the two SOTA methods are USL [50] and ActiveFT [57]. The results are shown in Table 1, from which we make several observations: (1) Our proposed RDSS achieves the highest accuracy, outperforming the other sampling methods, which underscores the effectiveness of our approach; (2) USL attains suboptimal results under most budget settings yet exhibits a significant gap compared to RDSS, particularly under severely constrained ones. For instance, FreeMatch achieves a 4.95% gain on STL-10 with a budget of 40; (3) In most experiments, RDSS either approaches or surpasses the performance of stratified sampling, especially on SVHN and STL-10. However, stratified sampling is practically infeasible given that the category labels of the data are not known a priori.

Results on ImageNet We also compare the second-best method, USL, with RDSS on ImageNet. Following the settings of FreeMatch [53], we select 100k samples for annotation. FreeMatch achieves 58.24% and 56.86% accuracy with RDSS and USL as the sampling method, respectively, demonstrating a substantial improvement of our method over USL.

Table 1: Comparison with other sampling methods. Because stratified sampling is practically infeasible, its results are marked in grey. The best and second-best performances are bolded and underlined, respectively, excluding stratified sampling. Metrics are mean accuracy and standard deviation over five independent runs.
Dataset CIFAR-10 CIFAR-100 SVHN STL-10
Budget 40 250 4000 400 2500 10000 250 1000 40 250
Applied to FlexMatch [63]
Stratified 91.45±3.41 95.10±0.25 95.63±0.24 50.23±0.41 67.38±0.45 73.61±0.43 89.60±1.86 93.66±0.49 75.33±3.74 92.29±0.64
Random 87.30±4.61 93.95±0.91 95.17±0.59 45.58±0.97 66.48±0.98 72.61±0.83 87.67±1.16 94.06±1.14 65.81±1.21 90.70±0.79
k-Means 81.23±8.71 94.59±0.51 95.09±0.65 41.60±1.24 65.99±0.57 71.53±0.42 90.28±0.69 93.82±1.04 55.43±0.39 90.64±1.05
USL [50] 91.73±0.13 94.89±0.20 95.43±0.15 46.89±0.46 66.75±0.37 72.53±0.32 90.03±0.63 93.10±0.78 75.65±0.60 90.77±0.36
ActiveFT [57] 70.87±4.14 93.85±1.37 95.31±0.75 25.69±0.64 57.19±2.06 70.96±0.75 89.32±1.87 92.53±0.43 55.57±1.42 87.28±1.19
RDSS (Ours) 94.69±0.28 95.21±0.47 95.71±0.10 48.12±0.36 67.27±0.55 73.21±0.29 91.70±0.39 95.70±0.35 77.96±0.52 93.16±0.41
Applied to FreeMatch [53]
Stratified 95.05±0.15 95.40±0.23 95.80±0.29 51.29±0.56 67.69±0.58 73.90±0.53 92.58±1.05 94.22±0.78 79.16±5.01 91.36±0.18
Random 93.41±1.24 93.98±0.91 95.56±0.17 47.16±1.25 66.09±1.08 72.09±0.99 91.62±1.88 94.40±1.28 76.66±2.43 90.72±0.97
k-Means 88.05±5.07 94.80±0.48 95.51±0.37 44.07±1.94 66.09±0.39 71.69±0.72 93.30±0.46 94.68±0.72 63.22±4.92 89.99±0.87
USL [50] 93.81±0.62 95.19±0.18 95.78±0.29 47.07±0.78 66.92±0.33 72.59±0.36 93.36±0.53 94.44±0.44 76.95±0.86 90.58±0.58
ActiveFT [57] 78.13±2.87 94.54±0.81 95.33±0.53 26.67±0.46 56.23±0.85 71.20±0.68 92.60±0.51 93.71±0.54 63.31±2.99 86.60±0.30
RDSS (Ours) 95.05±0.13 95.50±0.20 95.98±0.28 48.41±0.59 67.40±0.23 73.13±0.19 94.54±0.46 95.83±0.37 81.90±1.72 92.22±0.40

7.3 Comparison with AL/SSAL Approaches

First, we compare RDSS against various traditional AL approaches on CIFAR-10/100. The AL approaches include CoreSet [36], VAAL [39], LearnLoss [60] and MCDAL [8]. For a fair comparison with AL approaches, which rely solely on labelled samples for supervised learning, we likewise train with supervised learning on only the samples selected by RDSS. The implementation details are given in Appendix D.5. The experimental results are presented in Table 2, from which we observe that RDSS achieves the highest accuracy under almost all budget settings when relying solely on labelled data for supervised learning, with notable improvements on CIFAR-100.

Second, we compare RDSS with sampling methods used in SSAL when applied to the same SSL framework (i.e., FlexMatch or FreeMatch) on CIFAR-10. The sampling methods include CoreSetSSL [36], MMA [42], CBSSAL [12] and TOD-Semi [17]. In detail, we tune recent SSAL approaches with their public implementations and run experiments under an extremely low-budget setting, i.e., 40 samples in a 20-random-and-20-selected setting. Table 3 illustrates that most SSAL approaches perform below random sampling under extremely low-budget settings. This inefficiency stems from the dependency of sample selection on model performance within the SSAL framework, which struggles when the model is weak. Our model-free method, in contrast, selects samples before training, avoiding these pitfalls.

Table 2: Comparison with AL approaches under the Supervised Learning (SL) paradigm. The best performance is in bold and the second-best is underlined.
Dataset CIFAR-10 CIFAR-100
Budget 7500 10000 7500 10000
CoreSet 85.46 87.56 47.17 53.06
VAAL 86.82 88.97 47.02 53.99
LearnLoss 85.49 87.06 47.81 54.02
MCDAL 87.24 89.40 49.34 54.14
SL+RDSS (Ours) 87.18 89.77 50.13 56.04
Whole Dataset 95.62 78.83
Table 3: Comparison with SSAL approaches. The green (red) arrow represents the improvement (decrease) compared to the random sampling method.
Method FlexMatch FreeMatch
Stratified 91.45 95.05
Random 87.30 93.41
CoreSetSSL 87.66 ↑0.36 91.24 ↓2.17
MMA 74.61 ↓12.69 87.37 ↓6.04
CBSSAL 86.58 ↓0.72 91.68 ↓1.73
TOD-Semi 86.21 ↓1.09 90.77 ↓2.64
RDSS (Ours) 94.69 ↑7.39 95.05 ↑1.64

Third, we directly compare RDSS with the above AL/SSAL approaches when applied to SSL, which may better reflect the paradigm differences. The experimental results and analysis are in Appendix D.6.

7.4 Trade-off Parameter α

We analyze the effect of different α with FreeMatch on CIFAR-10/100. The results are presented in Table 4, from which we make several observations: (1) Our proposed RDSS achieves the highest accuracy under all budget conditions, surpassing variants that employ a fixed value of α; (2) The α values that achieve the best or second-best performance lie within the interval we set, in line with our theoretical derivation in Section 6; (3) Removing the representativeness term (α = 0) or the diversity term (α = 1) degrades performance to varying degrees compared to our approach.

Table 4: Effect of different α. Grey results indicate that α lies outside the interval we set in Section 6, i.e., α < 1 − 1/√m, while black results indicate that α lies within it, i.e., 1 − 1/√m ≤ α ≤ 1. Among them, α = 0 and α = 1 indicate the removal of the representativeness and diversity terms, respectively. The best performance is in bold, and the second-best is underlined.
Dataset CIFAR-10 CIFAR-100
Budget (m) 40 250 4000 400 2500 10000
0 85.54±0.48 93.55±0.34 94.58±0.27 39.26±0.52 63.77±0.26 71.90±0.17
0.40 92.28±0.24 93.68±0.13 94.95±0.12 42.56±0.47 65.88±0.24 71.71±0.29
0.80 94.42±0.49 94.94±0.37 95.15±0.35 45.62±0.35 66.87±0.20 72.45±0.23
0.90 94.33±0.28 95.03±0.21 95.20±0.42 48.12±0.50 67.14±0.16 72.15±0.23
0.95 94.44±0.64 95.07±0.26 95.45±0.38 48.41±0.59 67.11±0.29 72.80±0.35
0.98 94.51±0.39 95.02±0.15 95.31±0.44 48.33±0.54 67.40±0.23 72.68±0.22
1 94.53±0.42 95.01±0.23 95.54±0.25 48.18±0.36 67.20±0.29 73.05±0.18
1 − 1/√m (Ours) 95.05±0.13 95.50±0.20 95.98±0.28 48.41±0.59 67.40±0.23 73.13±0.19
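For reference, the lower endpoint 1 − 1/√m of the interval from Section 6 can be computed directly for each budget in Table 4; it equals exactly 0.95 for m = 400 and 0.98 for m = 2500, which is why rows such as α = 0.40 fall outside the interval for every budget:

```python
import math

for m in [40, 250, 4000, 400, 2500, 10000]:
    print(m, round(1.0 - 1.0 / math.sqrt(m), 4))
# 40 -> 0.8419, 250 -> 0.9368, 4000 -> 0.9842,
# 400 -> 0.95, 2500 -> 0.98, 10000 -> 0.99
```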

8 Conclusion

In this work, we propose a model-free sampling method, RDSS, which selects a subset from unlabeled data for annotation in SSL. The primary innovation of our approach lies in the introduction of α-MMD, designed to evaluate the representativeness and diversity of selected samples. For the low-budget setting, we develop GKHR, a fast and efficient algorithm for this problem based on the Frank-Wolfe algorithm. Both theoretical analyses and empirical experiments demonstrate the effectiveness of RDSS. In future research, we would like to apply our methodology to scenarios where labelling is cost-prohibitive, such as the medical domain.

Acknowledgements

This research was partially supported by the National Natural Science Foundation of China under grant No. 82202984, the Zhejiang Key R&D Program of China under grants No. 2023C03053 and No. 2024SSYS0026, and the US National Science Foundation under grant No. 2316011.

References

  • Ai et al. [2021] M. Ai, J. Yu, H. Zhang, and H. Wang. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2):749–772, 2021.
  • Aronszajn [1950] N. Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.
  • Bach [2021] F. Bach. Learning theory from first principles. Draft of a book, version of Sept, 6:2021, 2021.
  • Bach et al. [2012] F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.
  • Bietti and Mairal [2019] A. Bietti and J. Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. The Journal of Machine Learning Research, 20(1):876–924, 2019.
  • Chan et al. [2021] Y.-C. Chan, M. Li, and S. Oymak. On the marginal benefit of active learning: Does self-supervision eat its cake? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3455–3459. IEEE, 2021.
  • Chen et al. [2012] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. arXiv preprint arXiv:1203.3472, 2012.
  • Cho et al. [2022] J. W. Cho, D.-J. Kim, Y. Jung, and I. S. Kweon. Mcdal: Maximum classifier discrepancy for active learning. IEEE transactions on neural networks and learning systems, 2022.
  • Coates et al. [2011] A. Coates, A. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • Freytag et al. [2014] A. Freytag, E. Rodner, and J. Denzler. Selecting influential examples: Active learning with expected model output changes. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 562–577. Springer, 2014.
  • Gao et al. [2020] M. Gao, Z. Zhang, G. Yu, S. Ö. Arık, L. S. Davis, and T. Pfister. Consistency-based semi-supervised active learning: Towards minimizing labeling cost. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 510–526. Springer, 2020.
  • Graf and Luschgy [2007] S. Graf and H. Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
  • Gretton et al. [2006] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. Advances in neural information processing systems, 19, 2006.
  • Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Huang et al. [2021] S. Huang, T. Wang, H. Xiong, J. Huan, and D. Dou. Semi-supervised active learning with temporal output discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3447–3456, 2021.
  • Joshi et al. [2009] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2372–2379. IEEE, 2009.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Laine and Aila [2016] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2016.
  • Lee et al. [2013] D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
  • Lewis and Catlett [1994] D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
  • Li et al. [2023] M. Li, R. Wu, H. Liu, J. Yu, X. Yang, B. Han, and T. Liu. Instant: Semi-supervised learning with instance-dependent thresholds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Liu et al. [2021] Z. Liu, H. Ding, H. Zhong, W. Li, J. Dai, and C. He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
  • Lohr [2021] S. L. Lohr. Sampling: design and analysis. Chapman and Hall/CRC, 2021.
  • Luo et al. [2013] W. Luo, A. Schwing, and R. Urtasun. Latent structured active learning. Advances in Neural Information Processing Systems, 26, 2013.
  • Mahmood et al. [2021] R. Mahmood, S. Fidler, and M. T. Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
  • Mak and Joseph [2018] S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.
  • Muandet et al. [2017] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • Netzer et al. [2011] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Paulsen and Raghupathi [2016] V. I. Paulsen and M. Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge university press, 2016.
  • Pronzato [2021] L. Pronzato. Performance analysis of greedy algorithms for minimising a maximum mean discrepancy. arXiv preprint arXiv:2101.07564, 2021.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Sajjadi et al. [2016] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29, 2016.
  • Schmutz et al. [2022] H. Schmutz, O. Humbert, and P.-A. Mattei. Don’t fear the unlabelled: safe semi-supervised learning via debiasing. In The Eleventh International Conference on Learning Representations, 2022.
  • Sener and Savarese [2018] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076–3085. PMLR, 2017.
  • Shao et al. [2024] Q. Shao, K. Zhang, B. Du, Z. Li, Y. Wu, Q. Chen, J. Wu, and J. Chen. Comprehensive subset selection for ct volume compression to improve pulmonary disease screening efficiency. In Artificial Intelligence and Data Science for Healthcare: Bridging Data-Centric AI and People-Centric Healthcare, 2024.
  • Sinha et al. [2019] S. Sinha, S. Ebrahimi, and T. Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
  • Sohn et al. [2020] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  • Song et al. [2009] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
  • Song et al. [2019] S. Song, D. Berthelot, and A. Rostamizadeh. Combining mixmatch and active learning for better accuracy with fewer labels. arXiv preprint arXiv:1912.00594, 2019.
  • Sriperumbudur et al. [2012] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On the empirical estimation of integral probability metrics. 2012.
  • Sutskever et al. [2013] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
  • Tarvainen and Valpola [2017] A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
  • Thompson [2012] S. K. Thompson. Sampling, volume 755. John Wiley & Sons, 2012.
  • Tong and Koller [2001] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
  • Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
  • Wang et al. [2016] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
  • Wang et al. [2022a] X. Wang, L. Lian, and S. X. Yu. Unsupervised selective labeling for more effective semi-supervised learning. In European Conference on Computer Vision, pages 427–445. Springer, 2022a.
  • Wang et al. [2022b] X. Wang, Z. Wu, L. Lian, and S. X. Yu. Debiased learning from naturally imbalanced pseudo-labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14647–14657, 2022b.
  • Wang et al. [2022c] Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in Neural Information Processing Systems, 35:3938–3961, 2022c.
  • Wang et al. [2022d] Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, et al. Freematch: Self-adaptive thresholding for semi-supervised learning. arXiv preprint arXiv:2205.07246, 2022d.
  • Wu et al. [2023] X. Wu, Y. Huo, H. Ren, and C. Zou. Optimal subsampling via predictive inference. Journal of the American Statistical Association, (just-accepted):1–29, 2023.
  • Xie et al. [2020a] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le. Unsupervised data augmentation for consistency training. Advances in neural information processing systems, 33:6256–6268, 2020a.
  • Xie et al. [2020b] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020b.
  • Xie et al. [2023] Y. Xie, H. Lu, J. Yan, X. Yang, M. Tomizuka, and W. Zhan. Active finetuning: Exploiting annotation budget in the pretraining-finetuning paradigm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23715–23724, 2023.
  • Xu et al. [2024] Y. Xu, D. Zhang, S. Zhang, S. Wu, Z. Feng, and G. Chen. Predictive and near-optimal sampling for view materialization in video databases. Proceedings of the ACM on Management of Data, 2(1):1–27, 2024.
  • Yang et al. [2023] L. Yang, Z. Zhao, L. Qi, Y. Qiao, Y. Shi, and H. Zhao. Shrinking class space for enhanced certainty in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16187–16196, 2023.
  • Yoo and Kweon [2019] D. Yoo and I. S. Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019.
  • Yu et al. [2022] J. Yu, H. Wang, M. Ai, and H. Zhang. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. Journal of the American Statistical Association, 117(537):265–276, 2022.
  • Zagoruyko and Komodakis [2016] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • Zhang et al. [2021] B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.
  • Zhang and Chen [2021] H. Zhang and S. X. Chen. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1–85, 2021.
  • Zhang et al. [2023a] J. Zhang, C. Meng, J. Yu, M. Zhang, W. Zhong, and P. Ma. An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. Journal of Computational and Graphical Statistics, 32(1):329–339, 2023a.
  • Zhang et al. [2023b] M. Zhang, Y. Zhou, Z. Zhou, and A. Zhang. Model-free subsampling method based on uniform designs. IEEE Transactions on Knowledge and Data Engineering, 2023b.

Appendix A Algorithms

A.1 Derivation of Generalized Kernel Herding (GKH)

Proof.

The proof technique is borrowed from [32]. We first define a weighted modification of α-MMD. For any \mathbf{w}\in\mathbb{R}^{n} such that \mathbf{w}^{\top}\mathbf{1}=1, the weighted α-MMD is defined by

\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}\mathbf{w}-2\alpha\mathbf{w}^{\top}\mathbf{p}+\alpha^{2}\overline{K},

where \mathbf{K}=[k(\mathbf{x}_{i},\mathbf{x}_{j})]_{1\leq i,j\leq n}, \overline{K}=\mathbf{1}^{\top}\mathbf{K}\mathbf{1}/n^{2}, \mathbf{p}=(\mathbf{e}_{1}^{\top}\mathbf{K}\mathbf{1}/n,\cdots,\mathbf{e}_{n}^{\top}\mathbf{K}\mathbf{1}/n), and \{\mathbf{e}_{i}\}_{i=1}^{n} is the standard basis of \mathbb{R}^{n}. It is obvious that for any \mathcal{I}_{p}\subset[n],

\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})=\text{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{p}},\mathbf{X}_{n}),

where (\mathbf{w}_{p})_{i}=1/p if i\in\mathcal{I}_{p}, and (\mathbf{w}_{p})_{i}=0 otherwise. Therefore, the weighted α-MMD is indeed a generalization of α-MMD. Letting

\mathbf{K}_{*}=\mathbf{K}-2\alpha\mathbf{p}\mathbf{1}^{\top}+\alpha^{2}\overline{K}\mathbf{1}\mathbf{1}^{\top},

we obtain the quadratic-form expression of the weighted α-MMD as \text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}_{*}\mathbf{w}, where \mathbf{K}_{*} is strictly positive definite if \mathbf{w}\not=\mathbf{w}_{n} and k is a characteristic kernel, according to [32]. Recalling our low-budget setting and choice of kernel, \mathbf{K}_{*} is indeed strictly positive definite. Thus \text{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}} is a convex functional w.r.t. \mathbf{w}, so \min_{\mathbf{w}^{\top}\mathbf{1}=1}\text{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}) can be solved by the Frank-Wolfe algorithm. Then for 1\leq p<n,

\mathbf{s}_{p}\in\mathop{\arg\min}_{\mathbf{s}^{\top}\mathbf{1}=1}\mathbf{s}^{\top}(\mathbf{K}\mathbf{w}_{p}-\alpha\mathbf{p})=\mathop{\arg\min}_{\mathbf{e}_{i},i\in[n]}\mathbf{e}_{i}^{\top}(\mathbf{K}\mathbf{w}_{p}-\alpha\mathbf{p}).

Letting \mathbf{e}_{i_{p}}=\mathbf{s}_{p}, under uniform step size we have

\mathbf{w}_{p+1}=\left(\frac{p}{p+1}\right)\mathbf{w}_{p}+\frac{1}{p+1}\mathbf{e}_{i_{p}}

as the update formula of the Frank-Wolfe algorithm, which is equivalent to

i^{*}_{p}\in\mathop{\arg\min}_{i\in[n]}\sum_{j\in\mathcal{I}^{*}_{p}}k(\mathbf{x}_{i},\mathbf{x}_{j})-\frac{\alpha p}{n}\sum_{l=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{l}).

Setting \mathbf{w}_{0}=\mathbf{0}, we immediately derive the iterating formula in (7). ∎
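The claim that the weighted α-MMD generalizes α-MMD can be checked numerically. The sketch below, using a toy Gaussian kernel and an arbitrary index set (both placeholders), evaluates the subset form and the weighted quadratic form and confirms that they coincide for uniform weights:

```python
import numpy as np

def alpha_mmd_sq_subset(K, idx, alpha):
    """MMD_{k,alpha}^2(X_I, X_n) computed directly from the kernel matrix K."""
    return (K[np.ix_(idx, idx)].mean()
            - 2.0 * alpha * K[idx, :].mean()
            + alpha ** 2 * K.mean())

def alpha_mmd_sq_weighted(K, w, alpha):
    """Weighted form w^T K w - 2 alpha w^T p + alpha^2 K_bar, with p = K 1 / n."""
    p = K.mean(axis=1)
    return w @ K @ w - 2.0 * alpha * (w @ p) + alpha ** 2 * K.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)

idx = [0, 7, 21, 33]
w = np.zeros(50)
w[idx] = 1.0 / len(idx)  # uniform weights on the subset, as in the proof
assert np.isclose(alpha_mmd_sq_subset(K, idx, 0.9),
                  alpha_mmd_sq_weighted(K, w, 0.9))
```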

A.2 Pseudo Codes

Algorithm 2 Generalized Kernel Herding
0:  Data set \mathbf{X}_{n}=\{\mathbf{x}_{1},\cdots,\mathbf{x}_{n}\}\subset\mathcal{X}; the number of selected samples m<n; a positive definite, characteristic and radial kernel k(\cdot,\cdot) on \mathcal{X}\times\mathcal{X}; trade-off parameter \alpha\leq 1.
0:  Selected samples \mathbf{X}_{\mathcal{I}^{*}_{m}}=\{\mathbf{x}_{i^{*}_{1}},\cdots,\mathbf{x}_{i^{*}_{m}}\}.
1:  For each \mathbf{x}_{i}\in\mathbf{X}_{n}, calculate \mu(\mathbf{x}_{i}):=\sum_{j=1}^{n}k(\mathbf{x}_{j},\mathbf{x}_{i})/n.
2:  Set \beta_{1}=1, S_{0}=0, \mathcal{I}^{*}_{1}=\emptyset.
3:  for p\in\{1,\cdots,m\} do
4:     i^{*}_{p}\in\mathop{\arg\min}_{i\in[n]}S_{p-1}(\mathbf{x}_{i})-\alpha\mu(\mathbf{x}_{i})
5:     For all i\in[n], update S_{p}(\mathbf{x}_{i})=(1-\beta_{p})S_{p-1}(\mathbf{x}_{i})+\beta_{p}k(\mathbf{x}_{i^{*}_{p}},\mathbf{x}_{i})
6:     \mathcal{I}^{*}_{p+1}\leftarrow\mathcal{I}^{*}_{p}\cup\{i^{*}_{p}\}, set \beta_{p+1}=1/(p+1).
7:  end for
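A minimal NumPy sketch of Algorithm 2 follows, assuming a precomputed kernel matrix; it implements GKH only (the GKHR refinement mentioned in the conclusion is not shown), and as a Frank-Wolfe method it may in principle revisit an already-selected index:

```python
import numpy as np

def generalized_kernel_herding(K, m, alpha):
    """Greedy GKH selection on a precomputed (n, n) kernel matrix K.

    At step p it picks i*_p = argmin_i S_{p-1}(x_i) - alpha * mu(x_i),
    then updates the running average S_p with step size beta_p = 1/p.
    """
    n = K.shape[0]
    mu = K.mean(axis=0)        # mu(x_i) = sum_j k(x_j, x_i) / n
    S = np.zeros(n)            # S_0 = 0
    selected = []
    for p in range(1, m + 1):
        i_star = int(np.argmin(S - alpha * mu))
        selected.append(i_star)
        beta = 1.0 / p         # uniform Frank-Wolfe step size
        S = (1.0 - beta) * S + beta * K[i_star]
    return selected

# Toy run on 200 random points with a Gaussian kernel (placeholder data).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 8.0)
m = 40
idx = generalized_kernel_herding(K, m, alpha=1.0 - 1.0 / np.sqrt(m))
```

With S_0 = 0, the first pick maximizes αμ, i.e., it is the point whose mean kernel similarity to the whole dataset is largest; subsequent picks trade representativeness against similarity to already-selected points.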

Appendix B Technical Lemmas

Lemma B.1 (Lemma 2 [32]).

Let (t_{k})_{k} and (\alpha_{k})_{k} be two real positive sequences and A be a strictly positive real. If t_{k} satisfies

t_{1}\leq A\text{ and }t_{k+1}\leq\left(1-\alpha_{k+1}\right)t_{k}+A\alpha_{k+1}^{2},\quad k\geq 1,

with \alpha_{k}=1/k for all k, then t_{k}<A(2+\log k)/(k+1) for all k>1.

Lemma B.2.

The selected samples \mathbf{X}_{\mathcal{I}^{*}_{m}} generated by GKH (Algorithm 2) satisfy

\operatorname{MMD}_{k,\alpha}^{2}\left(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}\right)\leq M_{\alpha}^{2}+B\frac{2+\log m}{m+1} (10)

where B=2K, 0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K, and M_{\alpha}^{2} is defined by

M_{\alpha}^{2}:=\min_{\mathbf{w}^{\top}\mathbf{1}=1,\mathbf{w}\geq 0}\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}).
Proof.

Following the notations in Appendix A and letting \mathbf{p}_{\alpha}=\alpha\mathbf{p}, we can directly follow the proof of the finite-sample-size error bound for kernel herding with predefined step sizes given by [32] to derive Lemma B.2, without any additional technique. The detailed proof is omitted. ∎

Lemma B.3.

Let \mathcal{H} be an RKHS over \mathcal{X} associated with a positive definite kernel k, and 0\leq\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x})\leq K. Let \mathbf{X}_{m}=\{\mathbf{x}_{i}\}_{i=1}^{m} and \mathbf{Y}_{n}=\{\mathbf{y}_{j}\}_{j=1}^{n}, where \mathbf{x}_{i},\mathbf{y}_{j}\in\mathcal{X}. Then for any \alpha\leq 1,

|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})|\leq(1-\alpha)\sqrt{K}.
Proof.
\left|\operatorname{MMD}_{k,\alpha}(\mathbf{X}_{m},\mathbf{Y}_{n})-\operatorname{MMD}_{k}(\mathbf{X}_{m},\mathbf{Y}_{n})\right|
=\left|\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f(\mathbf{x}_{i})-\frac{\alpha}{n}\sum_{j=1}^{n}f(\mathbf{y}_{j})\right)-\sup_{\|f\|_{\mathcal{H}}\leq 1}\left(\frac{1}{m}\sum_{i=1}^{m}f(\mathbf{x}_{i})-\frac{1}{n}\sum_{j=1}^{n}f(\mathbf{y}_{j})\right)\right|
\leq\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\frac{1-\alpha}{n}\sum_{j=1}^{n}f(\mathbf{y}_{j})\right|=\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}f(\mathbf{y}_{j})\right|
=\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\left|\sum_{j=1}^{n}\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right|\leq\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\left|\left\langle f,k(\cdot,\mathbf{y}_{j})\right\rangle_{\mathcal{H}}\right|
\leq\left(\frac{1-\alpha}{n}\right)\sup_{\|f\|_{\mathcal{H}}\leq 1}\sum_{j=1}^{n}\|f\|_{\mathcal{H}}\|k(\cdot,\mathbf{y}_{j})\|_{\mathcal{H}}\leq(1-\alpha)\sqrt{K}. ∎
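Lemma B.3 can be sanity-checked numerically with a Gaussian kernel, for which K = max k(x, x) = 1. The sketch below, on placeholder random data, compares MMD_{k,α} against the standard MMD for several values of α:

```python
import numpy as np

def alpha_mmd(Kxx, Kxy, Kyy, alpha):
    """MMD_{k,alpha}(X, Y) = || mean phi(X) - alpha * mean phi(Y) ||_H."""
    sq = Kxx.mean() - 2.0 * alpha * Kxy.mean() + alpha ** 2 * Kyy.mean()
    return np.sqrt(max(sq, 0.0))

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(20, 4)), rng.normal(size=(60, 4))
gauss = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / 2.0)
Kxx, Kxy, Kyy = gauss(X, X), gauss(X, Y), gauss(Y, Y)

for alpha in (0.5, 0.8, 0.95, 1.0):
    gap = abs(alpha_mmd(Kxx, Kxy, Kyy, alpha) - alpha_mmd(Kxx, Kxy, Kyy, 1.0))
    # |MMD_alpha - MMD| <= (1 - alpha) * sqrt(K), with K = 1 here
    assert gap <= (1.0 - alpha) + 1e-12
```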

Lemma B.4 (Proposition 12.31 [48]).

Suppose that \mathcal{H}_{1} and \mathcal{H}_{2} are reproducing kernel Hilbert spaces of real-valued functions with domains \mathcal{X}_{1} and \mathcal{X}_{2}, equipped with kernels k_{1} and k_{2}, respectively. Then the tensor product space \mathcal{H}=\mathcal{H}_{1}\otimes\mathcal{H}_{2} is an RKHS of real-valued functions with domain \mathcal{X}_{1}\times\mathcal{X}_{2} and kernel function

k\left((x_{1},x_{2}),(x_{1}^{\prime},x_{2}^{\prime})\right)=k_{1}(x_{1},x_{1}^{\prime})k_{2}(x_{2},x_{2}^{\prime}).
Lemma B.5 (Theorem 5.7 [31]).

Let f\in\mathcal{H}_{1} and g\in\mathcal{H}_{2}, where \mathcal{H}_{1},\mathcal{H}_{2} are two RKHSs of real-valued functions on \mathcal{X}, associated with positive definite kernels k_{1},k_{2} and canonical feature maps \phi_{1},\phi_{2}. Then for any x\in\mathcal{X},

f(x)+g(x)=\left\langle f,\phi_{1}(x)\right\rangle_{\mathcal{H}_{1}}+\left\langle g,\phi_{2}(x)\right\rangle_{\mathcal{H}_{2}}=\left\langle f+g,(\phi_{1}+\phi_{2})(x)\right\rangle_{\mathcal{H}_{1}+\mathcal{H}_{2}},

where

\mathcal{H}_{1}+\mathcal{H}_{2}=\{f_{1}+f_{2}\,|\,f_{i}\in\mathcal{H}_{i}\}

and \phi_{1}+\phi_{2} is the canonical feature map of \mathcal{H}_{1}+\mathcal{H}_{2}. Furthermore,

\|f+g\|_{\mathcal{H}_{1}+\mathcal{H}_{2}}^{2}\leq\|f\|_{\mathcal{H}_{1}}^{2}+\|g\|_{\mathcal{H}_{2}}^{2}.
Lemma B.6.

For any unlabeled dataset 𝐗n𝒳\mathbf{X}_{n}\subset\mathcal{X} and any subset 𝐗m\mathbf{X}_{\mathcal{I}_{m}},

MMDk,α2(𝐗n,𝐗n)=(1α)2K¯,MMDk,α2(𝐗m,𝐗n)(1+α2)K,\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{n},\mathbf{X}_{n})=(1-\alpha)^{2}\overline{K},\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}_{m}},\mathbf{X}_{n})\leq(1+\alpha^{2})K,

where K¯=i=1nj=1nk(𝐱i,𝐱j)/n2\overline{K}=\sum_{i=1}^{n}\sum_{j=1}^{n}k(\mathbf{x}_{i},\mathbf{x}_{j})/n^{2}, K=max𝐱𝒳k(𝐱,𝐱)K=\max_{\mathbf{x}\in\mathcal{X}}k(\mathbf{x},\mathbf{x}).

Lemma B.6 follows directly from the definition of \alpha-MMD.
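As a quick sanity check, the first identity in Lemma B.6 can be verified numerically. The sketch below is illustrative only: it assumes the weighted quadratic form of \alpha-MMD used in Appendix A, \operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\mathbf{w}^{\top}\mathbf{K}\mathbf{w}-2\alpha\mathbf{p}^{\top}\mathbf{w}+\alpha^{2}\overline{K} with \mathbf{p}=\mathbf{K}\mathbf{1}/n (consistent with Eq. (11), which evaluates to (1-\alpha)^{2}\overline{K} at the uniform weights \mathbf{w}=\mathbf{1}/n), and uses a Gaussian kernel on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 200, 0.05
X = rng.normal(size=(n, 2))

# Gaussian (RBF) kernel matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Kmat = np.exp(-sq / 2.0)

Kbar = Kmat.mean()           # K-bar = sum_{ij} k(x_i, x_j) / n^2
p = Kmat.mean(axis=1)        # p_i = (1/n) sum_j k(x_i, x_j)
w = np.full(n, 1.0 / n)      # uniform weights w_* = 1/n

# assumed weighted form of alpha-MMD^2 (matches Eq. (11) at w = 1/n)
mmd2 = w @ Kmat @ w - 2 * alpha * (p @ w) + alpha**2 * Kbar
assert np.isclose(mmd2, (1 - alpha) ** 2 * Kbar)
```

At w = 1/n we have w^T K w = p^T w = K-bar, so the expression collapses to (1 - 2\alpha + \alpha^2) K-bar = (1-\alpha)^2 K-bar, matching the lemma.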

Appendix C Proof of Theorems

Proof for Theorem 5.4.

The proof borrows the technique introduced in [37] for decomposing the expected risk of hypotheses.

Firstly, let us denote that 4=11+12+3\mathcal{H}_{4}=\mathcal{H}_{1}\otimes\mathcal{H}_{1}+\mathcal{H}_{1}\otimes\mathcal{H}_{2}+\mathcal{H}_{3}, with kernel k4=k12+k1k2+k3k_{4}=k_{1}^{2}+k_{1}k_{2}+k_{3} and canonical feature map ϕ4=ϕ1ϕ1+ϕ1ϕ2+ϕ3\phi_{4}=\phi_{1}\otimes\phi_{1}+\phi_{1}\otimes\phi_{2}+\phi_{3}.

Under the assumptions in Theorem 5.4, according to Theorem 4 in [41], we have for any 𝐱𝒳\mathbf{x}\in\mathcal{X},

h(𝐱)=h,ϕ1(𝐱)1,𝔼[Y|𝐱]=𝔼[Y|X],ϕ2(𝐱)2,h(\mathbf{x})=\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}},\mathbb{E}[Y|\mathbf{x}]=\left\langle\mathbb{E}[Y|X],\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}},
Var(Y|𝐱)=Var(Y|X),ϕ3(𝐱)3\operatorname{Var}(Y|\mathbf{x})=\left\langle\operatorname{Var}(Y|X),\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}

where \phi_{1},\phi_{2},\phi_{3} are the canonical feature maps of \mathcal{H}_{1},\mathcal{H}_{2},\mathcal{H}_{3}, respectively. Denote m=\mathbb{E}[Y|X] and s=\operatorname{Var}(Y|X). Now, by definition,

R(h)=𝔼[(h(𝐱),y)]=𝒳𝒴(h(𝐱),y)p(y|𝐱)p(𝐱)𝑑𝐱𝑑y=𝒳f(𝐱)p(𝐱)𝑑𝐱R(h)=\mathbb{E}\left[\ell(h(\mathbf{x}),y)\right]=\int_{\mathcal{X}}\int_{\mathcal{Y}}\ell(h(\mathbf{x}),y)p(y|\mathbf{x})p(\mathbf{x})d\mathbf{x}dy=\int_{\mathcal{X}}f(\mathbf{x})p(\mathbf{x})d\mathbf{x}

where

f(x)\displaystyle f(x) =𝒴(yh(𝐱))2p(y|𝐱)𝑑y\displaystyle=\int_{\mathcal{Y}}(y-h(\mathbf{x}))^{2}p(y|\mathbf{x})dy
=Var(Y|𝐱)2h(𝐱)𝔼[Y|𝐱]+h2(𝐱)\displaystyle=\operatorname{Var}(Y|\mathbf{x})-2h(\mathbf{x})\mathbb{E}[Y|\mathbf{x}]+h^{2}(\mathbf{x})
=s,ϕ3(𝐱)32h,ϕ1(𝐱)1m,ϕ2(𝐱)2+h,ϕ1(𝐱)1h,ϕ1(𝐱)1\displaystyle=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-2\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle m,\phi_{2}(\mathbf{x})\right\rangle_{\mathcal{H}_{2}}+\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}\left\langle h,\phi_{1}(\mathbf{x})\right\rangle_{\mathcal{H}_{1}}
=s,ϕ3(𝐱)32hm,(ϕ1ϕ2)(𝐱)12+hh,(ϕ1ϕ1)(𝐱)11\displaystyle=\left\langle s,\phi_{3}(\mathbf{x})\right\rangle_{\mathcal{H}_{3}}-\left\langle 2h\otimes m,(\phi_{1}\otimes\phi_{2})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{2}}+\left\langle h\otimes h,(\phi_{1}\otimes\phi_{1})(\mathbf{x})\right\rangle_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}}
=s2hm+hh,ϕ4(x)4\displaystyle=\left\langle s-2h\otimes m+h\otimes h,\phi_{4}(x)\right\rangle_{\mathcal{H}_{4}}

where the fourth equality holds by Lemma B.4 and the last equality holds by Lemma B.5, then f4f\in\mathcal{H}_{4}, and

f4\displaystyle\|f\|_{\mathcal{H}_{4}} =s2hm+hh4\displaystyle=\|s-2h\otimes m+h\otimes h\|_{\mathcal{H}_{4}}
s4+2hm4+hh4\displaystyle\leq\|s\|_{\mathcal{H}_{4}}+\|2h\otimes m\|_{\mathcal{H}_{4}}+\|h\otimes h\|_{\mathcal{H}_{4}}
s3+2m2h1+hh11\displaystyle\leq\|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\otimes h\|_{\mathcal{H}_{1}\otimes\mathcal{H}_{1}}
=s3+2m2h1+h12\displaystyle=\|s\|_{\mathcal{H}_{3}}+2\|m\|_{\mathcal{H}_{2}}\|h\|_{\mathcal{H}_{1}}+\|h\|_{\mathcal{H}_{1}}^{2}
Kh2+2KhKm+Ks\displaystyle\leq K_{h}^{2}+2K_{h}K_{m}+K_{s}

where the second inequality holds by Lemma B.5. Therefore, letting \beta=1/(K_{h}^{2}+2K_{h}K_{m}+K_{s}), we have \|\beta f\|_{\mathcal{H}_{4}}=\beta\|f\|_{\mathcal{H}_{4}}\leq 1. Then

|R^T(h)R^S(h)|\displaystyle\left|\widehat{R}_{T}(h)-\widehat{R}_{S}(h)\right|
=\displaystyle= |𝒳f(𝐱)𝑑PT(𝐱)𝒳f(𝐱)𝑑PS(𝐱)|\displaystyle\left|\int_{\mathcal{X}}f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})dP_{S}(\mathbf{x})\right|
=\displaystyle= (Kh2+2KhKm+Ks)|𝒳βf(𝐱)𝑑PT(𝐱)𝒳βf(𝐱)𝑑PS(𝐱)|\displaystyle(K_{h}^{2}+2K_{h}K_{m}+K_{s})\left|\int_{\mathcal{X}}\beta f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}\beta f(\mathbf{x})dP_{S}(\mathbf{x})\right|
\displaystyle\leq (Kh2+2KhKm+Ks)supf41|𝒳f(𝐱)𝑑PT(𝐱)𝒳f(𝐱)𝑑PS(𝐱)|\displaystyle(K_{h}^{2}+2K_{h}K_{m}+K_{s})\sup_{\|f\|_{\mathcal{H}_{4}}\leq 1}\left|\int_{\mathcal{X}}f(\mathbf{x})dP_{T}(\mathbf{x})-\int_{\mathcal{X}}f(\mathbf{x})dP_{S}(\mathbf{x})\right|
=\displaystyle= (Kh2+2KhKm+Ks)MMDk4(𝐗S,𝐗T)\displaystyle(K_{h}^{2}+2K_{h}K_{m}+K_{s})\operatorname{MMD}_{k_{4}}(\mathbf{X}_{S},\mathbf{X}_{T})

where P_{T} denotes the empirical distribution constructed from \mathbf{X}_{T}, and likewise for P_{S}. Recalling Lemma B.3, we obtain Theorem 5.4. ∎

Proof for Theorem 5.6.

Following the notations in Appendix A, we further define

𝐰=𝟏/n,Cα2=MMDk,α,𝐗n2(𝐰)=(1α)2K¯\mathbf{w}_{*}=\mathbf{1}/n,C^{2}_{\alpha}=\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w}_{*})=(1-\alpha)^{2}\overline{K} (11)
𝐰^=argmin𝟏𝐰=1MMDk,α,𝐗n2(𝐰)=α(𝐊1𝐊1𝟏𝟏𝐊1𝟏𝐊1𝟏)𝐩+𝐊1𝟏𝟏𝐊1𝟏\widehat{\mathbf{w}}=\mathop{\arg\min}_{\mathbf{1}^{\top}\mathbf{w}=1}\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})=\alpha\left(\mathbf{K}^{-1}-\frac{\mathbf{K}^{-1}\mathbf{1}\mathbf{1}^{\top}\mathbf{K}^{-1}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}\right)\mathbf{p}+\frac{\mathbf{K}^{-1}\mathbf{1}}{\mathbf{1}^{\top}\mathbf{K}^{-1}\mathbf{1}}

Let 𝐩α=α𝐩\mathbf{p}_{\alpha}=\alpha\mathbf{p}, we have (𝐩α𝐊𝐰^)𝟏(\mathbf{p}_{\alpha}-\mathbf{K}\widehat{\mathbf{w}})\propto\mathbf{1}. Define

Δα(𝐰):=MMDk,α,𝐗n2(𝐰)Cα2=g^(𝐰)g^(𝐰)\Delta_{\alpha}(\mathbf{w}):=\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w})-C_{\alpha}^{2}=\widehat{g}(\mathbf{w})-\widehat{g}(\mathbf{w}_{*})

where \widehat{g}(\mathbf{w})=\left(\mathbf{w}-\widehat{\mathbf{w}}\right)^{\top}\mathbf{K}\left(\mathbf{w}-\widehat{\mathbf{w}}\right). We omit the details of proving this equality, as they are given in full by the proof of the alternative expression of MMD in Pronzato [32]. By the convexity of \widehat{g}(\cdot), for j=\mathop{\arg\min}_{i\in[n]\backslash\mathcal{I}^{*}_{p}}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i}),

g^(𝐰)g^(𝐰p)+2(𝐰𝐰p)𝐊(𝐰p𝐰^)g^(𝐰p)+2minj[n]\p(𝐞j𝐰p)𝐊(𝐰p𝐰^)\widehat{g}\left(\mathbf{w}_{*}\right)\geq\widehat{g}\left(\mathbf{w}_{p}\right)+2\left(\mathbf{w}_{*}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)\geq\widehat{g}\left(\mathbf{w}_{p}\right)+2\min_{j\in[n]\backslash\mathcal{I}^{*}_{p}}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)

where the second inequality holds under the assumption in Theorem 5.6, since

(𝐰𝐞j)𝐊(𝐰p𝐰^)\displaystyle\left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right) =(𝐰𝐞j)(𝐊𝐰p𝐩α)\displaystyle=\left(\mathbf{w}_{*}-\mathbf{e}_{j}\right)^{\top}\left(\mathbf{K}\mathbf{w}_{p}-\mathbf{p}_{\alpha}\right)
=i=1nfp(𝐱i)nfp(𝐱jp+1)i=1nfp(𝐱i)nfp(𝐱j)0\displaystyle=\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j_{p+1}})\geq\frac{\sum_{i=1}^{n}f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{i})}{n}-f_{\mathcal{I}^{*}_{p}}(\mathbf{x}_{j})\geq 0

therefore, we have for B=2KB=2K,

Δα(𝐰p+1)\displaystyle\Delta_{\alpha}(\mathbf{w}_{p+1}) (12)
=\displaystyle= g^(𝐰p)g^(𝐰)+2p+1(𝐞j𝐰p)𝐊(𝐰p𝐰^)+1(p+1)2(𝐞j𝐰p)𝐊(𝐞j𝐰p)\displaystyle\widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right)+\frac{2}{p+1}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{w}_{p}-\widehat{\mathbf{w}}\right)+\frac{1}{(p+1)^{2}}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)
\displaystyle\leq\frac{p}{p+1}\left(\widehat{g}\left(\mathbf{w}_{p}\right)-\widehat{g}\left(\mathbf{w}_{*}\right)\right)+\frac{1}{(p+1)^{2}}B=\frac{p}{p+1}\Delta_{\alpha}(\mathbf{w}_{p})+\frac{1}{(p+1)^{2}}B

where 𝐰p+1=p𝐰p/(p+1)+𝐞j/(p+1)\mathbf{w}_{p+1}=p\mathbf{w}_{p}/(p+1)+\mathbf{e}_{j}/(p+1), and obviously BB upper bounds (𝐞j𝐰p)𝐊(𝐞j𝐰p)\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right)^{\top}\mathbf{K}\left(\mathbf{e}_{j}-\mathbf{w}_{p}\right). Since α1\alpha\leq 1, it holds from Lemma B.6 that

Δα(𝐰1)MMDk,α,𝐗n2(𝐰1)(1+α2)KB\Delta_{\alpha}(\mathbf{w}_{1})\leq\operatorname{MMD}^{2}_{k,\alpha,\mathbf{X}_{n}}(\mathbf{w}_{1})\leq(1+\alpha^{2})K\leq B

therefore by Lemma B.1, we have

\operatorname{MMD}_{k,\alpha}^{2}(\mathbf{X}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n})=\operatorname{MMD}_{k,\alpha,\mathbf{X}_{n}}^{2}(\mathbf{w}_{p})\leq C_{\alpha}^{2}+B\frac{2+\log m}{m+1},

which completes the proof. ∎
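The bound above can be checked empirically with a minimal greedy herding sketch. Everything here is an assumption-laden illustration, not the paper's implementation: we assume the weighted form of \alpha-MMD from Appendix A, the greedy rule j=\arg\min_{i} f_{\mathcal{I}}(\mathbf{x}_{i}) with f_{\mathcal{I}}(\mathbf{x}_{i})=(\mathbf{K}\mathbf{w}_{p})_{i}-\alpha p_{i} (restricted to unselected indices, mimicking GKHR), and a Gaussian kernel on synthetic data, for which K=\max_{\mathbf{x}}k(\mathbf{x},\mathbf{x})=1 and hence B=2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 50
alpha = m / n                  # alpha set to m/n, as in Appendix D.1
X = rng.normal(size=(n, 2))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Kmat = np.exp(-sq / 2.0)       # Gaussian kernel, so K = max k(x, x) = 1
p = Kmat.mean(axis=1)
Kbar = Kmat.mean()

selected = []
s = np.zeros(n)                # running sums: sum_{l in I} k(x_i, x_l)
for t in range(1, m + 1):
    # f_I(x_i) = (K w_p)_i - alpha * p_i, with w_p uniform on selected points
    score = (s / (t - 1) if t > 1 else np.zeros(n)) - alpha * p
    score[selected] = np.inf   # restrict the argmin to unselected indices
    j = int(np.argmin(score))
    selected.append(j)
    s += Kmat[:, j]            # O(n) update per step => O(mn) overall

w = np.zeros(n)
w[selected] = 1.0 / m
mmd2 = w @ Kmat @ w - 2 * alpha * (p @ w) + alpha**2 * Kbar

C2 = (1 - alpha) ** 2 * Kbar   # C_alpha^2 from Eq. (11)
B = 2.0                        # B = 2K with K = 1
assert mmd2 <= C2 + B * (2 + np.log(m)) / (m + 1)
```

The incremental update of the running sums `s` is also what gives the O(mn) total cost reported for RDSS in Appendix D.4.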

Appendix D Additional Experimental Details and Results

D.1 Supplementary Numerical Experiments on GKHR

Considering that GKH is a convergent algorithm (Lemma B.2) and that the finite-sample-size error bound (10) holds without any assumptions on the data, we conduct numerical experiments to empirically compare GKHR with GKH on datasets generated from four different distributions on \mathbb{R}^{2}.

Firstly, we define four distributions on 2\mathbb{R}^{2}:

  1. Gaussian mixture model 1, which consists of four Gaussian distributions G_{1},G_{2},G_{3},G_{4} with mixture weights [0.95,0.01,0.02,0.02];

  2. Gaussian mixture model 2, which consists of the same four Gaussian distributions with mixture weights [0.3,0.2,0.15,0.35];

  3. Uniform distribution 1, which consists of a uniform distribution inside a circle with radius 0.5 and a uniform distribution on an annulus with inner radius 4 and outer radius 6;

  4. Uniform distribution 2, defined on [-10,10]^{2}.

where

G1=𝒩([12],[2005]),G2=𝒩([35],[1002])G_{1}=\mathcal{N}\left(\begin{bmatrix}1\\ 2\end{bmatrix},\begin{bmatrix}2&0\\ 0&5\end{bmatrix}\right),G_{2}=\mathcal{N}\left(\begin{bmatrix}-3\\ -5\end{bmatrix},\begin{bmatrix}1&0\\ 0&2\end{bmatrix}\right)
G3=𝒩([54],[8006]),G4=𝒩([1510],[4009])G_{3}=\mathcal{N}\left(\begin{bmatrix}-5\\ 4\end{bmatrix},\begin{bmatrix}8&0\\ 0&6\end{bmatrix}\right),G_{4}=\mathcal{N}\left(\begin{bmatrix}15\\ 10\end{bmatrix},\begin{bmatrix}4&0\\ 0&9\end{bmatrix}\right)
Figure 2: The performance comparison between GKHR and GKH with different m,nm,n over ten independent runs. The blue line is the mean value of DD, the red dotted line over (under) the blue line is the mean value of DD plus (minus) its standard deviation, and the pink area is the area between the upper and lower red dotted lines.

To consistently evaluate the performance gap between GKHR and GKH at the same order of magnitude, we propose the following criterion

D=D1D2D1+D2D=\frac{D_{1}-D_{2}}{D_{1}+D_{2}}

where D_{1}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(1)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}) and D_{2}=\operatorname{MMD}^{2}_{k,\alpha}(\mathbf{X}^{(2)}_{\mathcal{I}^{*}_{m}},\mathbf{X}_{n}), with \mathbf{X}^{(1)}_{\mathcal{I}_{m}} the samples selected by GKHR and \mathbf{X}^{(2)}_{\mathcal{I}_{m}} the samples selected by GKH. A positive value of D implies that GKH outperforms GKHR, a negative value implies that GKHR outperforms GKH, and a larger absolute value of D indicates a larger performance gap.
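For concreteness, the criterion D can be computed as sketched below. The kernel, the synthetic data, and the two candidate subsets (random placeholders, since the GKH/GKHR selections themselves are not reproduced here) are all illustrative assumptions, as is the weighted form of \alpha-MMD from Appendix A.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, alpha = 1000, 100, 0.1
X = rng.normal(size=(n, 2))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Kmat = np.exp(-sq / 2.0)       # Gaussian kernel
p = Kmat.mean(axis=1)
Kbar = Kmat.mean()

def mmd2_alpha(idx):
    """alpha-MMD^2 between the subset X[idx] and the full dataset."""
    w = np.zeros(n)
    w[idx] = 1.0 / len(idx)
    return w @ Kmat @ w - 2 * alpha * (p @ w) + alpha**2 * Kbar

# stand-ins for the GKHR and GKH selections
idx1 = rng.choice(n, size=m, replace=False)
idx2 = rng.choice(n, size=m, replace=False)
D1, D2 = mmd2_alpha(idx1), mmd2_alpha(idx2)
D = (D1 - D2) / (D1 + D2)      # D > 0: GKH better; D < 0: GKHR better
assert -1.0 <= D <= 1.0
```

Since each \alpha-MMD^{2} value is a squared RKHS norm and hence nonnegative, D is automatically confined to [-1, 1], which is what makes it a scale-free comparison.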

The experiments are conducted as follows. We generate 1000, 3000, 10000 and 30000 random samples from the four distributions separately, then use GKHR and GKH for sample selection under the low-budget setting, i.e., m/n\leq 0.2, with \alpha set to m/n. We report the results over ten independent runs in Figure 2. Although the performance gap tends to grow as m grows, the performance of GKHR is similar to that of GKH when m is relatively small. Therefore, under the low-budget setting, GKHR and GKH perform similarly in minimizing \alpha-MMD over various types of distributions, which convinces us that GKHR works well for the sample selection task.

D.2 Datasets

For our experiments, we choose five common datasets: CIFAR-10/100, SVHN, STL-10 and ImageNet. CIFAR-10 and CIFAR-100 contain 60,000 images with 10 and 100 categories, respectively, of which 50,000 are for training and 10,000 for testing; SVHN contains 73,257 images for training and 26,032 for testing; STL-10 contains 5,000 images for training, 8,000 for testing and 100,000 unlabeled images as extra training data. ImageNet spans 1,000 object classes and contains 1,281,167 training and 100,000 test images. The training sets of these datasets serve as the unlabeled pools for sample selection.

D.3 Visualization of Selected Samples

To offer a more intuitive comparison between sampling methods, we visualise the samples chosen by stratified, random, k-Means, USL, ActiveFT and RDSS (ours). We generate 5,000 samples from a Gaussian mixture model on \mathbb{R}^{2} with 10 components and uniform mixture weights, and select 100 samples from the entire dataset with each method. The visualisation results in Figure 3 indicate that the samples selected by RDSS are distributed more similarly to the entire dataset than those of the other methods.

Figure 3: Visualization of selected samples using different sampling methods. Points of different colours represent samples from different classes, while black points indicate the selected samples.

D.4 Computational Complexity and Running Time

We analyse the time complexity of the various sampling methods and record the time each requires to select 400 samples on the CIFAR-100 dataset. The results are presented in Table 5, where m represents the annotation budget, n denotes the total number of samples, and T indicates the number of iterations. The sampling time is obtained by averaging the duration of three independent runs of the sampling code on an idle server. As the results illustrate, the sampling efficiency of our method surpasses that of all other methods except random and stratified sampling, likely because the execution time of the other algorithms also depends on the number of iterations T.

Table 5: Efficiency comparison with other sampling methods.
Method Time complexity Time (s)
Random O(n)O(n) 0\approx 0
Stratified O(n)O(n) 0\approx 0
kk-means O(mnT)O(mnT) 579.97579.97
USL O(mnT)O(mnT) 257.68257.68
ActiveFT O(mnT)O(mnT) 224.35224.35
RDSS (Ours) O(mn)O(mn) 132.77132.77

D.5 Implementation Details of Supervised Learning Experiments

We use ResNet-18 [16] as the classification model for all AL approaches and our method. Specifically, we train the models for 300 epochs using the SGD optimizer (initial learning rate 0.1, weight decay 5e-4, momentum 0.9) with batch size 128. Finally, we evaluate performance with the Top-1 classification accuracy on the test set.
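A minimal sketch of this training configuration, assuming standard torchvision components. The stated hyperparameters (SGD, lr 0.1, weight decay 5e-4, momentum 0.9, 300 epochs, batch size 128) come from the text; the cosine learning-rate schedule and `num_classes=10` (CIFAR-10) are assumptions, as the paper does not specify a schedule or any CIFAR-specific stem modification.

```python
import torch
import torchvision

# classification model used for all AL baselines and RDSS (Sec. D.5);
# num_classes=10 assumes CIFAR-10
model = torchvision.models.resnet18(num_classes=10)

# SGD settings as stated: lr=0.1, weight decay=5e-4, momentum=0.9
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# 300 epochs, batch size 128; the cosine schedule below is an assumption,
# since the paper does not specify a learning-rate schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
```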

D.6 Direct Comparison with AL/SSAL

The comparative results with AL/SSAL approaches are shown in Figure 4 and Figure 5, respectively. The specific values corresponding to these two figures are listed in Table 6; the baseline results are taken from [8], [12] and [17].

Table 6: Comparative results with AL/SSAL approaches.
Dataset CIFAR-10 CIFAR-100
Budget 40 250 500 1000 2000 4000 5000 7500 10000 400 2500 5000 7500 10000
Active Learning (AL)
CoreSet [36] - - - - - - 80.56 85.46 87.56 - - 37.36 47.17 53.06
VAAL [39] - - - - - - 81.02 86.82 88.97 - - 38.46 47.02 53.99
LearnLoss [60] - - - - - - 81.74 85.49 87.06 - - 36.12 47.81 54.02
MCDAL [8] - - - - - - 81.01 87.24 89.40 - - 38.90 49.34 54.14
Semi-Supervised Active Learning (SSAL)
CoreSetSSL [36] - - 90.94 92.34 93.30 94.02 - - - - - 63.14 66.29 68.63
CBSSAL [12] - - 91.84 92.93 93.78 94.55 - - - - - 63.73 67.14 69.34
TOD-Semi [17] - - - - - - 79.54 87.82 90.3 - - 36.97 52.87 58.64
Semi-Supervised Learning (SSL) with RDSS
FlexMatch+RDSS (Ours) 94.69 95.21 - - - 95.71 - - - 48.12 67.27 - - 73.21
FreeMatch+RDSS (Ours) 95.05 95.50 - - - 95.98 - - - 48.41 67.40 - - 73.13

According to the results, we make several observations: (1) AL approaches often necessitate labelling budgets significantly larger than that of RDSS, by a factor of 125 or more on CIFAR-10, primarily because AL paradigms depend solely on labelled samples not only for classification but also for feature learning. (2) SSAL and our method leverage unlabeled samples and surpass traditional AL approaches. However, this may not directly reflect the advantages of RDSS, as such performance gains could be inherently attributed to the SSL paradigm itself. Nonetheless, these experimental outcomes offer an insightful implication: SSL may be a more promising paradigm under scenarios with limited annotation budgets.

Figure 4: Comparison with AL/SSAL approaches on CIFAR-10.
Figure 5: Comparison with AL/SSAL approaches on CIFAR-100.

Appendix E Limitation

The choice of \alpha depends only on the number of unlabeled data points and the annotation budget, independent of the shape of the data distribution. This may reduce the effectiveness of RDSS on datasets with complicated distribution structures. Nevertheless, RDSS outperforms fixed-ratio approaches on the tested datasets under different budget settings.

Appendix F Potential Societal Impact

Positive societal impact. Our method ensures the representativeness and diversity of the selected samples and significantly improves the performance of SSL methods, especially under low-budget settings. This reduces the cost and time of data annotation and is particularly beneficial for resource-constrained research and development environments, such as medical image analysis.

Negative societal impact. When selecting representative data for analysis and annotation, the processing of sensitive data may be involved, increasing the risk of data leakage, especially in sensitive fields such as medical care and finance. It is worth noting that most algorithms applied in these sensitive areas are subject to this risk.