
Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

Chao Liang (cs.chaoliang@zju.edu.cn), Linchao Zhu (zhulinchao@zju.edu.cn), Zongxin Yang (yangzongxin@zju.edu.cn), Wei Chen (chenvis@zju.edu.cn), and Yi Yang (yangyics@zju.edu.cn)
Zhejiang University, Zhejiang, China 310027
Abstract.

We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images, given only a few clean labeled images. This problem is particularly practical because it reduces expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few-clean and many-noisy scenario, the class prototype can be severely biased by the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in noisy web image collections. Moreover, the relation between noisy and clean images is not learned end-to-end for class prototype generation, which results in suboptimal class prototypes. In this paper, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes, and then pulls them closer to each other. Our approach considers the diversity of noisy images via explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract useful information from the noisy image set. Evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.

Deep learning, learning from noisy labels, few-shot learning, prototypical learning

1. Introduction

Deep learning has revolutionized many computer vision tasks such as classification (He et al., 2016; Luo et al., 2015; Krizhevsky et al., 2012; Zhao et al., 2024), segmentation (He et al., 2017; Yang et al., 2021a, 2024b), image/video understanding (Zhu et al., 2022; Fan et al., 2020; Wang et al., 2023; Yang et al., 2024a; Zhang et al., 2024; Li et al., 2024), object detection (Ren et al., 2015; Redmon et al., 2016; Xu et al., 2021), and few-shot learning (Douze et al., 2018; Finn et al., 2017; Gidaris and Komodakis, 2018; Iscen et al., 2020; Siam et al., 2019; Snell et al., 2017; Vinyals et al., 2016; Zhang et al., 2020; He et al., 2022; Shi et al., 2023; Ravi and Larochelle, 2017; Rusu et al., 2018; He et al., 2020; Santoro et al., 2016). State-of-the-art visual systems are often exceptionally data-hungry and require massive amounts of well-annotated data. To mitigate the demand for labeled data, few-shot visual recognition aims to learn a model that generalizes to the end task with only a few labeled examples. Though the representation is learned from many available labeled images, the few-shot classifier is easily biased by the limited examples.

Figure 1. Relevant images exist in web images. Massive freely accessible web images can be obtained by search engines. These images are easily acquired but can be inevitably annotated with noisy labels.

As computational power continues to grow and web data scales up, some researchers have proposed using large-scale external data sources to enhance the few-shot classifier (Douze et al., 2018; Iscen et al., 2020). Often, a few closely related images exist among a large number of irrelevant noisy images (Figure 1). Prototypical learning is widely used for classification (Snell et al., 2017; Li et al., 2021a), where the class prototypes are representative images or features used to classify or identify other images. However, the presence of irrelevant noisy images can severely affect the generation of robust class prototypes. To improve the generalization of such few-shot learners, existing works have attempted to identify the relevance between the noisy images and the clean support image set, assigning high relevance scores to the relevant images in the noisy set. Once relevance is accurately established, the relevant images in the noisy set can be beneficial for learning noise-tolerant class prototypes.

Recently, Iscen et al. (Iscen et al., 2020) proposed to use a graph convolutional network (GCN) to address the problem of few clean and many noisy examples. The GCN is optimized with a binary classifier to discriminate clean from noisy images. By modeling with the graph structure, this method considers the relationships between clean and noisy examples, generating a relevance score for each noisy example based on its visual similarity to clean examples. The results of this GCN-based approach are very promising, demonstrating that noisy examples can benefit few-shot classifier learning. However, the approach neglects the considerable diversity within the noisy image set. It leverages only one global noise-tolerant prototype to model the noisy set at a coarse level, which fails to represent the complex noisy collections of diverse web images effectively. Hence, it might produce unsatisfactory relevance scores. Furthermore, using the binary cross-entropy loss as the training objective results in a discrepancy issue. The binary classifier treats all the web images with noisy labels as negative instances, forcing the model to produce low relevance scores for the relevant images in the noisy set, even though those visually similar examples can contribute to the learning of the few-shot classifier. Also, the relevance scores are not learned for the class prototype generation in an end-to-end manner, which leads to a suboptimal solution.

To address the challenges posed by the diverse noisy image set and the optimization discrepancy issue, we propose a similarity maximization loss named the SimNoiPro, for the few clean and many noisy problem. Our SimNoiPro involves two steps.

First, we build noise-tolerant hybrid prototypes to model the complex data collections by dividing the noisy set into a few groups based on their relevance scores. Typically, noisy and clean data hold distinct properties. Noisy data points are more widely dispersed in the feature space, so a single coarse noise-tolerant prototype can be severely biased away from the few clean samples and lead to large intra-class variance. The introduction of multiple noise-tolerant prototypes offers a more refined modeling of the large-scale diverse noisy set.

Second, we propose a similarity maximization loss to generate more compact and discriminative class prototypes. A good prototype should exhibit the characteristics of compactness and discrimination. This requires pulling the noise-tolerant prototypes and the clean prototype closer to each other, which can produce a tighter cluster and diminish intra-class variance. Also, our similarity maximization loss bridges the optimization between relevance scores and the class prototype generation in an end-to-end manner. It overcomes the optimization discrepancy issue caused by the binary classification loss, resulting in a more plausible and effective clean-noisy relation.

Our proposed approach yields promising results on two standard few-shot classification benchmarks, i.e., Low-shot Places365 and Low-shot ImageNet. Notably, we outperform (Iscen et al., 2020) by 4.4% and 3.5% in the 5-shot setting for Low-shot Places365 and Low-shot ImageNet, respectively.

We summarize our contributions in the following:

  • We propose noise-tolerant hybrid prototypes to handle the diversity of the noisy web data.

  • We propose a similarity maximization loss to bridge the optimization between the relevance scores and class prototype generation in an end-to-end manner, generating compact and discriminative class prototypes.

  • Extensive experiments on Low-shot Places365 and Low-shot ImageNet demonstrate the effectiveness of our proposed method, which outperforms other baselines consistently.

The rest of the paper is organized as follows. We first introduce the related works in Section 2. Section 3 reviews the previous GCN cleaning framework and details our proposed SimNoiPro. Section 4 presents performance evaluation and visualization, followed by the conclusion in Section 5.

2. Related Work

2.1. Few-shot Learning

In the past few years, there has been growing interest in few-shot learning, which aims to mitigate the demand for labeled data by learning from a few labeled training examples. The existing few-shot learning research (Douze et al., 2018; Finn et al., 2017; Gidaris and Komodakis, 2018; Iscen et al., 2020; Siam et al., 2019; Snell et al., 2017; Sung et al., 2018; Vinyals et al., 2016; Zhang et al., 2020) can be broadly classified into several categories. One of the promising directions is based on the meta-learning paradigm, also known as “learning to learn”. This paradigm focuses on capturing generic knowledge during the meta-training stage and then adapting rapidly to a completely new task. Specifically, meta-learning based methods follow the episodic paradigm in the few-shot regime. Among these methods, metric-based approaches and optimization-based approaches are the two main streams in the literature. Metric-based methods aim to enhance the discriminative features in the embedding space and learn a good distance function over them. For instance, Matching Network (Vinyals et al., 2016) leverages the attention mechanism to align the query and support examples. Prototypical Network (Snell et al., 2017) extends this approach by comparing the Euclidean distance between the class representations. Furthermore, RelationNet (Sung et al., 2018) proposes to learn a good metric using deep networks, while DeepEMD (Zhang et al., 2020) employs the Earth Mover's Distance to compute structural similarity. In optimization-based methods, the key idea is to adjust the optimization algorithm so that the model can learn better from limited data. Meta-LSTM (Ravi and Larochelle, 2017) proposes an LSTM-based meta-learner to optimize the classifier. MAML (Finn et al., 2017) aims to find good initial parameters from which fast convergence can be achieved given limited training examples.

Another class of few-shot learning approaches is based on transfer learning (Hariharan and Girshick, 2017; Gidaris and Komodakis, 2018; Douze et al., 2018). These methods leverage the large amount of data from base classes to learn a robust model, which is then finetuned on the few labeled training examples from the novel classes, providing the capability to recognize the novel classes. Hariharan et al. (Hariharan and Girshick, 2017) propose a generative method for data augmentation. Gidaris et al. (Gidaris and Komodakis, 2018) design an attention-based generator to dynamically predict the weights for the novel classes. Douze et al. (Douze et al., 2018) formulate the few-shot problem as a semi-supervised setting by introducing large-scale collections of unlabeled images.

2.2. Learn from Noisy Data

High-quality human-annotated data are costly and time-consuming to obtain (Zhai et al., 2022; Ricci et al., 2023). Recently, several works have resorted to large-scale web images from social media to facilitate the learning of deep neural networks. These images are easily acquired by search engines but are inevitably annotated with noisy labels. Deep neural networks are prone to overfitting on noisy labels (Zhang et al., 2016), and learning with too many noisy labels can impair the generalization of deep models. Existing noise-resistant methods can be primarily divided into three types: 1) label correction using the predictions from the deep models (Ma et al., 2018; Reed et al., 2014; Tanaka et al., 2018; Yi and Wu, 2019; Liang et al., 2023a); 2) sample selection by filtering noisy instances (Arazo et al., 2019; Sun et al., 2022; Liang et al., 2023b); 3) sample reweighting by assigning a confidence score to each noisy example (Ren et al., 2018; Shu et al., 2019; Jiang et al., 2018). In particular, (Ren et al., 2018) and (Shu et al., 2019) assume there is a small unbiased and clean validation set. This setting is similar to our work, but we follow the few-shot learning setting where deep models are pretrained on large data from base classes, and we focus more on classifier learning with limited clean and many noisy data in unseen domains. Both (Englesson and Azizpour, 2021) and (Iscen et al., 2022) consider the consistency of the network prediction for noise-robust learning. GJS (Englesson and Azizpour, 2021) encourages consistency around data points. NCR (Iscen et al., 2022) proposes an additional loss to penalize the divergence of the predictions from the neighbors in the feature space.

2.3. Prototypical Learning

The prototype can be considered the representation of a cluster of semantically similar instances. Prototype-based learning has been widely applied in learning from noisy labels (Li et al., 2021a), unsupervised learning (Caron et al., 2020; Li et al., 2021b), and especially few-shot learning (Snell et al., 2017; Tang and Yu, 2023). The class prototype reflects a simple inductive bias, which brings well-generalized performance in the limited-data regime. Given that the noisy web image set is diverse, a single global prototype cannot represent the whole noisy set well (Yang et al., 2021b).

3. Method

3.1. Preliminary

3.1.1. Problem statement

We aim to learn an unbiased few-shot classifier with the aid of additional large-scale web data with noisy labels. Generally, there are two stages in the few-shot classification.

The first stage is the representation learning phase. In this stage, we are given a large clean labeled dataset $\mathcal{D}_{base}=\{\mathcal{D}_{base}^{1},\mathcal{D}_{base}^{2},\ldots,\mathcal{D}_{base}^{L}\}$, where $L$ is the number of categories in the base set. $\mathcal{D}_{base}^{l}$ denotes the set of images for base class $l$: $\mathcal{D}_{base}^{l}=\{(x_{i}^{l},y_{i}^{l})\,|\,i=1,\ldots,n\}$, where $n$ is the number of training examples. The base set $\mathcal{D}_{base}$ is leveraged to learn a strong feature extractor $\Phi:x\mapsto\Phi(x)\in\mathbb{R}^{d}$. Here, $d$ denotes the dimension of the feature.

In the second stage, we receive a novel dataset $\mathcal{D}_{novel}$ with few clean labeled examples. The goal is to adapt the learned feature extractor to the novel dataset and train a robust classifier given only a few clean examples. We define $\mathcal{D}_{novel}=\{\mathcal{D}_{novel}^{1},\mathcal{D}_{novel}^{2},\ldots,\mathcal{D}_{novel}^{C}\}$, where $C$ is the number of classes in the novel set. For each novel class $c$, $\mathcal{D}_{novel}^{c}=\{(x_{i}^{c},y_{i}^{c})\,|\,i=1,\ldots,k\}$, where $k$ is the number of novel examples and $k\ll n$; this is usually called the $k$-shot setting. Note that the base classes in $\mathcal{D}_{base}$ have no overlap with the novel classes in $\mathcal{D}_{novel}$.

In this paper, we focus on the second stage of few-shot learning and tackle this challenge with an additional large-scale noisy dataset $\mathcal{D}_{noisy}$, collected particularly from the Internet. The problem is formulated as learning an unbiased classifier from many freely accessible web images with noisy labels given only a few clean labeled images. Combining the clean labeled data with the large amount of noisy data, we obtain the whole dataset $\widetilde{\mathcal{D}}_{novel}^{c}=\mathcal{D}_{novel}^{c}\cup\mathcal{D}_{noisy}^{c}$ for each novel class $c$. Note that $\mathcal{D}_{novel}^{c}$ is the image set with few clean labeled examples and $\mathcal{D}_{noisy}^{c}$ consists of a large number of images with noisy labels. We expect to identify the relevant images in the noisy set. Training with these relevant images can potentially enhance the generalization of the few-shot classifier.

We extract the features of examples in $\widetilde{\mathcal{D}}_{novel}^{c}$ using the learned feature extractor $\Phi$. The clean and noisy features are represented by the matrix $V^{c}=[\mathbf{v}_{1}^{c},\ldots,\mathbf{v}_{k}^{c},\ldots,\mathbf{v}_{N}^{c}]\in\mathbb{R}^{d\times N}$, where $\mathbf{v}_{i}^{c}=\Phi(x_{i}^{c})\in\mathbb{R}^{d}$, $x_{i}^{c}$ is a training example from $\widetilde{\mathcal{D}}_{novel}^{c}$, and $N$ is the number of training examples in $\widetilde{\mathcal{D}}_{novel}^{c}$. We assume the first $k$ features are from the clean set and the remaining ones are from the noisy set. For convenience, we omit the superscript $c$ when it can be inferred from the context.
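To make the notation concrete, the following is a minimal sketch (our illustration, not code released with the paper) of assembling the per-class feature matrix $V$ with the $k$ clean features placed first; the extractor `phi` and the image batches are assumed inputs.

```python
import torch

def build_feature_matrix(phi, clean_batch, noisy_batch):
    """clean_batch: (k, ...) clean images; noisy_batch: (N-k, ...) noisy
    images. Returns V of shape (d, N) with the clean features first."""
    with torch.no_grad():  # the extractor Phi stays frozen in this stage
        feats = torch.cat([phi(clean_batch), phi(noisy_batch)], dim=0)  # (N, d)
    return feats.t()       # (d, N), matching the matrix V above
```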

3.1.2. Review of GCN cleaning framework

Our work extends the GCN cleaning framework (Iscen et al., 2020) by directly bridging the optimization between the relevance scores and the noise-tolerant hybrid prototype generation in an end-to-end manner. We first review the relation modeling with the GCN.

Iscen et al. (Iscen et al., 2020) introduced a noise cleaning framework to tackle the few clean and many noisy problem. Their approach is divided into two stages: a graph cleaning stage and a classifier learning stage.

First, they applied a two-layered GCN to perform offline cleaning by predicting a class relevance score $r$ for each noisy example in $\mathcal{D}_{noisy}^{c}$. The class relevance scores $\mathbf{r}\in\mathbb{R}^{N}$ are learned independently for each novel class $c$:

(1) $\mathbf{r}=F_{\Theta}(\widetilde{A},V)=\mathrm{Sigmoid}(\Theta_{2}^{\top}[\Theta_{1}^{\top}V\widetilde{A}]_{+}\widetilde{A}),$

where $\Theta=\{\Theta_{1},\Theta_{2}\}$ denotes the GCN parameters, $[\cdot]_{+}=\mathrm{ReLU}(\cdot)$, and $\widetilde{A}$ is the normalized affinity matrix. The training process is constrained by a binary classification loss $L_{\Theta}$:

(2) $L_{\Theta}=-\frac{1}{k}\sum_{i=1}^{k}\log r_{i}-\frac{\lambda}{N-k}\sum_{i=k+1}^{N}\log(1-r_{i}),$

where $\lambda$ is a balancing hyperparameter. This loss aims to classify the clean labeled images as positive examples and treats all the images with noisy labels as negative examples.
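For concreteness, below is a minimal PyTorch sketch of Eqs. (1) and (2). This is our illustration rather than the authors' released code; the hidden width `h` and the construction of $\widetilde{A}$ outside this snippet are assumptions.

```python
import torch
import torch.nn.functional as F

class GCNCleaner(torch.nn.Module):
    """Two-layer GCN of Eq. (1): r = Sigmoid(Theta2^T [Theta1^T V A~]_+ A~)."""
    def __init__(self, d, h=16):
        super().__init__()
        self.theta1 = torch.nn.Parameter(torch.randn(d, h) * 0.01)
        self.theta2 = torch.nn.Parameter(torch.randn(h, 1) * 0.01)

    def forward(self, V, A_tilde):
        hidden = F.relu(self.theta1.t() @ V @ A_tilde)             # (h, N)
        logits = (self.theta2.t() @ hidden @ A_tilde).squeeze(0)   # (N,)
        return torch.sigmoid(logits)                               # scores r

def binary_cleaning_loss(r, k, lam=1.0):
    """Eq. (2): clean examples are positives, all noisy ones negatives."""
    eps = 1e-8
    clean_term = -torch.log(r[:k] + eps).mean()
    noisy_term = -lam * torch.log(1.0 - r[k:] + eps).mean()
    return clean_term + noisy_term
```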

In general, a class prototype $\mathbf{p}_{c}$ for category $c$ is defined as below:

(3) $\mathbf{p}_{c}=\frac{1}{r(\widetilde{\mathcal{D}}_{novel}^{c})}\sum_{i=1}^{N}r_{i}\mathbf{v}_{i},$

where the normalization term $r(\widetilde{\mathcal{D}}_{novel}^{c})=\sum_{i=1}^{N}r_{i}^{c}$. The prototype classifier $\mathbf{P}$ consists of $C$ prototypes, i.e., $\mathbf{P}=[\mathbf{p}_{1},\ldots,\mathbf{p}_{C}]\in\mathbb{R}^{d\times C}$.

In prototypical learning, each class produces a single prototype that serves as a representative feature for discriminative classification. In the few-clean and many-noisy classification setting, the unified prototype can be learned as the combination of a clean prototype and a noise-tolerant prototype based on Eq. 3:

(4) $\mathbf{p}_{c}=\mathbf{p}_{clean}+\mathbf{p}_{noise},$

where

(5) $\mathbf{p}_{clean}=\frac{1}{r(\widetilde{\mathcal{D}}_{novel}^{c})}\sum_{i=1}^{k}\mathbf{v}_{i},$
(6) $\mathbf{p}_{noise}=\frac{1}{r(\widetilde{\mathcal{D}}_{novel}^{c})}\sum_{i=k+1}^{N}r_{i}\mathbf{v}_{i}.$
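A short sketch of this decomposition (our illustration; consistent with Eqs. (3) and (5), clean examples carry an implicit relevance of 1):

```python
import torch

def class_prototype(V, r_noisy, k):
    """V: (d, N) features with the k clean columns first;
    r_noisy: (N-k,) relevance scores for the noisy examples."""
    norm = k + r_noisy.sum()                          # r(D~_novel^c)
    p_clean = V[:, :k].sum(dim=1) / norm              # Eq. (5)
    p_noise = (V[:, k:] * r_noisy).sum(dim=1) / norm  # Eq. (6)
    return p_clean + p_noise                          # Eq. (4), equal to Eq. (3)
```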

Second, a few-shot classifier is learned over the novel classes. Particularly, they used a cosine classifier to minimize the loss function $L(\widetilde{\mathcal{D}}_{novel};\mathbf{W})$:

(7) $L(\widetilde{\mathcal{D}}_{novel};\mathbf{W})=-\sum_{c=1}^{C}\frac{1}{r(\widetilde{\mathcal{D}}_{novel}^{c})}\sum_{i=1}^{|\widetilde{\mathcal{D}}_{novel}^{c}|}r_{i}^{c}\log\big(\boldsymbol{\sigma}(s\hat{\mathbf{W}}^{\top}\hat{\mathbf{v}}_{i}^{c})_{c}\big).$

Herein, $\boldsymbol{\sigma}(\cdot)$ is the softmax function and $s$ is the temperature parameter. Both the feature vectors $\mathbf{v}\in\mathbb{R}^{d}$ and the classifier $\mathbf{W}\in\mathbb{R}^{d\times C}$ are $\ell_{2}$-normalized, denoted as $\hat{\mathbf{v}}$ and $\hat{\mathbf{W}}$.
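The cosine-classifier objective of Eq. (7) can be sketched as follows (again our illustration under the same notation; the per-class feature and score lists are assumed inputs):

```python
import torch
import torch.nn.functional as F

def cosine_classifier_loss(V_list, r_list, W, s=15.0):
    """V_list[c]: (d, N_c) features of class c; r_list[c]: (N_c,) relevance
    scores; W: (d, C) classifier weights. Implements Eq. (7)."""
    W_hat = F.normalize(W, dim=0)            # l2-normalize classifier columns
    loss = torch.tensor(0.0)
    for c, (V, r) in enumerate(zip(V_list, r_list)):
        v_hat = F.normalize(V, dim=0)        # l2-normalize feature columns
        log_probs = F.log_softmax(s * W_hat.t() @ v_hat, dim=0)  # (C, N_c)
        loss = loss - (r * log_probs[c]).sum() / r.sum()         # class-c term
    return loss
```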

Although this approach achieved promising results over methods that only deal with clean data, one of its drawbacks is that noisy data are often diverse and one noise-tolerant prototype cannot represent the whole set. Meanwhile, the binary classification loss introduces a discrepancy between the graph cleaning process and the subsequent prototype-based classifier learning. There is no guarantee that $\mathbf{p}_{noise}$ learned with the class relevance scores $r$ would produce a meaningful prototype for classification.

Figure 2. Overall framework. First, SimNoiPro divides noisy examples into several groups by class relevance scores $r$ to produce noise-tolerant prototypes. Then, the similarity maximization loss pulls the noise-tolerant prototypes closer to the clean prototype in order to generate a more discriminative prototype for classification.

3.2. Learning from Noise-tolerant Prototypes

As discussed in Section 3.1.2, using one global noise-tolerant prototype cannot represent the complex noisy web data, and the binary classification loss (Eq. 2) in (Iscen et al., 2020) introduces a discrepancy between the training and the inference stage.

We address these obstacles by introducing noise-tolerant hybrid prototypes composed of one clean and multiple noise-tolerant prototypes, and by directly optimizing the relevance scores and the noise-tolerant prototype generation ($\mathbf{p}_{noise}$) in an end-to-end manner. First, given the diversity of the noisy set, we divide the noisy examples into a few groups based on their relevance scores and obtain multiple noise-tolerant prototypes (Section 3.2.1). Second, we propose a similarity maximization loss named SimNoiPro to pull the noise-tolerant prototypes closer to the clean prototype (Section 3.2.2). The overall framework is presented in Figure 2.

3.2.1. Noise-tolerant prototype generation

The diversity of the noisy images motivates us to treat them differently. In the noisy image set, there might be a few closely relevant images that are visually similar to the clean examples, while a large number of noisy images are expected to be irrelevant. The noisy features are thus scattered in the embedding space, and using one noise-tolerant prototype to represent the noisy set can incur a large intra-class variance. Based on this hypothesis, we propose to divide the noisy images into multiple groups and then generate multiple noise-tolerant prototypes. We define the noise-tolerant prototype generation procedure as $G$. Taking the noisy “feature-relevance score” pairs as input, $G$ produces $T$ noise-tolerant prototypes based on some criterion:

(8) $\{\mathbf{p}_{noise}^{t}\}_{t=1}^{T}=G(\{(\mathbf{v}_{i},r_{i})\}_{i=k+1}^{N}).$

Each noise-tolerant prototype $\mathbf{p}_{noise}^{t}$ is the representative “cluster center” of the corresponding noise group, and the members of each group share similar properties.

Following the above paradigm, we consider two types of noise-tolerant prototype generation procedure $G$.

Feature based clustering. One way to split the noisy images is to perform clustering on the visual features, e.g., k-means, where the noisy images are grouped into $T$ clusters $\{\mathbf{C}_{t}\}_{t=1}^{T}$. The noise-tolerant prototype for each cluster $\mathbf{C}_{t}$ can be expressed as:

(9) $\mathbf{p}_{noise}^{t}=\sum_{i=k+1}^{N}\mathbbm{1}(\mathbf{v}_{i}\in\mathbf{C}_{t})\,r_{i}\mathbf{v}_{i}.$

Since the noisy features are pre-computed in our setting, clustering can be done offline before the noise cleaning. One disadvantage is that the groups are fixed and cannot be further adjusted during the graph cleaning stage.
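A minimal sketch of this clustering-based variant (our illustration; k-means via scikit-learn is an assumed implementation choice):

```python
import torch
from sklearn.cluster import KMeans

def kmeans_noise_prototypes(V_noisy, r, T=5):
    """V_noisy: (d, N-k) noisy features; r: (N-k,) relevance scores.
    Groups are fixed offline; each prototype follows Eq. (9)."""
    labels = KMeans(n_clusters=T, n_init=10).fit_predict(V_noisy.t().numpy())
    labels = torch.from_numpy(labels)
    protos = []
    for t in range(T):
        mask = (labels == t).float()                       # 1(v_i in C_t)
        protos.append((V_noisy * (mask * r)).sum(dim=1))   # Eq. (9)
    return protos
```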

Relevance score based separation. In this paper, we introduce a general strategy to divide noisy images into multiple groups. We group the noisy images based on the class relevance scores $\mathbf{r}$ (Eq. 1) generated by the same graph convolutional network $F_{\Theta}(\widetilde{A},V)$. For each group $t\in[1,\ldots,T]$, its relevance score window $w_{t}$ is denoted as:

(10) $w_{t}=\left\{r_{i}\,\middle|\,r_{min}+\frac{r_{max}-r_{min}}{T}(t-1)\leq r_{i}<r_{min}+\frac{r_{max}-r_{min}}{T}\,t\right\},$

where $r_{max}$ and $r_{min}$ denote the maximum and minimum relevance scores, respectively. In each noisy window $w_{t}$, we define the corresponding noise-tolerant prototype as the weighted average of the features of all noisy images. The noise-tolerant prototype for window $w_{t}$ (Eq. 10) is denoted as:

(11) $\mathbf{p}_{noise}^{t}=\sum_{i=k+1}^{N}\mathbbm{1}(r_{i}\in w_{t})\,r_{i}\mathbf{v}_{i}.$

The generated noise-tolerant prototype $\mathbf{p}_{noise}^{t}$ can be regarded as the representative prototype for all noisy images in window $w_{t}$. Because the relevance scores are learned from the relations between the noisy images and the clean examples, this separation takes more factors into consideration and produces better groupings. Note that the groups change dynamically during training. We adopt this strategy in most of our experiments.
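The window-based separation of Eqs. (10)-(11) can be sketched as follows (our illustration; placing the maximum score into the last window is a boundary choice that Eq. (10) leaves implicit):

```python
import torch

def window_noise_prototypes(V_noisy, r, T=5):
    """V_noisy: (d, N-k) noisy features; r: (N-k,) relevance scores."""
    r_min, r_max = r.min(), r.max()
    width = (r_max - r_min) / T
    protos = []
    for t in range(1, T + 1):
        lo, hi = r_min + width * (t - 1), r_min + width * t
        # Half-open windows as in Eq. (10); the last window is closed.
        mask = (r >= lo) & (r < hi) if t < T else (r >= lo)
        protos.append((V_noisy * (mask.float() * r)).sum(dim=1))  # Eq. (11)
    return protos
```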

The introduction of multiple noise-tolerant prototypes provides a separation for different noisy examples, which enables the modeling of the complex noisy web data in a finer manner.

3.2.2. SimNoiPro Loss

Given the noise-tolerant hybrid prototypes consisting of $T$ diverse noise-tolerant prototypes and one clean prototype, $\{\mathbf{p}_{noise}^{1},\ldots,\mathbf{p}_{noise}^{t},\ldots,\mathbf{p}_{noise}^{T},\mathbf{p}_{clean}\}$, our objective is to yield a compact and discriminative prototype $\mathbf{p}_{c}$ for few-shot image classification. The expected prototype is represented as the sum of the noise-tolerant prototypes and the clean prototype. However, since it is computed from scarce clean examples and large-scale noisy examples, it is hard to obtain an unbiased prototype that supports the class decision boundary if no constraints are imposed between the noise-tolerant prototypes and the clean prototype. A discriminative prototype is expected to exhibit low intra-class variance in the few clean and many noisy scenario. Since the clean examples are more reliable than the noisy ones, it is reasonable to bias the expected prototype towards the clean prototype.

We introduce a similarity maximization loss to pull the noise-tolerant prototypes closer to the clean prototype. The cosine distance is used to measure the similarity between two prototypes $\mathbf{p}_{1}$ and $\mathbf{p}_{2}$:

(12) $M(\mathbf{p}_{1},\mathbf{p}_{2})=-\frac{\mathbf{p}_{1}}{\lVert\mathbf{p}_{1}\rVert_{2}}\cdot\frac{\mathbf{p}_{2}}{\lVert\mathbf{p}_{2}\rVert_{2}},$

where $\lVert\cdot\rVert_{2}$ is the $\ell_{2}$-norm. Our goal is to minimize the negative similarity between the two prototypes. This distance minimization enforces the learned noise-tolerant prototypes to be relevant to the specific category. It overcomes the discrepancy between the relevance score generation process and the prototype-based classification stage, so as to learn a discriminative unified prototype from the clean prototype and the noise-tolerant prototypes in an end-to-end manner.

It is worth emphasizing that the noise-tolerant prototypes might hold different degrees of bias against the clean prototype. Specifically, we define the SimNoiPro loss based on Eq. 12 as below:

(13) $L=\frac{1}{T}\sum_{t=1}^{T}\alpha_{t}M(\mathbf{p}_{clean},\mathbf{p}_{noise}^{t})+\beta M(\mathbf{p}_{clean},\mathbf{p}_{noise}),$

where $\mathbf{p}_{clean}$ is the clean prototype in Eq. 5, $\mathbf{p}_{noise}$ is the global noise-tolerant prototype in Eq. 6, $\alpha_{t}$ is a scaling factor that controls the relative weight of the similarity between the clean prototype and each local noise-tolerant prototype, and $\beta$ is a hyperparameter. In general, a noise-tolerant prototype that exhibits a higher degree of similarity to the clean prototype should be assigned a relatively larger proportion in the loss function.
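Putting Eqs. (12) and (13) together, a minimal sketch of the SimNoiPro loss (our illustration; `alpha` is the length-$T$ weight sequence and `beta` the scalar weight discussed above):

```python
import torch
import torch.nn.functional as F

def neg_cos(p1, p2):
    """Eq. (12): negative cosine similarity between two prototypes."""
    return -F.cosine_similarity(p1, p2, dim=0)

def simnoipro_loss(p_clean, noise_protos, p_noise_global, alpha, beta):
    """Eq. (13); our ablation prefers an increasing alpha sequence."""
    T = len(noise_protos)
    local = sum(a * neg_cos(p_clean, p)
                for a, p in zip(alpha, noise_protos)) / T
    return local + beta * neg_cos(p_clean, p_noise_global)
```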

Discussion. Iscen et al. (Iscen et al., 2020) used a binary classification loss during training to classify the noisy examples as negative instances and the clean examples as positive instances. This can assign low relevance scores to relevant images in the noisy set. Moreover, the relevance scores and the class prototype generation are separated into two stages, and no constraints are imposed to maintain the quality of the prototype, which results in a suboptimal class prototype. In contrast, our SimNoiPro is easy to implement and directly optimizes the relevance scores and the class prototype generation in an end-to-end manner, which can produce better relevance scores in a more plausible way. In the experiments, we find that SimNoiPro works better in the low-data regime, but only comparably in the 1-shot setting. The clean prototype formed from a single clean image might incur large intra-class variance; e.g., if the clean image is a photo of cucumbers growing on vines, it might be difficult to identify noisy images of cucumber slices in a salad as relevant.

4. Experiments

4.1. Benchmarks and Evaluation

4.1.1. Datasets

We evaluate our method on two benchmarks: Low-shot Places365 (Zhou et al., 2017) and Low-shot ImageNet (Hariharan and Girshick, 2017).

Table 1. Comparisons between different methods on Low-shot Places365. Our method outperforms other baselines consistently. Iscen et al. (Iscen et al., 2020) is reimplemented by ourselves. Best results are highlighted.
Methods TOP-5 ACCURACY ON NOVEL CLASSES
k=1 k=2 k=5 k=10 k=20
ResNet-10 - Few Clean Data
Class proto. (Gidaris and Komodakis, 2018) 28.7 ± 1.12 38.0 ± 0.37 50.5 ± 0.51 57.9 ± 0.35 62.3 ± 0.25
ResNet-10 - Few Clean & Many Noisy Data
β-weighting, β=1 (Iscen et al., 2020) 44.0 ± 0.34 45.7 ± 0.22 48.4 ± 0.31 50.0 ± 0.12 50.8 ± 0.25
GJS (Englesson and Azizpour, 2021) 44.4 ± 0.40 45.8 ± 0.67 49.2 ± 0.32 55.8 ± 0.33 61.6 ± 0.28
NCR (Iscen et al., 2022) 46.0 ± 0.40 46.9 ± 0.51 50.7 ± 0.17 56.5 ± 0.18 62.4 ± 0.30
Label Propagation (Iscen et al., 2020) 39.6 ± 0.78 46.5 ± 0.22 54.8 ± 0.42 59.6 ± 0.11 62.0 ± 0.14
MLP (Iscen et al., 2020) 46.9 ± 0.78 50.1 ± 0.38 55.4 ± 0.29 59.2 ± 0.26 61.5 ± 0.31
Iscen et al. (Iscen et al., 2020) 48.71 ± 0.53 51.13 ± 0.40 54.26 ± 0.29 59.92 ± 0.34 63.84 ± 0.22
Ours 49.20 ± 0.27 52.83 ± 0.40 58.60 ± 0.20 61.59 ± 0.28 64.33 ± 0.28

Low-shot Places365 (Zhou et al., 2017) is divided into 183 test and 182 validation classes by (Iscen et al., 2020). Note that we treat all classes in Places365 as novel categories.

Low-shot ImageNet (Hariharan and Girshick, 2017) is created from ImageNet. The 1000 classes of the ImageNet dataset are divided into 389 base classes and 611 novel classes. For cross-validation, this benchmark is split into two disjoint sets, where the test set contains 196 base classes and 311 novel classes, and the remaining classes form the validation set.

Noisy data statistics. In our experiments, the above two datasets are considered the clean sets, and both are extended by large-scale noisy images from the YFCC100M dataset (Thomee et al., 2016). YFCC100M consists of around 100M images collected from Flickr, where each image is attached to a text description. Following (Iscen et al., 2020), noisy images are selected if their text annotations contain the name of a novel class. In the end, the Low-shot Places365 and Low-shot ImageNet datasets are supplemented with 9,720,957 and 3,744,994 extra noisy images, respectively.

4.1.2. Evaluation metric

Following the standard evaluation protocol used in the few-shot setting (Hariharan and Girshick, 2017; Iscen et al., 2020), we perform 5 trials under each $k\in\{1,2,5,10,20\}$-shot setup. For each trial, we sample $k$ clean images per class from the clean set and combine them with all the noisy data to form the training dataset. We report the average top-5 accuracy over the 5 trials on the novel classes in the test set.
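As a reference for the metric (a sketch of our own, not evaluation code from the paper), the top-5 accuracy of one trial can be computed as:

```python
import torch

def top5_accuracy(logits, labels):
    """logits: (num_images, C) classifier scores; labels: (num_images,)."""
    top5 = logits.topk(5, dim=1).indices               # (num_images, 5)
    hits = (top5 == labels.unsqueeze(1)).any(dim=1)    # label among top-5?
    return hits.float().mean().item()
```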

4.1.3. Training details

In our experiments, features are extracted from the ResNet-10 and ResNet-50 models of (Iscen et al., 2020). The feature extractor is trained on the base classes of Low-shot ImageNet. The dimension $d$ of the input features is 512 for ResNet-10 and 256 for ResNet-50 (after PCA). For the graph cleaning stage, similar to (Iscen et al., 2020), we use Adam with a weight decay of 0.0005 as our optimizer. The initial learning rate is set to 0.1 for 100 iterations and decays by 0.1 every 30 iterations. We use $T=5$ groups for all shot setups. The hyperparameters $\alpha$ and $\beta$ are tuned on the validation set: we cross-validate their possible values in the interval $[0.01,1.0]$, with a step of 0.01 for $[0.01,0.1]$ and 0.1 otherwise. For classifier learning, we initialize the cosine classifier with the class prototype. The cosine classifier is trained with a batch size of 512 and optimized with Adam for 50 epochs. The learning rate starts from 0.1 and is reduced to 0.001 with cosine annealing (Loshchilov and Hutter, 2016). We set the temperature $s=15$.
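For readability, the classifier-learning schedule above can be sketched as follows (our illustration; the classifier dimensions are placeholders and the training loop is omitted):

```python
import torch

C = 311                                            # e.g., novel classes in the test split
classifier = torch.nn.Linear(512, C, bias=False)   # cosine classifier weights W
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.1)
# Anneal the learning rate from 0.1 down to 0.001 over the 50 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=0.001)
s = 15.0                                           # softmax temperature
```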

Table 2. Comparisons between different methods on Low-shot ImageNet. Our method outperforms other baselines consistently. Iscen et al. (Iscen et al., 2020) is reimplemented by ourselves. Best results are highlighted.
Methods TOP-5 ACCURACY ON NOVEL CLASSES
k=1 k=2 k=5 k=10 k=20
ResNet-10 - Few Clean Data
ProtoNets (Snell et al., 2017) 39.3 54.4 66.3 71.2 73.9
Class proto. (Gidaris and Komodakis, 2018) 45.3 ± 0.65 57.1 ± 0.37 69.3 ± 0.32 74.8 ± 0.20 77.8 ± 0.24
Class proto. w/Att. (Gidaris and Komodakis, 2018) 45.8 ± 0.74 57.4 ± 0.38 69.6 ± 0.27 75.0 ± 0.29 78.2 ± 0.23
ResNet-10 - Few Clean & Many Noisy Data
Similarity (Iscen et al., 2020) 49.8 ± 0.29 56.3 ± 0.27 64.2 ± 0.32 68.4 ± 0.14 71.2 ± 0.12
β-weighting, β=1 (Iscen et al., 2020) 56.1 ± 0.06 56.4 ± 0.08 57.1 ± 0.05 57.7 ± 0.08 58.7 ± 0.06
β-weighting, β* (Iscen et al., 2020) 55.6 ± 0.24 58.3 ± 0.14 63.4 ± 0.25 67.5 ± 0.34 71.0 ± 0.22
GJS (Englesson and Azizpour, 2021) 65.5 ± 0.33 66.9 ± 0.24 69.3 ± 0.32 73.9 ± 0.28 77.9 ± 0.09
NCR (Iscen et al., 2022) 66.7 ± 0.24 67.8 ± 0.12 70.5 ± 0.20 74.4 ± 0.26 78.4 ± 0.22
Label Propagation (Iscen et al., 2020) 62.6 ± 0.35 67.0 ± 0.41 74.6 ± 0.30 76.3 ± 0.23 77.7 ± 0.18
MLP (Iscen et al., 2020) 63.6 ± 0.41 68.8 ± 0.42 73.7 ± 0.25 75.6 ± 0.21 77.6 ± 0.21
Iscen et al. (Iscen et al., 2020) 72.88 ± 0.44 74.94 ± 0.20 76.07 ± 0.21 78.78 ± 0.25 81.02 ± 0.30
Ours 72.92 ± 0.26 75.60 ± 0.36 78.83 ± 0.31 80.71 ± 0.21 81.06 ± 0.22
ResNet-50 - Few Clean Data
ProtoNets (Snell et al., 2017) 49.6 64.0 74.4 78.1 80.0
Class proto. (Gidaris and Komodakis, 2018) 50.1 ± 0.62 62.9 ± 0.43 74.9 ± 0.10 79.5 ± 0.25 82.1 ± 0.34
ResNet-50 - Few Clean & Many Noisy Data
GJS (Englesson and Azizpour, 2021) 73.0 ± 0.36 74.9 ± 0.35 77.3 ± 0.32 81.4 ± 0.28 84.6 ± 0.15
NCR (Iscen et al., 2022) 74.7 ± 0.21 76.5 ± 0.14 79.4 ± 0.29 82.5 ± 0.24 85.4 ± 0.17
Iscen et al. (Iscen et al., 2020) 79.57 ± 0.30 81.48 ± 0.32 81.64 ± 0.26 84.75 ± 0.18 87.06 ± 0.14
Ours 80.30 ± 0.43 82.77 ± 0.24 85.17 ± 0.34 86.81 ± 0.10 87.24 ± 0.13

4.2. Evaluation Results

4.2.1. Baseline Setups

We compare our method with several baselines: (1) Class proto. (Gidaris and Komodakis, 2018): the class prototype is computed as the mean of the clean feature embeddings. (2) ProtoNets (Snell et al., 2017): a meta-learning approach for few-shot classification. (3) β-weighting (Iscen et al., 2020): the relevance score is set to β. (4) Label Propagation (Iscen et al., 2020): the relevance scores are obtained by solving a linear system. (5) MLP (Iscen et al., 2020): this model learns a nonlinear mapping for assigning relevance scores. (6) Similarity (Iscen et al., 2020): the relevance score is the cosine similarity between the data point and the class prototype. (7) GJS (Englesson and Azizpour, 2021): a generalized Jensen-Shannon divergence loss for learning with noisy labels. (8) NCR (Iscen et al., 2022): a neighbor consistency loss for combating noisy labels. (9) Iscen et al. (Iscen et al., 2020): a recently proposed method based on a graph convolutional network for learning with few clean and many noisy labels. Please refer to (Iscen et al., 2020) for more details.

4.2.2. Quantitative analysis

We report the top-5 accuracy on Low-shot Places365 in Table 1. β-weighting takes effect when clean images are limited but performs worse than Class proto. when enough clean samples are provided. GJS and NCR obtain better results by constraining the network to output consistent predictions for noisy labels, but they still lag behind our method by 9.4% and 7.9%, respectively, in the 5-shot setting. By measuring the relations between noisy and clean data, Label Propagation and MLP improve considerably in the low-shot settings, although in the 20-shot setting their improvement over the class prototype is marginal. We observe that our method consistently outperforms the other methods, especially in the 2/5/10-shot setups, where the performance gaps relative to Iscen et al. are +1.7%, +4.3%, and +1.6%, respectively. In the 5-shot setting, our SimNoiPro achieves 58.60% accuracy compared to 54.26% for Iscen et al. This indicates that we can learn a better few-shot classifier with the relevance scores generated by our consistent learning objective.

Table 2 presents the top-5 accuracy of different methods on Low-shot ImageNet. First, learning with additional noisy data brings significant improvement in the few-shot setting, especially in the 1-shot case, where we gain more than 20% in performance. Even the simple Similarity baseline improves accuracy by 4% after using noisy data. This confirms that noisy data can facilitate the learning of the few-shot classifier if we can mine the relevant images and measure their similarities with clean ones. Second, compared to other methods that also use additional noisy data to enhance classifier learning, our method achieves better performance, most notably in the 5-shot setup, where our model reaches an average top-5 accuracy of 78.83% with the ResNet-10 backbone, while the Iscen et al. method (Iscen et al., 2020) only reaches 76.07%. Our SimNoiPro directly optimizes the class relevance scores for noise-tolerant prototype generation, so it can overcome the inconsistent optimization issue in the baseline method and produce a more discriminative prototype for classification. Note that our SimNoiPro in the 5-shot setting even offers performance comparable to Iscen et al. given 10 clean images (78.78%). This is very beneficial and efficient for real-world deployment because it indicates that a few well-annotated clean images are enough to identify relevant images from a noisy set and improve the generalization of the few-shot learner. We also notice that our method achieves comparable results under the $k\in\{1,20\}$-shot settings. The comparable performance in the 1-shot setting might be due to the lack of a representative clean prototype: given only one clean image, noise cleaning can be much more biased. For the 20-shot setting, the highly discriminative prototype composed of 20 clean images might leave limited room for improvement. Last, if we use a stronger backbone such as ResNet-50, more discriminative features are extracted, which contributes to better performance. Our approach yields relative improvements of 3.5% and 2.1% over the strong Iscen et al. baseline in the 5-shot and 10-shot settings, respectively. These results suggest that our SimNoiPro works better by introducing multiple noise-tolerant prototypes to model the diversity of the noisy web image set and by incorporating the class prototype generation into the noise cleaning procedure to produce more plausible relevance scores. The resulting prototypes are more compact and discriminative, with lower intra-class variance, and training with the cosine classifier further boosts performance.

4.3. Ablation Study

In this subsection, we conduct several ablation studies: (1) classification accuracy of the class prototype; (2) the number of noise groups; (3) the type of noise-tolerant prototype generation; (4) replacing the similarity maximization loss with a similarity minimization loss.

Table 3. Class prototype classification. Our method can generate more discriminative and compact prototypes for performance improvement. We report average top-5 accuracy under the 5-shot setting. Iscen et al. (Iscen et al., 2020) is reimplemented by ourselves.
Methods Low-shot Places365 Low-shot ImageNet
Iscen et al. (Iscen et al., 2020) 53.80 ± 0.26 73.76 ± 0.20
Ours 57.16 ± 0.45 75.38 ± 0.28

4.3.1. Classification accuracy of the class prototype

We directly leverage the class prototype to perform classification: each test image is classified by nearest-prototype matching. As seen in Table 3, our method outperforms the baseline on both Low-shot Places365 and Low-shot ImageNet, by 3.4% and 1.6%, respectively. Our SimNoiPro enables better relevance score generation, which results in more discriminative and compact prototypes.

4.3.2. Effect of the number of noise-tolerant prototypes

We investigate the influence of the number of noise groups on the performance in our few-shot setting. Here, we evaluate our method under the 5-shot setting. We divide the noisy data into $T\in\{1,2,3,4,5,6\}$ groups, and each group is assigned an equal weight, i.e., $\alpha_{t}$ is set to 1.0 for all groups. The results on Low-shot ImageNet are shown in Figure 3. We observe that the top-5 classification accuracy improves as we increase the number of noise groups but saturates at $T=4$. This indicates that SimNoiPro is not sensitive to the number of noise groups once it exceeds 4. Compared to the case of $T=1$, more noise-tolerant prototypes improve the performance, showing that the introduction of multiple noise-tolerant prototypes plays an important role in modeling the diverse noisy set.

Figure 3. Effect of the number of noise-tolerant prototypes. More noise-tolerant prototypes can lead to better performance before saturation.
Figure 4. Effect of the hyperparameter setting in our SimNoiPro loss. Both global and local terms are important. Increasing $\alpha_{t}$ is preferred.
Methods TOP-5 ACCURACY ON NOVEL CLASSES
$\beta=0$ 66.59 ± 0.43
$\alpha_{t}=0$ 72.68 ± 0.31
Decreasing $\alpha_{t}$ 73.86 ± 0.28
Equal $\alpha_{t}$ 74.70 ± 0.22
Increasing $\alpha_{t}$ 75.38 ± 0.28

4.3.3. Effect of the hyperparameter setting in our SimNoiPro loss

We study the effect of each hyperparameter by setting $\alpha_{t}=0$ or $\beta=0$ individually. The results are shown in Figure 4. We find that removing either term degrades the performance: the accuracy drops by 8.8% when $\beta=0$ and by 2.7% when $\alpha_{t}=0$. Besides, we perform an ablation on the setup of the sequence $\{\alpha_{t}\}$, comparing three types: (1) decreasing $\alpha_{t}$; (2) equal $\alpha_{t}$; (3) increasing $\alpha_{t}$. Among them, increasing $\alpha_{t}$ achieves the best result. This confirms our hypothesis that a noise-tolerant prototype should be assigned a relatively larger proportion if it exhibits a higher degree of similarity to the clean prototype, as discussed in Section 3.2.2.

4.3.4. Effect of the type of noise-tolerant prototype generation

In this ablation, we consider the two types of noise-tolerant prototype generation: feature based clustering and relevance score based separation. For feature based clustering, k-means is applied to cluster the pre-computed noisy features into 5 groups. We adopt hyperparameter configurations similar to the relevance score based method and ensure that a noise-tolerant prototype closer to the clean prototype is assigned a higher weight. Table 4 shows the comparison between the two ways of noise-tolerant prototype generation on Low-shot ImageNet. The clustering based method performs worse than the relevance score based method, although the gap shrinks when more clean images are available. As discussed in Section 3.2.1, clustering is performed offline and the resulting noise groups are fixed in the graph cleaning stage. The separation is determined solely by the geometry of the feature space and cannot benefit from the relations learned by the graph convolutions, which leads to performance degradation. On the contrary, the relevance score based generation adaptively produces customized noise groups, which are much more robust to the influence of irrelevant images. Therefore, we apply the relevance score based noise-tolerant prototype generation in our experiments.

4.3.5. Similarity minimization vs. Similarity maximization

If the noisy set is quite diverse yet contains many closely relevant images, pulling the noise-tolerant prototypes closer to a less representative clean prototype might cause overfitting. We therefore investigate the effect of replacing similarity maximization with minimization. The similarity minimization loss pushes the noise-tolerant prototypes away from the clean prototype, preventing the final combined prototype from being too close to the clean prototype. We validate the similarity minimization loss with the prototype classifier under different shot settings. The top-5 accuracy results on Low-shot ImageNet are presented in Table 5. We observe that the performance drops substantially in every shot setting; in the 2-shot case, the similarity minimization method lags 52% behind the similarity maximization method. As more clean images are given, the gap narrows. This suggests that many irrelevant images exist in the noisy set and that our similarity maximization loss can effectively select the relevant images and assign them higher relevance scores to build a more discriminative classifier. Meanwhile, our learned relevance scores measure the clean-noisy relations better, which helps generate a more compact prototype.

Table 4. Effect of noise-tolerant prototype generation methods. Relevance score based separation exhibits the advantage of producing customized and robust groups adaptively.
Methods TOP-5 ACCURACY ON NOVEL CLASSES
k=1 k=2 k=5 k=10 k=20
ResNet-10 - Few Clean & Many Noisy Data
Feature based clustering 58.33 ± 0.43 66.27 ± 0.38 73.32 ± 0.23 75.87 ± 0.21 78.26 ± 0.16
Relevance score based separation 67.64 ± 0.38 70.98 ± 0.25 75.38 ± 0.28 76.98 ± 0.20 78.39 ± 0.14
Table 5. Comparison between the similarity minimization and similarity maximization methods. The similarity maximization loss learns better clean-noisy relations that help the generation of compact prototypes.
Methods TOP-5 ACCURACY ON NOVEL CLASSES
k=1 k=2 k=5 k=10 k=20
ResNet-10 - Few Clean & Many Noisy Data
Similarity minimization 23.38 ± 0.60 18.91 ± 0.44 34.28 ± 0.99 44.31 ± 1.04 63.74 ± 0.65
Similarity maximization 67.64 ± 0.38 70.98 ± 0.25 75.38 ± 0.28 76.98 ± 0.20 78.39 ± 0.14
Figure 5. Class relevance $r$ distribution comparison on Low-shot Places365. (a) is for class 2. (b) is for class 10. (c) is for class 360. Our SimNoiPro reveals the relative importance of noisy features.
Figure 6. Class relevance $r$ distribution comparison on Low-shot ImageNet. (a) is for class 113. (b) is for class 837. (c) is for class 942. Our SimNoiPro reveals the relative importance of noisy features.

4.4. Visualization and Analysis

In this subsection, we show several qualitative results: (1) visualization of the relevance score distribution; (2) visualization of representative noisy images.

4.4.1. Distribution of relevance scores

Instead of using a binary classification loss that treats the clean data as positive instances and the noisy data as negative ones (Iscen et al., 2020), our SimNoiPro directly optimizes the relevance scores for the noise-tolerant prototype generation in an end-to-end manner. We compare the distribution of the class relevance $r$ generated by our SimNoiPro against the baseline method (Iscen et al., 2020). In this experiment, we divide the interval $[r_{min},r_{max}]$ into 10 equal bins and count the number of noisy examples per bin in the first trial under the 5-shot setting.
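The binning itself is straightforward; a NumPy sketch (our illustration):

```python
import numpy as np

def relevance_histogram(r, num_bins=10):
    """r: 1-d array of relevance scores for one class's noisy set."""
    counts, edges = np.histogram(r, bins=num_bins, range=(r.min(), r.max()))
    return counts, edges
```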

The visualization results for Low-shot Places365 and Low-shot ImageNet are illustrated in Figures 5 and 6, respectively. In Figure 5, we select three typical classes on Low-shot Places365 for which our method obtains higher accuracy. For the baseline method (Iscen et al., 2020), most noisy data fall into the 0-10% interval, indicating that its binary classification loss simply pushes the class relevance scores of the noisy data close to 0; there is then no guarantee of producing a discriminative unified prototype for classification. In contrast, our SimNoiPro regards each noisy example as one of the components of the noise-tolerant prototypes. As a result, the output of our model reveals the relative importance of noisy features and contributes directly to the unified prototype used in the subsequent classification stage. In particular, the distribution of $r$ produced by our method differs markedly from the baseline on class 2. Figure 6 presents more visualization results on Low-shot ImageNet, where the relevance score distribution of the baseline model is more centralized while our method produces a more diverse distribution.

Figure 7. Qualitative results on Low-shot Places365. Left: one of the clean images with its class name “wind farm” and class id 360. Right: the 5 noise groups generated by our SimNoiPro. Each noise group mostly shares similar visual patterns. Each noisy image is annotated with its class relevance score $r$.
Figure 8. Qualitative results on Low-shot ImageNet. Left: one of the clean images with its class name “afghan hound” and class id 160. Right: the 5 noise groups generated by our SimNoiPro. Each noise group mostly shares similar visual patterns. Each noisy image is annotated with its class relevance score $r$.

4.4.2. Qualitative analysis

We show representative noisy images with their corresponding class relevance scores in the different noise groups produced by our SimNoiPro. The results on Low-shot Places365 and Low-shot ImageNet are presented in Figures 7 and 8. Figure 7 depicts the results for the class “wind farm” on Low-shot Places365, and Figure 8 presents the noise groups for the class “afghan hound”. Here, the noise groups are divided by the class relevance scores; note that our relevance score represents relative importance. For each noise group, we randomly sample several representative noisy images. We notice that each group mostly shares similar visual semantics, except noise group 1, which contains many irrelevant images. Our method assigns higher scores to images that look very similar to the given clean image. These results support our basic motivation in Section 3.2.1: as the noisy set is quite diverse, one coarse noise-tolerant prototype can fail to represent the complex noisy data collections, whereas our method models the noisy data well by introducing multiple prototypes. The generated relevance scores are more plausible.

5. Conclusion

In this paper, we introduce SimNoiPro, a similarity maximization loss, to learn a robust few-shot classifier by leveraging large-scale noisy web data. Our approach differs from previous methods that formulate noisy data cleaning as a binary classification problem, which ignores the diverse nature of noisy web images and can lead to a discrepancy issue when applied to prototype-based classification. SimNoiPro introduces noise-tolerant hybrid prototypes to provide finer modeling of the diverse noisy set and enables end-to-end learning by pulling the noise-tolerant prototypes closer to the clean prototype. We extensively evaluate SimNoiPro on Low-shot Places365 and Low-shot ImageNet and demonstrate that it outperforms other methods, showcasing its effectiveness.

References

  • Arazo et al. (2019) Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. 2019. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning (ICML). PMLR, 312–321.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Advances in Neural Information Processing Systems (NIPS), Vol. 33. Curran Associates, Inc., 9912–9924.
  • Douze et al. (2018) Matthijs Douze, Arthur Szlam, Bharath Hariharan, and Hervé Jégou. 2018. Low-Shot Learning With Large-Scale Diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Englesson and Azizpour (2021) Erik Englesson and Hossein Azizpour. 2021. Generalized jensen-shannon divergence loss for learning with noisy labels. Advances in Neural Information Processing Systems (NIPS) 34 (2021), 30284–30297.
  • Fan et al. (2020) Hehe Fan, Linchao Zhu, Yi Yang, and Fei Wu. 2020. Recurrent attention network with reinforced generator for visual dialog. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16, 3 (2020), 1–16.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML). PMLR, 1126–1135.
  • Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. 2018. Dynamic Few-Shot Visual Learning Without Forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hariharan and Girshick (2017) Bharath Hariharan and Ross Girshick. 2017. Low-Shot Visual Recognition by Shrinking and Hallucinating Features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • He et al. (2022) Jun He, Richang Hong, Xueliang Liu, Mingliang Xu, and Qianru Sun. 2022. Revisiting Local Descriptor for Improved Few-Shot Classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 2s, Article 127 (oct 2022), 23 pages.
  • He et al. (2020) Jun He, Richang Hong, Xueliang Liu, Mingliang Xu, Zheng-Jun Zha, and Meng Wang. 2020. Memory-Augmented Relation Network for Few-Shot Learning. In Proceedings of the 28th ACM International Conference on Multimedia. 1236–1244.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2961–2969.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • Iscen et al. (2020) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondřej Chum, and Cordelia Schmid. 2020. Graph convolutional networks for learning with few clean and many noisy labels. In European Conference on Computer Vision (ECCV). Springer, 286–302.
  • Iscen et al. (2022) Ahmet Iscen, Jack Valmadre, Anurag Arnab, and Cordelia Schmid. 2022. Learning with neighbor consistency for noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4672–4681.
  • Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML). PMLR, 2304–2313.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS) 25 (2012), 1097–1105.
  • Li et al. (2021a) Junnan Li, Caiming Xiong, and Steven Hoi. 2021a. MoPro: Webly Supervised Learning with Momentum Prototypes. In International Conference on Learning Representations (ICLR).
  • Li et al. (2021b) Junnan Li, Pan Zhou, Caiming Xiong, and Steven Hoi. 2021b. Prototypical Contrastive Learning of Unsupervised Representations. In International Conference on Learning Representations (ICLR).
  • Li et al. (2024) Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan Kankanhalli. 2024. Improve Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning. In International Conference on Machine Learning (ICML).
  • Liang et al. (2023a) Chao Liang, Zongxin Yang, Linchao Zhu, and Yi Yang. 2023a. Co-Learning Meets Stitch-Up for Noisy Multi-Label Visual Recognition. IEEE Transactions on Image Processing 32 (2023), 2508–2519.
  • Liang et al. (2023b) Chao Liang, Linchao Zhu, Humphrey Shi, and Yi Yang. 2023b. Combating Label Noise With A General Surrogate Model For Sample Selection. arXiv preprint arXiv:2310.10463 (2023).
  • Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
  • Luo et al. (2015) Changzhi Luo, Bingbing Ni, Shuicheng Yan, and Meng Wang. 2015. Image classification by selective regularized subspace learning. IEEE Transactions on Multimedia 18, 1 (2015), 40–50.
  • Ma et al. (2018) Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. 2018. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning (ICML). PMLR, 3355–3364.
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a Model for Few-Shot Learning. In International Conference on Learning Representations (ICLR).
  • Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 779–788.
  • Reed et al. (2014) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014).
  • Ren et al. (2018) Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning (ICML). PMLR, 4334–4343.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems (NIPS), Vol. 28. Curran Associates, Inc.
  • Ricci et al. (2023) Simone Ricci, Tiberio Uricchio, and Alberto Del Bimbo. 2023. Meta-Learning Advisor Networks for Long-Tail and Noisy Labels in Social Image Classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 19, 5s, Article 169 (jun 2023), 23 pages.
  • Rusu et al. (2018) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. 2018. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960 (2018).
  • Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065 (2016).
  • Shi et al. (2023) Yanyan Shi, Shaowu Yang, Wenjing Yang, Dianxi Shi, and Xuehui Li. 2023. Boosting Few-Shot Object Detection with Discriminative Representation and Class Margin. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (jul 2023).
  • Shu et al. (2019) Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. 2019. Meta-weight-net: Learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379 (2019).
  • Siam et al. (2019) Mennatullah Siam, Boris N Oreshkin, and Martin Jagersand. 2019. Amp: Adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 5249–5258.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems (NIPS), Vol. 30. Curran Associates, Inc.
  • Sun et al. (2022) Zeren Sun, Yazhou Yao, Xiu-Shen Wei, Fumin Shen, Huafeng Liu, and Xian-Sheng Hua. 2022. Boosting Robust Learning via Leveraging Reusable Samples in Noisy Web Data. IEEE Transactions on Multimedia (2022).
  • Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tanaka et al. (2018) Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. 2018. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5552–5560.
  • Tang and Yu (2023) Yiming Tang and Yi Yu. 2023. Query-Guided Prototype Learning with Decoder Alignment and Dynamic Fusion in Few-Shot Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 19, 2s, Article 84 (mar 2023), 20 pages.
  • Thomee et al. (2016) Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016), 64–73.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems (NIPS), Vol. 29. Curran Associates, Inc.
  • Wang et al. (2023) Xiaohan Wang, Linchao Zhu, Zhedong Zheng, Mingliang Xu, and Yi Yang. 2023. Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision. IEEE Transactions on Multimedia 25 (2023), 6079–6089.
  • Xu et al. (2021) Youjiang Xu, Linchao Zhu, Yi Yang, and Fei Wu. 2021. Training robust object detectors from noisy category labels and imprecise bounding boxes. IEEE Transactions on Image Processing 30 (2021), 5782–5792.
  • Yang et al. (2021b) Yi Yang, Yueting Zhuang, and Yunhe Pan. 2021b. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering 22, 12 (2021), 1551–1558.
  • Yang et al. (2024a) Zongxin Yang, Guikun Chen, Xiaodi Li, Wenguan Wang, and Yi Yang. 2024a. DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent). In International Conference on Machine Learning (ICML).
  • Yang et al. (2024b) Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, and Yi Yang. 2024b. Scalable video object segmentation with identification mechanism. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
  • Yang et al. (2021a) Zongxin Yang, Yunchao Wei, and Yi Yang. 2021a. Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 9 (2021), 4701–4712.
  • Yi and Wu (2019) Kun Yi and Jianxin Wu. 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7017–7025.
  • Zhai et al. (2022) Deming Zhai, Ruifeng Shi, Junjun Jiang, and Xianming Liu. 2022. Rectified Meta-Learning from Noisy Labels for Robust Image-Based Plant Disease Classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 18, 1s, Article 30 (jan 2022), 17 pages.
  • Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016).
  • Zhang et al. (2020) Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. 2020. DeepEMD: Few-Shot Image Classification With Differentiable Earth Mover’s Distance and Structured Classifiers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang et al. (2024) Yue Zhang, Hehe Fan, and Yi Yang. 2024. Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models. arXiv preprint (2024).
  • Zhao et al. (2024) Shuai Zhao, Xiaohan Wang, Linchao Zhu, and Yi Yang. 2024. Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models. In International Conference on Learning Representations (ICLR).
  • Zhou et al. (2017) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
  • Zhu et al. (2022) Linchao Zhu, Hehe Fan, Yawei Luo, Mingliang Xu, and Yi Yang. 2022. Temporal Cross-Layer Correlation Mining for Action Recognition. IEEE Transactions on Multimedia 24 (2022), 668–676.