This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Unleash Model Potential: Bootstrapped Meta Self-supervised Learning

Jingyao Wang, Zeen Song, Wenwen Qiang*, Changwen Zheng
Institute of Software Chinese Academy of Sciences
{wangjingyao2023, songzeen, qiangwenwen, changwen}@iscas.ac.cn
Abstract

The long-term goal of machine learning is to learn general visual representations from a small amount of data without supervision, mimicking three advantages of human cognition: i) no need for labels, ii) robustness to data scarcity, and iii) learning from experience. Self-supervised learning and meta-learning are two promising techniques to achieve this goal, but they both only partially capture the advantages and fail to address all the problems. Self-supervised learning struggles to overcome the drawbacks of data scarcity, while ignoring prior knowledge that can facilitate learning and generalization. Meta-learning relies on supervised information and suffers from a bottleneck of insufficient learning. To address these issues, we propose a novel Bootstrapped Meta Self-Supervised Learning (BMSSL) framework that aims to simulate the human learning process. We first analyze the close relationship between meta-learning and self-supervised learning. Based on this insight, we reconstruct tasks to leverage the strengths of both paradigms, achieving advantages i and ii. Moreover, we employ a bi-level optimization framework that alternates between solving specific tasks with a learned ability (first level) and improving this ability (second level), attaining advantage iii. To fully harness its power, we introduce a bootstrapped target based on meta-gradient to make the model its own teacher. We validate the effectiveness of our approach with comprehensive theoretical and empirical study.

1 Introduction

Humans are able to understand the world with three advantages [35, 8]: i) no need for supervised information; ii) only a small number of samples is required for recognizing a classification task; and iii) learning based on the existing prior knowledge of the world. Correspondingly, the ultimate aim of machine learning is to leverage prior knowledge and learn representations that can transfer across different tasks without requiring any supervision, even when the data is scarce.

Self-supervised learning (SSL) is a promising approach to achieve this goal, as it can learn general representations without supervision and generalize to downstream tasks [28, 44, 46, 4]. SSL applies various data augmentations [45, 56] to generate different views of the same image and encourages them to have similar embeddings while being dissimilar from views obtained by other images. SSL has been considered as a close approximation to human learning in machine learning [1, 33, 21]. However, we argue that it only partially captures the first advantage of human learning and fails to address the other two. Specifically, we show that data augmentation cannot fully compensate for the lack of data diversity and may even harm the performance when overused (see Figure 2). Moreover, we point out that SSL relies on a single data-based fixed prior, e.g. data distribution need to satisfy uniform distribution, which may introduce bias when data is scarce and limit the adaptability to new scenarios. Therefore, SSL still faces significant challenges in overcoming the low-data barrier and incorporating flexible prior knowledge as humans do.

Refer to caption
Figure 1: Accuracies with linear evaluation for different scales of data and augmentation. "SimCLR-NNx" means that multiple data enhancements are used randomly to expand the data to NN times.
Refer to caption
Figure 2: Restructuring SSL data into meta-Learning tasks. Views augmented from the same "cat" with the Anchor are positive (same class), while the views of other samples (the “car” and the “leaf”) are negative (different class).

Meta-learning is another promising approach for this goal that aims to overcome the low-data barrier by learning to quickly adapt to new tasks with limited data [20, 10, 18, 48, 51]. Meta-learning mimics human learning by using a double loop structure: the inner loop optimizes a task-specific model based on the current learning ability, and the outer loop updates this ability based on the feedback from multiple tasks. However, meta-learning still relies on supervision to update the ability, which violates the first advantage of human learning. Some recent works attempt to use self-supervised pseudo-labels for meta-learning to avoid supervision [2, 13, 5], but this strategy is computationally expensive and may not generalize well to different tasks. Furthermore, meta-learning suffers from two limitations: i) inner loop: the task-specific model ff is only updated for LL steps, which may not capture the optimal state for the task; ii) outer loop: the learning ability update is based on the same objective and geometry as ff, which may propagate the errors of ff and compromise the final results. Therefore, meta-learning still has a large gap from achieving human-like learning capabilities.

In this work, we reveal a surprising connection between self-supervised learning and meta-learning: they can be unified by viewing the batch of classes sampled in meta-learning as the augmented views of samples generated in self-supervised learning. Based on this insight, we propose a novel Bootstrapped Meta Self-Supervised Learning framework (BMSSL) that aims to emulate the three advantages of human learning. To achieve the first and second advantages, we present a simple and general method to restructure tasks that can leverage both self-supervised learning and meta-learning, overcoming the limitations of labels and data volume. Figure 2 briefly illustrates our idea. To achieve the third advantage, we use a bi-level meta-learning structure with gradient-based optimization to update the initial parameters based on feedback from multiple tasks. To address the existing bottlenecks of meta-learning, we further introduce a bootstrapped target based on meta-gradient to allow the model to learn from itself. Our contributions are as follows:

  • We discuss the close relationship between meta-learning and self-supervised learning, and experimentally verify the feasibility of our inductive ideas.

  • We propose a novel Bootstrapped Meta Self-Supervised framework (BMSSL) to simulate the learning process of human.

  • We conduct theoretical analysis and empirical study on our proposed framework to verify its effectiveness.

2 Related Work

Self-supervised learning. Self-supervised learning (SSL) enables learning of general visual representations by imposing an additional constraint between different views of raw input data without accessing any annotated data [12, 32, 3, 16, 50]. Recently, mainstream approaches rely on constructing positive and negative viewpoints for examples through augmentations [42, 21, 42] to learn pretext tasks and generalize to downstream tasks: positive samples are typically augmented views of the same reference instance, while negative examples are defined as any view from a different instance. The learning process is based on the pioneering NCE method [38], which uses contrastive loss [63, 64] to enforce discrimination between positive and negative viewpoints of each instance, leading to the learning of useful semantics. The learned model provides visually discriminative representations uniformly scattered in the feature space. However, these methods are difficult to generalize when data is scarce, and learn based on only one stage of data instead of quickly adapting to new tasks from experience with multiple learning processes like humans.

Meta-learning. Meta-learning aims to learn a model that can quickly adapt to new tasks with limited data and generalize to unseen examples. The meta-learning methods can be divided into two categories: i) learn the optimal initialization to adapt to new tasks quickly [14, 37, 23]; ii) learn a shared embedding space and amortizing inference [52, 47, 49, 62]. Recently, meta-learning has achieved superior performance in various applications, such as few-shot classification [58, 41], reinforcement learning [40, 60], and hyperparameter optimization [59]. These models cleverly design tasks that rely on a few labeled samples to learn a general visual representation unit, but often fails to provide reliable uncertainty estimates when only a few meta-training tasks or on supervision is provided [34, 20]. Some methods adopt a patchwork approach to solve this problem: using unsupervised model to construct pseudo-labels, then use them as supervision for meta-learning. However, although this approach can learning representations from limited data without human priors, it leads to huge computational resources while be difficult to guarantee accuracy [39]. Meanwhile, due to the myopia and curvature limitations of meta-learning [15], it still cannot meet expectations about simulating human learning.

3 The Relationship between Meta-Learning and Self-supervised Learning

In this section, we study the close relationship between self-supervised learning and meta-learning. We first review the learning paradigms of both frameworks and highlight their similarities from three perspectives. Based on these insights, we propose a simple meta-learning-based approach to reconstruct self-supervised tasks and evaluate its effectiveness experimentally.

The training procedures of meta-learning and self-supervised learning are shown in Figure 3. For self-supervised learning: i) sample {x}iN\{x\}_{i}^{N} from the distribution of the raw data space 𝒳\mathcal{X}; ii) apply multiple data augmentations to {x}iN\{x\}_{i}^{N}, e.g., random scaling, rotation, and cropping, obtaining 𝒜={aix1,,aixN}i=1M\mathcal{A}=\{a_{i}^{x_{1}},\dots,a_{i}^{x_{N}}\}_{i=1}^{M}; iii) learn a general visual representation based on minimizing self-supervised learning loss, e.g., contrastive loss; iv) use pre-trained models to extract features and verify them in downstream tasks. For meta-learning: i) sample xiN{x}_{i}^{N} from the distribution of the raw data space XX; ii) construct tasks {T}iK\{T\}_{i}^{K} by partitioning xiN{x}_{i}^{N}; iii) find the best initial parameters based on meta loss (outer loop): minimizes the cumulative gradient loss for all of the task-specific model (inner loop); iv) use the trained model for fast adaptation to new tasks. Conceptually, we find that the training process of self-supervised learning is similar to meta-learning and includes several similarities:

  • Both aim to learn generalizable representations and quickly adapt to new tasks: self-supervised learning aims to discriminate between unseen images; meta learning aims to discriminate between unseen tasks.

  • Both learn a fixed amount of information that can be transferred to new tasks: self-supervised learning leverages the similarity and the dissimilarity among multiple views of the samples; meta-learning leverages the similarity of instances within each task.

  • Both use batches as units of data processing: self-supervised learning treats all views generated by a single augmentation as a batch; meta learning treats each task which consists of KK nn-way-mm-shot tasks as a batch.

Refer to caption
Figure 3: (a) Training procedure of contrastive learning. Augmented views from images XX are generated by applying random transformations to the same input batch, obtain corresponding pseudo-labels YY. HH and PP are vectors through a backbone for representations and a projector learned through contrastive prediction tasks. (b) Training procedure of meta-learning. Its input is labeled data ZZ, and learns a task-specific model based on a learnable initialization parameter θ\theta. The subsequent processing of each task is similar to self-supervised learning, but θ\theta is updated based on the cumulative gradient of all tasks.

Inspired by this, we propose a general paradigm to unify self-supervised learning and meta-learning. The idea is to transform self-supervised learning into a task distribution that is suitable for meta-learning optimization and learn from it. The paradigm consists of the following steps: i) randomly sample a batch of NN input images {xi}i=1N𝒳\left\{x_{i}\right\}_{i=1}^{N}\in\mathcal{X} from the candidate pool; ii) divide 𝒳\mathcal{X} into KK blocks, each block contains N/KN/K images; iii) apply multiple data augmentations on x𝒳x\in\mathcal{X} of each blocks, obtaining a𝒜a\in\mathcal{A}; iv) create an NN-way classification problem for each block: all data generated from the same xix_{i} is regarded as a category, with z={a,y}𝒵z=\{a,y\}\in\mathcal{Z} as the data and y𝒴y\in\mathcal{Y} as the pseudo-labels; v) integrate tasks corresponding to KK blocks in a batch, obtaining the task distribution p(𝒯)p(\mathcal{T}); vi) update the task-specific model ww for task TiT_{i} based on the meta-initialized parameters θζ\theta\in\zeta by calculating the loss l()l(\cdot) (inner loop); vii) use the meta-objective function (,θ)\mathcal{L}(\cdot,\theta) to update θ\theta, which is calculate through argmin1Ki=1Kl(wi)\arg\min\frac{1}{K}\sum_{i=1}^{K}l(w_{i}) (outer loop). Before further extending this paradigm, we state the assumptions needed for this paradigm. Theoretical analysis details can be found in Section 4.3 and Appendix A.

Assumption 3.1

We assume that 𝒵\mathcal{Z} is a Polish space (i.e., complete, separable, and metrizable), and for any ii, p(Ti)p(T_{i}) is a non-atomic probability distribution on (𝒵,)(\mathcal{Z},\mathcal{L}), where (,θ)\mathcal{L}(\cdot,\theta) is a Borel σ\sigma-algebra on 𝒵\mathcal{Z}.

Assumption 3.2

The ground truth parameter θ\theta^{*} is independent of 𝒳\mathcal{X} and satisfies Cov[θ]=(R2/d)Id\mathrm{Cov}[\theta^{*}]=(R^{2}/d)\mathrm{}I_{d}, where RR is a constant and dd is the dimension of the model parameter.

Assumption 3.3

For any z𝒵z\in\mathcal{Z}, the function (,θ)\mathcal{L}(\cdot,\theta) is twice continuously differentiable, and satisfies the following properties for any z𝒵z\in\mathcal{Z}, wi,wjdw_{i},w_{j}\in\mathbb{R}^{d}:

  • The function (,θ)\mathcal{L}(\cdot,\theta) is KK-bounded over 𝒲\mathcal{W} with the gradient norm uniformly bounded by GG, i.e., l(z,wi)G\left\|\bigtriangledown l(z,w_{i})\right\|\leq G;

  • The function (,θ)\mathcal{L}(\cdot,\theta) is L-smooth over d\mathbb{R}^{d}, i.e., l(z,wi)l(z,wj)L(wiwj)0\left\|\bigtriangledown l(z,w_{i})-\bigtriangledown l(z,w_{j})\right\|-L(w_{i}-w_{j})\leq 0;

  • The function (,θ)\mathcal{L}(\cdot,\theta) is μ\mu-strongly convex, i.e., l(z,wi)l(z,wj)μ(wiwj)0\left\|\bigtriangledown l(z,w_{i})-\bigtriangledown l(z,w_{j})\right\|-\mu(w_{i}-w_{j})\geq 0.

To evaluate the feasibility of the proposed general paradigm, we compare the performance of typical self-supervised frameworks, SimCLR [9], Barlow Twins [57], and MoCo [17], in learning representations under different settings. We measure the top-1 accuracies on the ImageNet1K [53] and CIFAR-10 [29] datasets without imposing any data restrictions, where pure self-supervised frameworks can perform well. We expect that the variants of the model based on the proposed paradigm can benefit from it or at least not degrade in performance in such environments. The results in Tables 3 and 3 confirm this expectation. We observe that the self-supervised learning representations based on the proposed paradigm can achieve comparable or even better performance than the original frameworks on both datasets. Moreover, since the improvement is limited in this setting, we conduct a more comprehensive assessment from multiple perspectives in Section 5.

Table 1: Accuracies(%) on ImageNet1K on SSL baselines ("-o" means using the meta-learning-based setting, while "-x" means not).
Method Backbone Top-1 Acc
SimCLR-x ResNet-50 64.561
SimCLR-o ResNet-50 65.156
Barlow Twins-x ResNet-50 66.561
Barlow Twins-o ResNet-50 66.952
Moco-x ResNet-50 59.382
Moco-o ResNet-50 59.156
Table 2: Accuracies(%) on CIFAR-10 on SSL baselines ("-o" means using the meta-learning-based setting, while "-x" means not).
Method Backbone Top-1 Acc
SimCLR-x ResNet-18 89.416
SimCLR-o ResNet-18 91.516
Barlow Twins-x ResNet-18 92.513
Barlow Twins-o ResNet-18 92.789
Moco-x ResNet-18 82.465
Moco-o ResNet-18 83.165

4 A Bootstrapped Meta Self-Supervised Learning Framework

In this section, we introduce the proposed Bootstrapped Meta Self-Supervised Learning framework (BMSSL), which is inspired by the human cognitive model. Our key idea includes: i) construct few-shot classification tasks using a multi-view queue generated by data augmentation in self-supervised learning (Section 4.1); ii) use a gradient-based two-level optimization structure to learn a universal visual representation (Section 4.2): the inner loop use contrastive learning to learn specific task (first optimization); the outer loop find the optimal initial parameters for inner loop with bootstrapped meta gradient through minimizing the distance to the self-bootstrapped objective, which allows the model to become its own teacher (second optimization). The pseudocode of our framework is provided in Appendix A. It is worth noting that BMSSL is designed based on the observation in Section 3, thus adopting consistent notation and settings.

4.1 Online Task Construction

We now describe how to re-construct self-supervised task based on meta-learning for BMSSL. Firstly, we randomly select NN unlabelled data x𝒳x\in\mathcal{X} from a candidate pool of training data 𝒟pool\mathcal{D}_{pool} to form 𝒟={xi}i=1N\mathcal{D}=\{x_{i}\}_{i=1}^{N}. Secondly, we apply data augmentation to 𝒟\mathcal{D}, obtaining 𝒟^={{aix1}i=1M,{aix2}i=1M,,{aixN}i=1M}\hat{\mathcal{D}}=\{\{a_{i}^{x_{1}}\}_{i=1}^{M},\{a_{i}^{x_{2}}\}_{i=1}^{M},…,\{a_{i}^{x_{N}}\}_{i=1}^{M}\}, where {aixj}i=1M\{a_{i}^{x_{j}}\}_{i=1}^{M} is the result of applying MM times augmentation on xjx_{j}. Next, we divide 𝒟^\hat{\mathcal{D}} into KK blocks 𝒟^={𝒟^1,,𝒟^K}\hat{\mathcal{D}}=\left\{\hat{\mathcal{D}}_{1},...,\hat{\mathcal{D}}_{K}\right\}, and each block contains (N/K)×M(N/K)\times M images. For humans, different images of the same object are easily categorized because of their high similarity. Similarly, as there is significant entity similarity in data augmented from the same image, we assign the same label yj𝒴y_{j}\in\mathcal{Y} to the augmented data of the same image xjx_{j} to simulate human classification thinking. We use z={a,y}𝒵z=\{a,y\}\in\mathcal{Z} to represent the data that is categorized in the above way, where yy represents the label of the augmented sample aa of xx, i.e., the dataset 𝒟^={zi}i=1N×M\hat{\mathcal{D}}=\{z_{i}\}_{i=1}^{N\times M}. Based on the analysis in Section3, each block in 𝒟^i\hat{\mathcal{D}}_{i} can be regarded as a N/KN/K-way task 𝒯i\mathcal{T}_{i}, where 𝒟^={𝒯i}i=1K\hat{\mathcal{D}}=\{\mathcal{T}_{i}\}_{i=1}^{K}. Take 𝒟^1={zi}i=1(N/K)×M\hat{\mathcal{D}}_{1}=\{z_{i}\}_{i=1}^{(N/K)\times M} as an example, we divide 𝒟^1\hat{\mathcal{D}}_{1} into two parts for the support set 𝒟^1s\hat{\mathcal{D}}_{1}^{s} and query set 𝒟^1q\hat{\mathcal{D}}_{1}^{q}, respectively, each containing (N/K)×M1(N/K)\times M_{1} and (N/K)×M2(N/K)\times M_{2} images (M1+M2=MM_{1}+M_{2}=M). This approach transforms a batch of unlabeled data into KK N/KN/K-way-M1M_{1}-shot classification tasks which realizes the advantages i and ii of human learning.

4.2 Bootstrapped Meta Self-supervised Learning

We now describe how to use constructed few-shot tasks for meta-learning. As humans, we learn specific knowledge and build mind maps by going through two levels of abstraction, allowing us to quickly recognize similar things. Therefore, we focus on gradient-based methods [14] that use two-level loops to constrain updates for task-specific and meta-learners to obtain optimal initial parameters θ\theta, which is analogous to human experience for faster adaptation to new tasks.

For the inner loop (first optimization), the learner’s objective is to learn f(w)f(w) by minimizing the learning objective [l(f(w);θ,𝒟)]\left[l(f(w);\theta,\mathcal{D})\right] over the training data 𝒟\mathcal{D}, where 𝒟\mathcal{D} represents a mini-batch of unlabeled data source, ff represents a neural network that consists of the feature extractor, projection head (contrastive learning), and classifier (cross entropy). We denote the parameter of f(w)f(w) as ww. Based on Subsection 4.1, 𝒟\mathcal{\mathcal{D}} can be extended to KK N/KN/K-way-M1M_{1}-shot tasks. Considering the impact of label generation errors, the task-specific loss function consists of the cross-entropy loss lce()l_{ce}(\cdot) and the contrastive loss lcl()l_{cl}(\cdot), expressed as:

l(f(w);θ,𝒟)=lce(f(w);θ,𝒟)+λlcl(f(w);θ;𝒟)s.t.lce()=j=1j=Ni=1i=Myjlog𝑎ixjlcl()=j=1,i=1j=N,i=Mlogr=1,rir=Mexp(sim(𝑎ixj,𝑎rxj)/τ)r=1,rir=Mexp(sim(𝑎ixj,𝑎rxj))+p=1,pjNo=1o=Mexp(sim(𝑎ixj,𝑎pxo)/τ)\begin{array}[]{l}l(f(w);\theta,\mathcal{D})=l_{ce}(f(w);\theta,\mathcal{D})+\lambda l_{cl}(f(w);\theta;\mathcal{D})\\ s.t.\quad l_{ce}(\cdot)=-\sum\nolimits_{j=1}^{j=N}{\sum\nolimits_{i=1}^{i=M}{{y_{j}}\log{\mathop{a}\limits^{\cdot}}_{i}^{{x_{j}}}}}\\ l_{cl}(\cdot)=-\sum\nolimits_{j=1,i=1}^{j=N,i=M}{\log\frac{{\sum\nolimits_{r=1,r\neq i}^{r=M}{\exp(sim({\mathop{a}\limits^{*}}_{i}^{{x_{j}}},{\mathop{a}\limits^{*}}_{r}^{{x_{j}}})/\tau)}}}{{\sum\nolimits_{r=1,r\neq i}^{r=M}{\exp(sim({\mathop{a}\limits^{*}}_{i}^{{x_{j}}},{\mathop{a}\limits^{*}}_{r}^{{x_{j}}}))}+\sum\nolimits_{p=1,p\neq j}^{N}{\sum\nolimits_{o=1}^{o=M}{\exp(sim({\mathop{a}\limits^{*}}_{i}^{{x_{j}}},{\mathop{a}\limits^{*}}_{p}^{{x_{o}}})/\tau)}}}}}\end{array} (1)

where λ\lambda describes the importance of lcl()l_{cl}(\cdot), 𝑎ixj{\mathop{a}\limits^{\cdot}}_{i}^{{x_{j}}} is the output of classifier, and 𝑎ixj{\mathop{a}\limits^{*}}_{i}^{{x_{j}}} is the output of projection head. Finally, based on an initial parameter θ0\theta^{0} of ww, we train a task-specific model with learning rate α\alpha and obtain the weights w=θ0αwl(w;θ0,𝒟)w=\theta^{0}-\alpha\bigtriangledown_{w}l(w;\theta^{0},\mathcal{D}).

For the outer loop (second optimization), the aim of meta-learner is to learn the model FF, e.g., F(𝒟)=f(w)F(\mathcal{D})=f(w), which means generating the optimal task-specific model f(w)f(w) given dataset 𝒟\mathcal{D}. In machine learning, ff needs to be obtained through a series of gradient descent updates based on the loss function l()l(\cdot), starting from an initial parameter, and it is difficult to reach the optimal result in a single step. Therefore, the objective of the meta-learner is transformed into finding a learnable optimal initialization θζ\theta\in\zeta to let f(w)f(w) to quickly obtain the optimal parameter w𝒲w\in\mathcal{W}. Thus, meta-learning formulate the objective function of learning FF as minθl(f(w(θ));θ,𝒟),s.t.,w(θ)=argminwl(f(w);θ,𝒟){\min_{\theta}}l(f({w^{*}}(\theta));\theta,\mathcal{D}),s.t.,{w^{*}}(\theta)=\arg{\min_{w}}l(f(w);\theta,\mathcal{D}). We can see that the standard meta-gradient firstly optimizes constraint condition w(θ)=argminwl(f(w);θ,𝒟){w^{*}}(\theta)=\arg{\min_{w}}l(f(w);\theta,\mathcal{D}) by performing LL steps updates and then evaluating w(θ){w^{*}}(\theta) under l(f(w(θ));θ,𝒟)l(f({w^{*}}(\theta));\theta,\mathcal{D}), thus obtaining the update rule of θ\theta:

θ=θβθl(f(w(θ));θ,𝒟)\theta^{{}^{\prime}}=\theta-\beta\bigtriangledown_{\theta}l(f({w^{*}}(\theta));\theta,\mathcal{D}) (2)

However, although this method can introduce prior experience θ\theta like humans and quickly adapt to new tasks, the learning process of θ\theta still has limitations: i) it highly depends on f(w)f(w), while human thinking is unrestricted; ii) it is based on updates within a limited number of steps LL, while human learning is extendable. Therefore, considering that human induction is based on entity similarity, we convert this experiential learning into a metric in the model: using the bootstrapped target to move wL(θ){w^{L}}(\theta) closer to its L+δL+\delta version wL+δ(θ){w^{L+\delta}}(\theta). We regard wL(θ){w^{L}}(\theta) and wL+δ(θ){w^{L+\delta}}(\theta) as two discrete uniform distributions with respect to their constituent units. To expand the perspective of updating θ\theta, for the outer loop, we use KL divergence to bring the distribution πw(L)\pi_{w^{(L)}} obtained by wL(θ){w^{L}}(\theta) step closer to the distribution πw\pi_{\vec{w}} obtained by wL+δ(θ){w^{L+\delta}}(\theta) to bootstrapped state wL(θ){w^{L}}(\theta) extended to wL+δ(θ){w^{L+\delta}}(\theta), thereby encouraging the meta-learner to reach future states on its trajectory faster. We have:

θ=θβθDKL(πw,πwL)\begin{array}[]{l}\vec{\theta}=\theta-\beta\bigtriangledown_{\theta}D_{KL}(\pi_{\vec{w}},\pi_{w^{L}})\\ \end{array} (3)

where wL(θ){w^{L}}(\theta) is to update LL step based on w(θ)=argminwl(f(w);θ,𝒟){w^{*}}(\theta)=\arg{\min_{w}}l(f(w);\theta,\mathcal{D}), while wL+δ(θ){w^{L+\delta}}(\theta) is to update L+δL+\delta step.

This measure eliminates the dependence of the outer update θ\theta on the task-specific model ff, and makes πwL\pi_{w^{L}} continuously approach πw\pi_{\vec{w}} that contains future information to achieve convergence and break through the limitation of limited updates. This paradigm makes the model its own teacher.

4.3 Theoretical Analysis

We now conduct a theoretical analysis of our BMSSL for performance guarantees. We defer all proofs and more analysis to Appendix A.

First, consider task construction described in Subsection 4.1, the goal is to identify a representation that allows us to approximate many different "reasonable" choices by gg. It can group augmented views of similar entities together, i.e. every gg that satisfies the following assumption:

Assumption 4.1

(Approximate View-Invariance): The best estimate of the label yy is approximately invariant to the choice of different augmented views aa of the same xx. Each target function g:𝒜ng:\mathcal{A}\to\mathbb{R}^{n} satisfies:

𝔼p+(a1,a2)[(g(a1)g(a2))2]ε\mathbb{E}_{p_{+}(a_{1},a_{2})}\left[(g(a_{1})-g(a_{2}))^{2}\right]\leq\varepsilon (4)

where p+(a1,a2)=xp(a1|x)p(a2|x)p(x)p_{+}(a_{1},a_{2})=\sum_{x}p(a_{1}|x)p(a_{2}|x)p(x) when fixed ε[0,)\varepsilon\in[0,\infty).

We can then constrain the error of approximating gg with a small subset of eigenfunctions by constraining each coefficient according to its contribution to the total positive pair difference. We focus on a class of linear predictors on top of kk-dimensional representations r:𝒜nr:\mathcal{A}\to\mathbb{R}^{n}, among which the representation rd={p1(a),p2(a),,pd(a)}r^{d}=\left\{p_{1}(a),p_{2}(a),...,p_{d}(a)\right\} contains dd eigenfunctions of the positive-pair Markov chain with the largest eigenvalues is the best choice and used for task-specific training (Equation 1).

Theorem 4.2

(Task-specific Performance Guarantee) Let 𝒢ε\mathcal{G}_{\varepsilon} be the functions satisfying Assumption 4.1, and 𝒢r={ag^ν(a)=νTr(a)}\mathcal{G}_{r}=\left\{a\mapsto\hat{g}_{\nu}(a)=\nu^{T}r(a)\right\} be the subspace of linear predictors which maximizes the view invariance of the least-invariant unit-norm predictor 𝒢rd\mathcal{G}_{r^{d}} with implicit regularization effect:

𝒢rd=argmin𝒢maxg^𝒢𝔼p+(a1,a2)[(g^(a1)g^(a2))2],s.t. dim(𝒢)=d,𝔼[g^(a)2=1]\mathcal{G}_{r^{d}}=\mathop{\arg\min}_{\mathcal{G}}\mathop{\max}_{{\hat{g}\in\mathcal{G}}}\mathbb{E}_{p_{+}(a_{1},a_{2})}\left[(\hat{g}(a_{1})-\hat{g}(a_{2}))^{2}\right],\quad\text{s.t. }\text{dim}(\mathcal{G})=d,\ \mathbb{E}[\hat{g}(a)^{2}=1] (5)

it is implied in 𝒢\mathcal{G}, which minimizes the approximation error of the worst-case objective function ff:

𝒢rd=argmin𝒢maxg𝒢εming^𝒢𝔼p+(a1,a2)[(g^(a1)g^(a2))2]\mathcal{G}_{r^{d}}=\mathop{\arg\min}_{\mathcal{G}}\mathop{\max}_{g\in\mathcal{G}_{\varepsilon}}\mathop{\min}_{{\hat{g}\in\mathcal{G}}}\mathbb{E}_{p_{+}(a_{1},a_{2})}\left[(\hat{g}(a_{1})-\hat{g}(a_{2}))^{2}\right] (6)

Theorem 4.2 states that the function class we built (Equation 1) is the best choice for the least squares approximation that satisfies Assumption 4.1. Next, we turn to bootstrapped meta-training in outer loop, where we take Assumption 3.1-3.3 mentioned in Section 3.

Theorem 4.3

(Bootstrapped Meta-training Performance Guarantee) Let ww and θ\theta be given by Equations mentioned in Section 4.2, the update process satisfies:

f(wL(θ))f(wL(θ))=βα(DKL(w,wLαwf(wL))DKL(w,wL))+o(β(α+β))f(w^{L}(\vec{\theta}))-f(w^{L}(\theta))=\frac{\beta}{\alpha}(D_{KL}(\vec{w},w^{L}-\alpha\nabla_{w}f(w^{L}))-D_{KL}(\vec{w},w^{L}))+o(\beta(\alpha+\beta)) (7)

Let θ\vec{\theta} and θ\theta^{\prime} be given by Equation (3) and (2) respectively, f(wL(θ))f(wL(θ))0f(w^{L}(\vec{\theta}))-f(w^{L}(\theta))\leq 0 when (α,β)(\alpha,\beta) sufficiently small, while f(wL(θ))f(wL(θ))0f(w^{L}(\theta^{\prime}))-f(w^{L}(\theta))\leq 0 when β\beta sufficiently small. With the state w\vec{w} bootstrapped from wLw^{L} with δ\delta steps which offer future distribution (better), the update process will turn into:

f(wL(θ))f(wL(θ))=βαDKL(w,wL)+o(β(α+β))<0f(w^{L}(\vec{\theta}))-f(w^{L}(\theta))=-\frac{\beta}{\alpha}D_{KL}(\vec{w},w^{L})+o(\beta(\alpha+\beta))<0 (8)

Therefore, compared to standard meta-learning, BMSSL enables models to reach optimal results faster while achieving convergence without utilizing gradient updates.

5 Empirical Study

In this section, we conduct several experiments to benchmark and analyze BMSSL, including standard self-supervised few-shot learning (Subsection 5.1), cross-domain self-supervised few-shot learning (Subsection 5.2), and ablation study (Subsection 5.3). The details of implementation are available at Appendix B. We omit the confidence intervals in this section for clarity, and the full results with them are provided in Appendix E. Our main objective is to demonstrate the effectiveness of BMSSL by exploring two key questions: i) Can BMSSL be applied successfully in self-supervised few-shot classification scenarios and achieve superior generalization performance? ii) By simulating the way humans learn, can BMSSL achieve more robust learning outcomes?

5.1 Standard Self-supervised Few-shot Learning

Setup. We evaluate BMSSL on three standard few-shot benchmarks of unsupervised meta-learning: Omniglot [30], miniImageNet [52], and CIFAR-FS [6]. Following [22], we compare the performance of BMSSL with unsupervised meta-learning methods [19, 24, 25, 31, 26, 22], self-supervised learning methods [9, 57, 17, 7], and supervised meta-learning methods [14, 47, 43]. To explore the effect of simulating the way humans learn, we introduce the standard meta self-supervised learning paradigm (MetaSSL) mentioned in Section 4, which is also a two-layer structure but not optimized by bootstrapped target. See Appendix C and D for details on benchmarks and baselines.

Results. Table 3 presents the results of few-shot classification on various (way, shot) tasks for the three benchmark datasets mentioned above. We obtain the following three observations: i) outstanding performance: BMSSL achieves outstanding performance on all three benchmarks, surpassing previous unsupervised meta-learning SOTA models. For instance, we obtain an accuracy gain of 3.763% in the 20-way-1-shot test. ii) improve generalization: its performance is competitive even in the unsupervised setting compare to supervised meta-learning and self-supervised baselines. iii) effectiveness of simulate human learning: we compare our approach with standard MetaSSL and achieve an average gain of 4.045%.

Table 3: Accuracy (%) of standard few-shot classification on Omniglot, miniImageNet, and CIFAR-FS benchmarks. The values in this table are average accuracies over 2000 "(way, shot)" tasks for BMSSL and baselines following [22]. Bold entries indicate the best for unsupervised tasks. Blue indicates our BMSSL before and after introducing bootstrapped mentioned in Section 4.2.
Method Omniglot miniImageNet CIFAR-FS
(5,1) (5,5) (20,1) (5,1) (5,5) (20,1) (5,1) (5,5) (20,1)
Train from scratch 50.29 72.82 26.20 24.20 38.84 16.29 31.12 44.89 20.32
Unsupervised Meta-learning
CACTUs[19] 65.29 86.25 49.54 39.32 53.54 31.99 40.02 58.16 35.88
UMTRA[24] 83.32 94.23 75.84 39.23 51.78 30.27 41.61 60.55 37.10
LASIUM[25] 82.38 95.11 70.23 42.12 54.98 34.26 45.33 62.65 38.40
Meta-SVEBM[26] 87.07 94.13 73.33 44.74 58.38 39.71 47.24 63.10 40.10
Meta-GMVAE[31] 90.89 96.05 81.51 42.28 56.97 39.83 47.45 63.20 41.55
PsCo[22] 96.18 98.22 89.32 46.35 63.05 40.84 51.77 69.66 45.08
MetaSSL 91.36 95.35 88.64 45.82 62.14 39.48 49.09 66.54 43.70
BMSSL 96.02 99.56 91.41 49.98 64.59 45.28 52.20 69.64 49.84
Self-supervised Learning
SimCLR[9] 90.83 97.67 81.67 42.32 51.10 36.36 49.44 60.02 39.29
MoCo[17] 87.83 95.52 80.03 40.56 49.41 36.52 45.35 58.11 37.89
SwAV[7] 91.28 97.21 82.02 44.39 54.91 37.13 49.39 62.20 40.19
Supervised Meta-learning
MAML[14] 93.22 97.53 82.36 45.84 63.25 36.77 48.25 58.00 39.52
ProtoNet[47] 95.83 99.29 92.80 46.58 63.20 40.11 51.28 69.55 46.65
CNAPs[43] 91.28 95.98 87.09 43.21 62.87 36.55 52.07 70.38 43.30

5.2 Cross-domain Self-supervised Few-shot Learning

Setup. We compare the effect of BMSSL and the baseline described by Seciton 5.1 on the cross-domain few-shot classification benchmarks, which is divided into two categories based on the similarity with ImageNet: i) high similarity: CUB [55], Cars [27], and Places [61]; ii) low similarity: CropDiseases [36], ISIC [11], and ChestX [54].

Results. Table 4 presents the performance of the model trained on miniImageNet for meta-learning on the benchmark datasets mentioned above. By observation, we further validate the performance of our proposed BMSSL: i) Effectiveness: achieves similar or even better results than the state-of-the-art baseline algorithms on all benchmark datasets; ii) Generalization: achieves nearly a 3% improvement compared to supervised meta-learning and self-supervised learning on the datasets with significant differences from the training phase; iii) Robustness: achieves similar results to the PsCo [22] which introduces out-of-distribution samples, even though we do not explicitly consider out-of-distribution samples on datasets with significant differences.

Table 4: Accuracy (%) of cross-domain few-shot classification with two types mentioned in Section 5.2. We transfer models trained on miniImageNet to each benchmark. The meanings of "(way, shot)", "Bold" and "Blue" in the table are consistent with Table 3.
Method CUB Cars Places CropDiseases ISIC ChestX
(5,5) (5,20) (5,5) (5,20) (5,5) (5,20) (5,5) (5,20) (5,5) (5,20) (5,5) (5,20)
Unsupervised meta-learning
Meta-SVEBM 45.893 54.823 33.530 44.622 50.516 61.561 71.652 84.515 37.106 48.001 27.238 29.652
Meta-GMVAE 48.783 55.651 30.205 39.946 55.361 65.520 72.683 80.777 30.630 37.574 24.522 26.239
PsCo 56.365 69.298 44.632 56.990 64.501 73.516 89.565 95.492 43.632 54.886 21.907 24.182
MetaSSL 54.238 65.031 45.341 56.526 62.538 70.022 83.922 90.058 40.140 50.209 24.827 25.238
BMSSL 57.543 69.561 49.636 59.511 67.250 75.834 87.524 95.950 46.518 56.293 29.463 30.389
Self-supervised Learning
SimCLR 51.389 60.011 38.639 52.412 59.523 68.419 80.360 89.161 44.669 51.823 26.556 30.982
MoCo 52.843 61.204 39.504 50.108 60.291 69.033 81.606 90.366 44.328 52.398 24.198 27.893
SwAV 51.250 61.645 36.352 51.153 58.789 68.512 80.055 89.917 43.200 50.109 21.252 28.270
Supervised Meta-learning
MAML 57.296 64.005 44.934 49.561 62.502 71.741 78.202 85.247 46.405 56.293 22.435 24.238
ProtoNet 56.237 64.829 40.893 48.123 59.887 69.207 76.651 84.164 40.028 49.289 22.219 25.839

5.3 Ablation Studies

Augmentations in task construction. Although we have shown in Figure 2 that augmentation cannot offset the impact of data scarcity, we have not yet explored the effects of different levels of augmentation on SSL task construction, which is directly related to the diversity and feature similarity of the samples in the task. We divide augmentation methods into four levels with different quantities (five kinds/one kind) and intensities (mild/strong, such as large/small area splicing), and apply them to evaluate the impact on the model. The experimental results in Table 5.3 shows that the benefits of data diversity to the model are limited, and augmentation strategies have little effect on the model.

The effect of bi-level optimization. BMSSL introduces learning experience through two loops of gradient update and gives the model the ability to optimize and constrain twice. To evaluate the its effect, we fix the constraint structure of the inner loop and compared it under the following three settings: i) one-time optimization + no introduction of experience (1\mathcal{M}_{1}): only containing task-specific models with random initialization; ii) one-time optimization + introduction of experience (2\mathcal{M}_{2}): only updating the model’s metric learning for specific tasks, and using it as experience; iii) two-time optimization + introduction of experience (3\mathcal{M}_{3}): BMSSL’s bi-level optimization structure. The results of this ablation experiment are shown in Table 5.3. BMSSL achieves nearly a 4% improvement, demonstrating the gain from priors and model structure on the algorithm.

Training LL and δ\delta. To find the optimal parameters of the model, we test the model on miniImageNet with different settings of LL and δ\delta. Tables 5.3 shows the model’s accuracy and running efficiency when L=5L=5, which runs on a V100 NVIDIA GPU. We find δ=5\delta=5 may be the best choice where the further increase has little effect on the accuracy, but the operating efficiency drops greatly. The introduction of δ\delta can adjust the efficiency of the model and achieve faster convergence through this distribution approximation. The full results and further analysis areavailable at Appendix E.

Table 5: Accuracy(%) on miniImageNet with four levels of data augmentation expressed as {𝒜i}i=14\left\{\mathcal{A}_{i}\right\}_{i=1}^{4}.
Levels Top-1 Acc(%)
𝒜1\mathcal{A}_{1} 49.730±\pm0.303
𝒜2\mathcal{A}_{2} 49.990±\pm0.238
𝒜3\mathcal{A}_{3} 49.789±\pm0.210
𝒜4\mathcal{A}_{4} 49.832±\pm0.199
Table 6: Accuracy(%) on miniImageNet under three types of model structures mentioned in Section 5.3.
Method Top-1 Acc(%)
1\mathcal{M}_{1} 39.583±\pm0.482
2\mathcal{M}_{2} 46.184±\pm0.298
3\mathcal{M}_{3} 49.987±\pm0.283
Table 7: Accuracy(%) and meta-training steps(/s/s) when L=5L=5 on miniImageNet with different δ\delta.
δ\delta Top-5 ACC(%) Steps(/s/s)
1 63.832 4.3
5 64.443 3.2
10 64.592 2.6
15 64.588 2.1
20 64.605 1.7

6 Conclusion

In this work, we propose a novel Bootstrapped Meta Self-Supervised Learning framework that simulates three advantages of human learning: i) without supervision, ii) not limited by data, and iii) learning from experience. We discuss the relationship between self-supervised learning and meta-learning, and leveraging our findings to propose a simple but clever approach for reconstructuring self-supervised tasks. Additionally, we employ bi-level optimizations to introduce experience for learning, and use meta-gradients to generate bootstrapped target to make the model its own teacher. Through theoretical analysis and extensive experiments, we demonstrate the superior performance of our framework.

Broader Impact and Limitations. This work offers a reliable way for machines to mimic human learning, providing technological advances in machine learning. We do not need to be data-bound or long-term training as previous methods. But it has a limitation that the evaluation focuses on visual tasks, without considering the learning effects of other fields (e.g., reinforcement learning and language recognition) or other tasks (e.g., regression, generation).

References

  • [1] Saleh Albelwi. Survey on self-supervised learning: auxiliary pretext tasks and contrastive learning methods in imaging. Entropy, 24(4):551, 2022.
  • [2] Mustafa Sercan Amac, Ahmet Sencan, Bugra Baran, Nazli Ikizler-Cinbis, and Ramazan Gokberk Cinbis. Masksplit: Self-supervised meta-learning for few-shot semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1067–1077, 2022.
  • [3] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019.
  • [4] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pages 1298–1312. PMLR, 2022.
  • [5] Trapit Bansal, Rishikesh Jha, Tsendsuren Munkhdalai, and Andrew McCallum. Self-supervised meta-learning for few-shot natural language classification tasks. arXiv preprint arXiv:2009.08445, 2020.
  • [6] Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
  • [7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  • [8] John B Carroll et al. Human cognitive abilities: A survey of factor-analytic studies. Number 1. Cambridge University Press, 1993.
  • [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [10] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9062–9071, 2021.
  • [11] Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 168–172. IEEE, 2018.
  • [12] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44(10):7112–7127, 2021.
  • [13] Xiaomin Fang, Jizhou Huang, Fan Wang, Lihang Liu, Yibo Sun, and Haifeng Wang. Ssml: Self-supervised meta-learner for en route travel time estimation at baidu maps. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2840–2848, 2021.
  • [14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
  • [15] Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado van Hasselt, David Silver, and Satinder Singh. Bootstrapped meta-learning. arXiv preprint arXiv:2109.04504, 2021.
  • [16] Adam Foster, Rattana Pukdee, and Tom Rainforth. Improving transformation invariance in contrastive representation learning. arXiv preprint arXiv:2010.09515, 2020.
  • [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [18] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149–5169, 2021.
  • [19] Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
  • [20] Mike Huisman, Jan N Van Rijn, and Aske Plaat. A survey of deep meta-learning. Artificial Intelligence Review, 54(6):4483–4541, 2021.
  • [21] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
  • [22] Huiwon Jang, Hankook Lee, and Jinwoo Shin. Unsupervised meta-learning via few-shot pseudo-supervised contrastive learning. arXiv preprint arXiv:2303.00996, 2023.
  • [23] Chia-Hsiang Kao, Wei-Chen Chiu, and Pin-Yu Chen. Maml is a noisy contrastive learner in classification. arXiv preprint arXiv:2106.15367, 2021.
  • [24] Siavash Khodadadeh, Ladislau Boloni, and Mubarak Shah. Unsupervised meta-learning for few-shot image classification. Advances in neural information processing systems, 32, 2019.
  • [25] Siavash Khodadadeh, Sharare Zehtabian, Saeed Vahidian, Weijia Wang, Bill Lin, and Ladislau Bölöni. Unsupervised meta-learning through latent-space interpolation in generative models. arXiv preprint arXiv:2006.10236, 2020.
  • [26] Deqian Kong, Bo Pang, and Ying Nian Wu. Unsupervised meta-learning via latent space energy-based model of symbol vector coupling. In Fifth Workshop on Meta-Learning at the Conference on Neural Information Processing Systems, 2021.
  • [27] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
  • [28] Rayan Krishnan, Pranav Rajpurkar, and Eric J Topol. Self-supervised learning in medicine and healthcare. Nature Biomedical Engineering, pages 1–7, 2022.
  • [29] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [30] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. The omniglot challenge: a 3-year progress report. Current Opinion in Behavioral Sciences, 29:97–104, 2019.
  • [31] Dong Bok Lee, Dongchan Min, Seanie Lee, and Sung Ju Hwang. Meta-gmvae: Mixture of gaussian vae for unsupervised meta-learning. In International Conference on Learning Representations, 2021.
  • [32] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
  • [33] Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876, 2021.
  • [34] Shuai Luo, Yujie Li, Pengxiang Gao, Yichuan Wang, and Seiichi Serikawa. Meta-seg: A survey of meta-learning for image segmentation. Pattern Recognition, page 108586, 2022.
  • [35] John J McArdle and Richard W Woodcock. Human cognitive abilities in theory and practice. Psychology Press, 2014.
  • [36] Sharada P Mohanty, David P Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in plant science, 7:1419, 2016.
  • [37] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2(3):4, 2018.
  • [38] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [39] Sangwoo Park, Osvaldo Simeone, and Joonhyuk Kang. Meta-learning to communicate: Fast end-to-end training for fading channels. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5075–5079. IEEE, 2020.
  • [40] Alexandra C Pike and Oliver J Robinson. Reinforcement learning in patients with mood and anxiety disorders vs control individuals: A systematic review and meta-analysis. JAMA psychiatry, 2022.
  • [41] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • [42] Sucheng Ren, Huiyu Wang, Zhengqi Gao, Shengfeng He, Alan Yuille, Yuyin Zhou, and Cihang Xie. A simple data mixing prior for improving self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14595–14604, 2022.
  • [43] James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. Advances in Neural Information Processing Systems, 32, 2019.
  • [44] Madeline C Schiappa, Yogesh S Rawat, and Mubarak Shah. Self-supervised learning for videos: A survey. ACM Computing Surveys, 2022.
  • [45] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • [46] Saeed Shurrab and Rehab Duwairi. Self-supervised learning methods and applications in medical imaging analysis: A survey. PeerJ Computer Science, 8:e1045, 2022.
  • [47] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • [48] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 403–412, 2019.
  • [49] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
  • [50] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in neural information processing systems, 33:6827–6839, 2020.
  • [51] Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
  • [52] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • [53] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4921–4930, 2022.
  • [54] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017.
  • [55] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
  • [56] Suorong Yang, Weikang Xiao, Mengcheng Zhang, Suhan Guo, Jian Zhao, and Furao Shen. Image data augmentation for deep learning: A survey. arXiv preprint arXiv:2204.08610, 2022.
  • [57] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
  • [58] Xueting Zhang, Debin Meng, Henry Gouk, and Timothy M Hospedales. Shallow bayesian meta learning for real-world few-shot recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 651–660, 2021.
  • [59] Yabin Zhang, Hui Tang, and Kui Jia. Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In Proceedings of the european conference on computer vision (ECCV), pages 233–248, 2018.
  • [60] Tony Z Zhao, Jianlan Luo, Oleg Sushkov, Rugile Pevceviciute, Nicolas Heess, Jon Scholz, Stefan Schaal, and Sergey Levine. Offline meta-reinforcement learning for industrial insertion. In 2022 International Conference on Robotics and Automation (ICRA), pages 6386–6393. IEEE, 2022.
  • [61] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
  • [62] Qing Zhu, Qirong Mao, Hongjie Jia, Ocquaye Elias Nii Noi, and Juanjuan Tu. Convolutional relation network for facial expression recognition in the wild with few-shot learning. Expert Systems with Applications, 189:116046, 2022.
  • [63] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pages 2069–2080, 2021.
  • [64] Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.