
Improving the Generalization of Meta-learning on Unseen Domains
via Adversarial Shift

Pinzhuo Tian
Nanjing University
tianpinzhuo@gmail.com
   Yang Gao
Nanjing University
gaoy@nju.edu.cn
Abstract

Meta-learning provides a promising way of learning to learn efficiently and has achieved great success in many applications. However, most of the meta-learning literature focuses on tasks from a single domain, which makes it brittle when generalizing to tasks from other, unseen domains. In this work, we address this problem by simulating tasks from other unseen domains to improve the generalization and robustness of meta-learning methods. Specifically, we propose a model-agnostic shift layer that learns how to simulate the domain shift and generate pseudo tasks, and we develop a new adversarial learning-to-learn mechanism to train it. Based on the pseudo tasks, the meta-learning model can learn cross-domain meta-knowledge that generalizes well to unseen domains. We conduct extensive experiments under the domain generalization setting. Experimental results demonstrate that the proposed shift layer is applicable to various meta-learning frameworks. Moreover, our method achieves state-of-the-art performance on different cross-domain few-shot classification benchmarks and produces good results on cross-domain few-shot regression.

1 Introduction

Learning quickly is a hallmark of human intelligence; even a child can recognize objects from a few examples. Meta-learning provides a promising strategy for enabling efficient learning from limited supervision, and it has achieved great success in many fields [28, 26], especially few-shot learning [14, 46]. However, in contrast to humans, who can easily leverage experience from a seen environment (or domain) to learn tasks from other unseen domains, most meta-learning models to date have focused on the situation where all tasks come from the same domain. Generalizing experience to unseen domains remains a challenge for recent meta-learning methods.

Moreover, the ability of meta-learning to generalize to unseen domains is also critical in practice, because many settings where meta-learning is applied are essentially cross-domain problems. For example, it is impossible to construct large training datasets for rare classes (e.g., rare bird species or certain diseases), so the auxiliary set for training the meta-learning model usually comes from other domains where annotated data is easy to collect. The meta-learning method is therefore expected to leverage meta-knowledge from the seen domain to enable efficient learning in unseen domains.

Although some works have paid attention to this problem [40, 25, 36], almost all of these methods are tailored to classification, which limits their applications. Typically, feature-wise transformation [40] can only be applied to metric-based meta-learning models, and [36] shows performance comparable to [40]. STARTUP [25] must use unlabeled data from the target domain. Moreover, the performance of some methods is sensitive to the degree of the domain shift; they even underperform traditional meta-learning methods when a large domain discrepancy exists between the training and target domains [12].

Therefore, in this work, we aim to propose a model-agnostic and domain-free method for improving the generalization of various meta-learning frameworks on unseen domains, in the sense that it can be applied to different learning problems and is robust whether the domain shift is small or large. Moreover, we do not need data from the unseen domains. The core idea is to generate tasks from other unseen domains and use these pseudo tasks, combined with true tasks sampled from the source domain, to learn domain-invariant meta-knowledge, which improves the generalization of meta-learning on unseen domains. To achieve this goal, we propose a shift layer that learns how to simulate the domain shift and generate tasks from unseen domains. To train it, we develop a new adversarial learning-to-learn mechanism. In this way, the meta-learning model and the shift layer can be jointly trained end-to-end.

We evaluate the proposed method with different meta-learning models on both regression and classification problems. Experiments demonstrate that our method is model-agnostic and robust to the degree of the domain shift.

The three primary contributions of this work are as follows:

  • We propose a shift layer to generate pseudo tasks from unseen domains. With these pseudo tasks, the meta-learning model can easily learn cross-domain meta-knowledge.

  • We develop an adversarial learning-to-learn mechanism that helps the shift layer learn to generate appropriate tasks that benefit the generalization of the meta-learning model.

  • The experimental results show that our method can achieve state-of-the-art performance on cross-domain few-shot classification, and also effectively improves the generalization of various meta-learning models on unseen domains in few-shot regression.

2 Related Work

2.1 Meta-learning

Meta-learning aims to assist learning on new tasks by studying how learning models perform across tasks. Recent meta-learning methods can be broadly divided into three categories: metric-based, gradient-based, and model-based methods.

Metric-based methods. The metric-based meta-learning framework can be considered learning to compare: a nonparametric similarity function is designed to evaluate the similarity between examples. For example, Matching networks [41] first use an attention recurrent network as a feature encoder to map images from different classes into a common meta-feature space and apply cosine similarity to obtain predictions; Prototypical networks [35] adopt Euclidean distance; and RelationNet [37] directly utilizes a deep distance metric to measure similarity. In general, metric-based methods are simple and effective; however, they are thus far restricted to classification.

Model-based methods. In this category, the meta-learning model is usually designed as a parameterized predictor that generates parameters for new tasks. For example, Ravi et al. [29] and Santoro et al. [31] both used recurrent neural networks as the predictor.

Gradient-based methods. Gradient-based methods focus on extracting the meta-knowledge required to improve optimization performance. Model-agnostic meta-learning (MAML) [10] regards the initialization of a deep network as meta-knowledge and aims to learn a good initialization for all tasks, so that the learner of a new task needs only a few gradient steps from this initialization. R2D2 [4] and MetaOpt [18] adopt ridge regression and support vector machines [7], respectively, as the task-specific learner for each task. With these linear classification models, they can learn a feature embedding that generalizes well to new tasks. Compared with metric-based methods, gradient-based methods can be applied to many learning problems, e.g., regression and reinforcement learning. However, they usually suffer from second-order derivatives; Raghu et al. [27] proposed ANIL to relieve this problem.

2.2 Domain Adaptation

There is a substantial body of work on domain adaptation [44], which aims to learn, from one or multiple source domains, a model that performs well on a target domain. Early methods of domain adaptation generally rely on instance re-weighting [8] or model parameter adaptation [48]. Since the emergence of domain adversarial neural networks (DANN) [11], recent frameworks [19, 47] have mainly applied adversarial training to alleviate the domain shift between source and target domains. Some methods instead use discrepancy-based frameworks to align the marginal distributions between domains [21, 15].

However, most domain adaptation frameworks follow two strict priors: the label spaces of the source and target domains are the same, and numerous unlabeled images in the target domain are available. According to the arguments in [40, 12], these assumptions may not be realistic and restrict domain adaptation frameworks from handling novel concepts. Our work considers how to improve the generalization of the learning model on new concepts from unseen domains.

Figure 1: Method overview. In each update, we sample three tasks from the source domain, e.g., $\mathcal{T}_1$, $\mathcal{T}_2$, and $\mathcal{T}_3$ in this figure. $\mathcal{T}_1$ is used to help the proposed shift layer with initialization $\boldsymbol{\phi}$ learn how to simulate the domain shift in unseen domains. Then, based on $\mathcal{T}_2$, the adapted shift layer $\boldsymbol{\phi}^{\prime}$ generates a pseudo task $\hat{\mathcal{T}}_2$ that is supposed to come from another domain. Finally, $\hat{\mathcal{T}}_2$ and $\mathcal{T}_3$ are used to optimize the meta-parameter together.

2.3 Domain Generalization

In contrast to domain adaptation, domain generalization methods [3, 33, 23] are devoted to learning features that transfer well to unseen domains. These models do not need data from the unseen domains during training. Most recently, meta-learning based approaches [3, 20] have been proposed for domain generalization and achieve impressive results. The idea of these methods is to use episodic training to simulate the domain shift between the training and evaluation stages; in this way, better generalization on unseen domains is achieved. Yet existing domain generalization approaches still tackle the problem under the assumption that instances in the training stage share the same label space as the data in the unseen domain. Besides the shared label space, these algorithms also require access to training instances drawn from multiple source domains rather than a single one. Our method does not have these limitations.

Compared with domain adaptation and domain generalization, this work studies a more challenging setting, which requires only a single source domain and must generalize well to new concepts from new domains.

2.4 Cross-domain Few-shot Classification

Cross-domain few-shot classification is a scenario to which our work can be applied. In cross-domain few-shot classification, training and novel classes are drawn from different domains, and their class label sets are disjoint. This scenario is very difficult, and therefore few works target it. Typically, Tseng et al. [40] used feature-wise transformation to improve the generalization ability of the learned representations, and Guo et al. [12] conducted a broader study of cross-domain few-shot classification and proposed a challenging benchmark. Phoo et al. [25] introduced a self-training approach that allows few-shot learners to adapt feature representations to the target domain using some unlabeled data from that domain. LRP [36] uses explanation-guided training to improve performance.

However, all of these approaches are tailored to classification. Moreover, some of them are sensitive to the degree of the domain shift or rely on a transductive setting. Our method is model-agnostic and inductive, and it can be applied to many learning problems.

3 Preliminary

We first present the meta-learning problem in typical few-shot learning and then generalize it to the setting that our method aims to solve.

3.1 Meta-learning in Typical Few-shot Learning

In typical few-shot learning, the meta-learning model has access to a set of training tasks $\mathcal{S}^{\text{meta}}=\{\mathcal{T}_{i}\}_{i=1}^{T}$ drawn from a task distribution $P(\mathcal{T})$. Each task $\mathcal{T}_{i}$ contains a dataset $\mathcal{D}_{i}$, which is usually divided into two disjoint sets: $\mathcal{D}^{\text{tr}}_{i}$ and $\mathcal{D}^{\text{ts}}_{i}$. Each of these sets consists of $K$ data-label pairs; letting $\mathbf{x}\in\mathcal{X}$ and $\mathbf{y}\in\mathcal{Y}$ denote data and labels, respectively, $\mathcal{D}^{\text{tr}}_{i}=\{(\mathbf{x}^{k}_{i},\mathbf{y}^{k}_{i})\}_{k=1}^{K}$, and similarly for $\mathcal{D}^{\text{ts}}_{i}$. In general, $\mathcal{D}^{\text{tr}}_{i}$ and $\mathcal{D}^{\text{ts}}_{i}$ in the same task share the same label space, while different tasks have different label spaces.
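To make the episodic setup concrete, the following sketch shows one way to sample an N-way K-shot task from a labeled dataset; the names (FewShotTask, sample_task, images_by_class) are illustrative, not from the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class FewShotTask:
    support_x: torch.Tensor  # D^tr: K examples per class
    support_y: torch.Tensor
    query_x: torch.Tensor    # D^ts: disjoint examples, same label space
    query_y: torch.Tensor

def sample_task(images_by_class, n_way=5, k_shot=1, q_query=15):
    """Sample an N-way K-shot task: classes are relabeled 0..n_way-1,
    and support/query examples are disjoint within the task."""
    classes = torch.randperm(len(images_by_class))[:n_way]
    sup_x, sup_y, qry_x, qry_y = [], [], [], []
    for new_label, c in enumerate(classes.tolist()):
        pool = images_by_class[c]                      # tensor [num_images, ...]
        idx = torch.randperm(pool.shape[0])[:k_shot + q_query]
        sup_x.append(pool[idx[:k_shot]])
        sup_y.append(torch.full((k_shot,), new_label))
        qry_x.append(pool[idx[k_shot:]])
        qry_y.append(torch.full((q_query,), new_label))
    return FewShotTask(torch.cat(sup_x), torch.cat(sup_y),
                       torch.cat(qry_x), torch.cat(qry_y))
```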

We suppose that all tasks share the same learning algorithm $\mathcal{A}lg$ and loss function $\mathcal{L}$. Each task $\mathcal{T}_{i}$ has its own learning model (or base-learner) $\mathcal{A}lg_{i}$ parameterized by $\mathbf{w}_{i}\in\mathbb{R}^{d}$. The meta-learning model is interested in learning a meta-learner, e.g., a neural network, from the meta-training set $\mathcal{S}^{\text{meta}}$, which can help the base-learner $\mathcal{A}lg_{j}$ of a new task $\mathcal{T}_{j}$ learn efficiently from a few labeled data. This motivation can be formulated as:

\min_{\boldsymbol{\theta}}\sum_{i=1}^{T}\mathcal{L}(\boldsymbol{\theta},\mathbf{w}_{i};\mathcal{D}^{\text{ts}}_{i}) (1)
\text{s.t.}~\mathbf{w}_{i}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\mathcal{D}^{\text{tr}}_{i}), (2)

where $\boldsymbol{\theta}$ denotes the meta-parameter. In particular, Eq. 1 and Eq. 2 can be solved as a bi-level optimization problem [6].

Note: metric-based meta-learning methods do not have the step in Eq. 2, because their base-learner is a nonparametric distance function.
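For gradient-based frameworks, the sketch below instantiates the bi-level problem of Eqs. 1 and 2 with a base-learner defined by a few gradient steps, in the style of MAML [10]; this specific choice, and the helper names (model_fn as a functional forward pass over a parameter dict), are assumptions for illustration, since the formulation itself is agnostic to the base-learner.

```python
import torch

def inner_adapt(params, task, model_fn, loss_fn, inner_lr=0.01, steps=1):
    """Eq. 2: obtain task-specific weights w_i by gradient descent on D^tr_i."""
    w = dict(params)
    for _ in range(steps):
        loss = loss_fn(model_fn(w, task.support_x), task.support_y)
        # create_graph=True keeps second-order gradients for the outer update
        grads = torch.autograd.grad(loss, list(w.values()), create_graph=True)
        w = {name: p - inner_lr * g for (name, p), g in zip(w.items(), grads)}
    return w

def meta_step(params, tasks, model_fn, loss_fn, meta_opt):
    """Eq. 1: update the meta-parameters theta using the query losses."""
    meta_opt.zero_grad()
    total = 0.0
    for task in tasks:
        w_i = inner_adapt(params, task, model_fn, loss_fn)
        total = total + loss_fn(model_fn(w_i, task.query_x), task.query_y)
    total.backward()
    meta_opt.step()
```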

In the meta-test stage, when faced with a new task $\mathcal{T}_{j}\sim P(\mathcal{T})$, the base-learner $\mathcal{A}lg_{j}$ can achieve good performance by adapting $\mathbf{w}_{j}$ with the learned meta-parameter.

3.2 Review of Cross-domain Setting

Different from typical few-shot learning, we address the few-shot problem under the domain generalization setting. In other words, the meta-training set $\mathcal{S}^{\text{meta}}=\{\mathcal{T}_{i}\}_{i=1}^{T}$ is sampled from a seen (source) domain, and the meta-learning model trained on $\mathcal{S}^{\text{meta}}$ is supposed to help new tasks from other unseen domains learn fast. More specifically, there exists a distribution discrepancy between the dataset $\mathcal{D}_{i}$ in a training task $\mathcal{T}_{i}\in\mathcal{S}^{\text{meta}}$ and $\mathcal{D}_{j}$ in a new task $\mathcal{T}_{j}$. Moreover, we cannot access images from the unseen domain during the training stage.

4 Methodology

In this paper, our main idea is to learn how to simulate the domain shift that exists between different domains. In this way, we can generate pseudo tasks from other domains. With these tasks, the generalization of the meta-learning system on real unseen domains is expected to improve.

4.1 Feature-wise Shift Layer

First, we introduce a Feature-wise Shift Layer (FiSL), which is used to simulate the domain shift in our method. The architecture of FiSL is based on feature-wise transformation [24], which has proven capable of representing domain-specific information in many works [9, 24]. In few-shot learning, feature-wise transformation has already been adopted to dynamically represent domain-specific [40] and task-specific information [22].

Suppose that the meta-learner contains a feature encoder $f_{\boldsymbol{\theta}}$ with parameters $\boldsymbol{\theta}\in\Theta$, and that we are given a feature activation map $\mathbf{z}_{0}\in\mathcal{Z}$ of an image $\mathbf{x}_{0}$ from the last layer of the feature encoder, with dimension $C\times H\times W$. The output $\mathbf{z}$ of our shift layer is

\mathbf{z}=\boldsymbol{\gamma}\odot\mathbf{z}_{0}+\boldsymbol{\beta}, ~~\text{where}~~ \mathbf{z}_{0}=f_{\boldsymbol{\theta}}(\mathbf{x}_{0})\in\mathbb{R}^{C\times H\times W}, (3)

where $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learnable scaling and shift vectors applied as an affine transformation. For ease of notation, Eq. 3 is denoted by $\mathbf{z}=\text{FiSL}(\mathbf{x}_{0})$.

After training, the shift layer with $\boldsymbol{\phi}=\{\boldsymbol{\gamma},\boldsymbol{\beta}\}$ is expected to be able to transfer images from the source domain to other unseen domains.
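A minimal PyTorch sketch of the shift layer in Eq. 3 follows. Per-channel $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ that broadcast over the spatial dimensions are consistent with feature-wise transformation [24], though the exact parameter shape is our assumption.

```python
import torch
import torch.nn as nn

class FiSL(nn.Module):
    """Feature-wise Shift Layer: z = gamma * z0 + beta (Eq. 3)."""
    def __init__(self, num_channels):
        super().__init__()
        # Initialized to the identity transform: gamma = 1, beta = 0.
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, z0):
        # z0: feature map of shape [batch, C, H, W] from the encoder f_theta.
        return self.gamma * z0 + self.beta
```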

4.2 Adversarial Learning-to-learn Mechanism

However, training a meta-learning model with FiSL is an intractable problem for two reasons: (1) how to make FiSL learn to generate pseudo tasks; and (2) how to make the meta-learning model learn useful domain-invariant meta-knowledge from these tasks.

4.2.1 How to Generate Pseudo Tasks

For the first problem, we are interested in training FiSL on a single domain $P_{0}$ to simulate the domain shift and generate pseudo tasks from unforeseen domains $P$, in order to improve the generalization and robustness of meta-learning.

Inspired by recent developments in robust optimization and adversarial data augmentation [34, 42], we cast the first problem as the worst-case problem around the (training) source distribution $P_{0}$:

\min_{\boldsymbol{\theta}}\sup_{P:D(P,P_{0})\leq\rho}\mathbb{E}_{P}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{i};\mathcal{D}^{\text{ts}}_{i})], ~~\text{where}~~ \mathbf{w}^{*}_{i}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\mathcal{D}^{\text{tr}}_{i}). (4)

Here, $P_{0}$ represents the distribution that images in the seen (source) domain follow, and $P$ is the distribution that FiSL simulates. $D(P,P_{0})$ is a distance metric on the space of probability distributions. $\mathcal{D}^{\text{tr}}_{i}$ and $\mathcal{D}^{\text{ts}}_{i}$ are the support (training) and query (test) sets of task $\mathcal{T}_{i}$ from the source domain $P_{0}$.

The solution of Eq. 4 guarantees good performance (robustness) of the learned $\boldsymbol{\theta}$ against any data distribution $P$ within distance $\rho$ of the source domain $P_{0}$. In other words, the meta-parameter $\boldsymbol{\theta}$ can achieve good generalization on unseen tasks by solving Eq. 4.

We first focus on how FiSL simulates the unforeseen distributions $P$. To preserve the semantics of the input samples, similar to [42, 51], we use the Wasserstein distance defined on the latent feature space $\mathcal{Z}$ as our metric $D$ to constrain the distributions FiSL simulates. Let $c_{\boldsymbol{\theta}}:\mathcal{Z}\times\mathcal{Z}\rightarrow\mathbb{R}_{+}\cup\{\infty\}$ denote the transportation cost of moving mass from $(\mathbf{x}_{0},\mathbf{y}_{0})$ to $(\mathbf{x},\mathbf{y})$:

c_{\boldsymbol{\theta}}((\mathbf{x}_{0},\mathbf{y}_{0}),(\mathbf{x},\mathbf{y}))\coloneqq\frac{1}{2}\|\mathbf{z}_{0}-\mathbf{z}\|^{2}_{2}+\infty\cdot\mathbf{1}\{\mathbf{y}_{0}\neq\mathbf{y}\}, (5)

where $\mathbf{z}_{0}=f_{\boldsymbol{\theta}}(\mathbf{x}_{0})$ and $\mathbf{z}=\text{FiSL}(\mathbf{x}_{0})$; $\mathbf{x}$ is the pseudo datum corresponding to $\mathbf{z}$. For probability measures $P$ and $P_{0}$ supported on $\mathcal{Z}$, let $\Pi(P,P_{0})$ denote the set of their couplings. Then our metric is defined by

D_{\boldsymbol{\theta}}(P,P_{0})\coloneqq\inf_{M\in\Pi(P,P_{0})}\mathbb{E}_{M}[c_{\boldsymbol{\theta}}((\mathbf{x}_{0},\mathbf{y}_{0}),(\mathbf{x},\mathbf{y}))]. (6)
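In practice, since FiSL only perturbs the features of the same labeled sample, the label-mismatch indicator in Eq. 5 never fires, and the per-sample cost reduces to a squared Euclidean distance in feature space; a minimal sketch:

```python
import torch

def transport_cost(z0, z):
    """Per-sample cost c_theta of Eq. 5 between original features
    z0 = f_theta(x0) and shifted features z = FiSL(x0); labels match by
    construction, so the infinite penalty term is omitted."""
    diff = (z0 - z).flatten(start_dim=1)
    return 0.5 * diff.pow(2).sum(dim=1)  # shape: [batch]
```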

Armed with this notion of distance on the semantic space, we now consider a variant of the worst-case problem in Eq. 4 in which the distance is replaced by our adaptive metric $D_{\boldsymbol{\theta}}$ from Eq. 6:

\min_{\boldsymbol{\theta}}\sup_{P:D_{\boldsymbol{\theta}}(P,P_{0})\leq\rho}\mathbb{E}_{P}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{i};\mathcal{D}^{\text{ts}}_{i})]. (7)

However, for deep neural networks this formulation is intractable for arbitrary $\rho$. Instead, following the reformulation of [34, 42], we consider its Lagrangian relaxation for a fixed penalty parameter $\gamma$:

\min_{\boldsymbol{\theta}}\sup_{P}\{\mathbb{E}_{P}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{i};\mathcal{D}^{\text{ts}}_{i})-\gamma D_{\boldsymbol{\theta}}(P,P_{0})]\}, (8)

where $\mathbf{w}^{*}_{i}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\mathcal{D}^{\text{tr}}_{i})$. Taking the dual reformulation of the penalty relaxation in Eq. 8, we obtain an efficient solution procedure: simulate the unseen distribution $P$ with FiSL, then learn a robust $\boldsymbol{\theta}$ with it.

Based on Theorem 1, which is a minor adaptation of Lemma 1 in [42], we propose an iterative training procedure to solve the penalty problem (Eq. 8).

Theorem 1

Let $\mathcal{L}:(\Theta\times\mathbb{R}^{d})\times(\mathcal{X}\times\mathcal{Y})\rightarrow\mathbb{R}$ be the loss, and let $\phi_{\gamma}$ denote the robust surrogate loss. Then, for any distribution $P_{0}$ and any $\gamma\geq 0$, we have

\sup_{P}\{\mathbb{E}_{P}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w};\mathcal{D}^{\text{ts}}_{i})-\gamma D_{\boldsymbol{\theta}}(P,P_{0})]\}=\mathbb{E}_{(\mathbf{x},\mathbf{y})\in\mathcal{D}^{\text{ts}}_{i}}[\phi_{\gamma}(\boldsymbol{\theta},\mathbf{w};\mathbf{x},\mathbf{y})], (9)
\text{where}~~\phi_{\gamma}(\boldsymbol{\theta},\mathbf{w};\mathbf{x}_{0},\mathbf{y}_{0})=\sup_{\mathbf{x}\in\mathcal{X}}\{\mathcal{L}(\boldsymbol{\theta},\mathbf{w};\mathbf{x},\mathbf{y}_{0})-\gamma c_{\boldsymbol{\theta}}((\mathbf{x}_{0},\mathbf{y}_{0}),(\mathbf{x},\mathbf{y}_{0}))\}. (10)

Our training procedure contains two phases: a maximization phase, in which FiSL learns how to simulate the domain shift by solving the maximization problem (Eq. 10), and a minimization phase, in which the meta-parameter $\boldsymbol{\theta}$ performs stochastic gradient descent on the robust surrogate $\phi_{\gamma}$. Note that $\mathbf{x}$ in Eq. 10 is generated by FiSL in our method.

Maximization phase. In the maximization phase, a new task $\mathcal{T}_{j}$ drawn from the source domain $P_{0}$ is given to help FiSL learn how to simulate the domain shift. This phase can be formulated as

\boldsymbol{\phi}^{\prime}=\boldsymbol{\phi}+\eta\nabla_{\boldsymbol{\phi}}\{\mathbb{E}_{(\mathbf{x}_{0},\mathbf{y}_{0})\in\mathcal{D}^{\text{ts}}_{j}}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{j};\mathbf{x},\mathbf{y}_{0})-\gamma c_{\boldsymbol{\theta}}((\mathbf{x}_{0},\mathbf{y}_{0}),(\mathbf{x},\mathbf{y}_{0}))]\}, (11)

where $\mathbf{w}^{*}_{j}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\mathcal{D}^{\text{tr}}_{j})$ and $\mathbf{x}$ is the pseudo datum generated by FiSL.
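A sketch of this phase follows: a few gradient ascent steps on the FiSL parameters increase the query loss of the adapted base-learner while paying the transport-cost penalty, per Eq. 11. The helper names (encoder, head, and transport_cost from the sketch above) and the single-step default are assumptions.

```python
import torch

def maximization_phase(encoder, fisl, head, loss_fn, task,
                       gamma=0.5, eta=0.01, steps=1):
    """Eq. 11: adapt phi by gradient *ascent* on L - gamma * c_theta.
    `head` is the base-learner w*_j already fitted on the original D^tr_j."""
    phi = [fisl.gamma, fisl.beta]
    for _ in range(steps):
        z0 = encoder(task.query_x)              # original query features
        z = fisl(z0)                            # pseudo features via FiSL
        loss = loss_fn(head(z), task.query_y)   # L(theta, w*_j; x, y0)
        cost = transport_cost(z0.detach(), z).mean()
        grads = torch.autograd.grad(loss - gamma * cost, phi)
        with torch.no_grad():                   # ascent step: phi <- phi + eta * grad
            for p, g in zip(phi, grads):
                p.add_(eta * g)
    return fisl
```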

Minimization phase. With the learned FiSL, a task $\mathcal{T}_{i}$ can be transformed into a pseudo task $\hat{\mathcal{T}}_{i}$ from another domain, which is used to optimize the meta-parameter $\boldsymbol{\theta}$:

\boldsymbol{\theta}=\boldsymbol{\theta}-\alpha\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\mathbf{w}_{i}^{*};\hat{\mathcal{D}}^{\text{ts}}_{i}), (12)

where $\mathbf{w}_{i}^{*}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\hat{\mathcal{D}}^{\text{tr}}_{i})$, $\hat{\mathcal{D}}^{\text{tr}}_{i}=\text{FiSL}_{\boldsymbol{\phi}^{\prime}}(\mathcal{D}^{\text{tr}}_{i})$, and $\hat{\mathcal{D}}^{\text{ts}}_{i}=\text{FiSL}_{\boldsymbol{\phi}^{\prime}}(\mathcal{D}^{\text{ts}}_{i})$.

4.2.2 How to Learn Domain-invariant Knowledge

To learn domain-invariant meta-knowledge, inspired by multi-task learning [50], besides optimizing $\boldsymbol{\theta}$ with the pseudo task $\hat{\mathcal{T}}_{i}$ in Eq. 12, we sample an additional task $\mathcal{T}_{k}$ from the source domain to jointly optimize $\boldsymbol{\theta}$:

\min_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{k};\mathcal{D}^{\text{ts}}_{k})+\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{i};\hat{\mathcal{D}}^{\text{ts}}_{i}), (13)

where $\mathbf{w}^{*}_{k}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\mathcal{D}^{\text{tr}}_{k})$ and $\mathbf{w}_{i}^{*}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta},\hat{\mathcal{D}}^{\text{tr}}_{i})$. $\hat{\mathcal{D}}^{\text{tr}}_{i}$ and $\hat{\mathcal{D}}^{\text{ts}}_{i}$ are transformed by the FiSL learned via Eq. 11.

Moreover, to learn cross-domain meta-knowledge, FiSL should dynamically generate various unseen domains based on different tasks. Hence, similar to MAML, we learn a good initialization for FiSL. From this learned initialization, an unseen domain can be simulated with a few gradient steps via Eq. 11. In particular, a good initialization is appropriate for simulating many unseen domains, in the sense that it can be regarded as encoding domain-invariant knowledge. In the meta-test stage, we transform the data from unseen domains with this initialization to achieve better generalization. The full algorithm is summarized in Algorithm 1, and an overview of our method is shown in Figure 1.

Input: Three sets of training tasks $\mathcal{S}_{1}=\{\mathcal{T}_{3i-2}\}_{i=1}^{N}$, $\mathcal{S}_{2}=\{\mathcal{T}_{3i-1}\}_{i=1}^{N}$, and $\mathcal{S}_{3}=\{\mathcal{T}_{3i}\}_{i=1}^{N}$ sampled from the source domain
Output: Learned weights $\boldsymbol{\theta},\boldsymbol{\phi}$
1  Initialize $\boldsymbol{\theta},\boldsymbol{\phi}$
2  while not converged do
3      for i = 1, …, N do
4          Train a base-learner $\mathbf{w}^{*}_{3i-2}$ for $\mathcal{T}_{3i-2}$ with $\mathcal{D}^{\text{tr}}_{3i-2}$
5          Use $\mathcal{T}_{3i-2}$ to obtain $\boldsymbol{\phi}^{\prime}$ via Eq. 11
6          Generate a pseudo task $\hat{\mathcal{T}}_{3i-1}$ based on $\mathcal{T}_{3i-1}$: $\hat{\mathcal{D}}^{\text{tr}}_{3i-1}=\text{FiSL}_{\boldsymbol{\phi}^{\prime}}(\mathcal{D}^{\text{tr}}_{3i-1})$ and $\hat{\mathcal{D}}^{\text{ts}}_{3i-1}=\text{FiSL}_{\boldsymbol{\phi}^{\prime}}(\mathcal{D}^{\text{ts}}_{3i-1})$
7          Train base-learners for the pseudo task $\hat{\mathcal{T}}_{3i-1}$ and task $\mathcal{T}_{3i}$, respectively
8          Update $\boldsymbol{\theta},\boldsymbol{\phi}$ via the multi-task loss in Eq. 13:
           $\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\alpha\nabla_{\boldsymbol{\theta}}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{3i};\mathcal{D}^{\text{ts}}_{3i})+\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{3i-1};\hat{\mathcal{D}}^{\text{ts}}_{3i-1})]$
           $\boldsymbol{\phi}\leftarrow\boldsymbol{\phi}-\alpha\nabla_{\boldsymbol{\phi}}[\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{3i};\mathcal{D}^{\text{ts}}_{3i})+\mathcal{L}(\boldsymbol{\theta},\mathbf{w}^{*}_{3i-1};\hat{\mathcal{D}}^{\text{ts}}_{3i-1})]$
9      end for
10 end while
Algorithm 1: Learning-to-learn Adversarial Shift

Note: when Algorithm 1 is applied to a metric-based meta-learning method, it is not necessary to train a base-learner for each task, i.e., steps 4 and 7 are skipped.
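Putting the pieces together, the sketch below walks through one iteration of Algorithm 1, reusing the hypothetical helpers from the earlier sketches. It is first-order: gradients of the multi-task loss are taken at the adapted copy $\boldsymbol{\phi}^{\prime}$ and copied back to the initialization $\boldsymbol{\phi}$, a FOMAML-style simplification of our own; head_fn is a user-supplied routine that fits a base-learner on features and returns a callable (e.g., ridge regression as in R2D2).

```python
import copy
import torch

def algorithm1_iteration(encoder, fisl, head_fn, loss_fn, sample_task_fn,
                         theta_opt, phi_opt):
    t1, t2, t3 = sample_task_fn(), sample_task_fn(), sample_task_fn()

    # Steps 4-5: fit a base-learner on T1, then adapt the shift layer (Eq. 11).
    head1 = head_fn(encoder(t1.support_x), t1.support_y)
    fisl_prime = maximization_phase(encoder, copy.deepcopy(fisl), head1,
                                    loss_fn, t1)

    # Step 6: generate the pseudo task T2_hat in feature space via FiSL_phi'.
    z2_sup = fisl_prime(encoder(t2.support_x))
    z2_qry = fisl_prime(encoder(t2.query_x))

    # Step 7: fit base-learners for the pseudo task T2_hat and source task T3.
    head2 = head_fn(z2_sup, t2.support_y)
    head3 = head_fn(encoder(t3.support_x), t3.support_y)

    # Step 8: multi-task loss of Eq. 13 updates theta and (first-order) phi.
    loss = (loss_fn(head3(encoder(t3.query_x)), t3.query_y)
            + loss_fn(head2(z2_qry), t2.query_y))
    theta_opt.zero_grad(); phi_opt.zero_grad()
    loss.backward()
    with torch.no_grad():  # copy grads from the adapted copy back onto phi
        for p, p_prime in zip(fisl.parameters(), fisl_prime.parameters()):
            p.grad = None if p_prime.grad is None else p_prime.grad.clone()
    theta_opt.step(); phi_opt.step()
    return loss.item()
```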

In the meta-test stage, the learned $\boldsymbol{\theta}^{*}$ and $\boldsymbol{\phi}^{*}$ help the meta-learning method achieve good generalization on a new task $\mathcal{T}_{l}$ from the unseen domains:

\mathbf{w}_{l}=\arg\min_{\mathbf{w}}\mathcal{L}(\mathbf{w};\boldsymbol{\theta}^{*},\hat{\mathcal{D}}^{\text{tr}}_{l}), \quad \hat{\mathcal{D}}^{\text{tr}}_{l}=\text{FiSL}_{\boldsymbol{\phi}^{*}}(\mathcal{D}^{\text{tr}}_{l}). (14)
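A sketch of this meta-test stage (Eq. 14): the support features of a task from an unseen domain are transformed with the learned FiSL initialization before fitting the base-learner. Transforming the query features as well is our assumption, so that support and query live in the same feature space.

```python
import torch

def meta_test(encoder, fisl, head_fn, task):
    """Eq. 14: fit w_l on FiSL-transformed support features, then predict."""
    with torch.no_grad():
        z_sup = fisl(encoder(task.support_x))   # D^tr_l transformed by FiSL_phi*
        z_qry = fisl(encoder(task.query_x))
    head = head_fn(z_sup, task.support_y)       # base-learner w_l
    return head(z_qry)                          # predictions on the query set
```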
Figure 2: Results of cross-domain few-shot regression. Top: ANIL and ANIL-FiSL (ours). Bottom: MAML and MAML-FiSL (ours). The red line is the ground truth. Our method helps different meta-learning models achieve better performance on the unseen domain.

5 Experiments

In this section, we evaluate our method on different learning problems, including regression and classification, to verify the versatility of the proposed method. In each setting, our method is used to improve the generalization of different metric-based and gradient-based meta-learning models on unseen domains, demonstrating that our method is model-agnostic and can indeed help learn robust meta-knowledge.

5.1 Cross-domain Regression

Experimental Setup. We start with the sine-curve regression problem of [10]. Each task involves regressing from the input to the output of a sine wave. The amplitude and phase of the training tasks are uniformly sampled from $[0.1,3.0]$ and $[0,\frac{3}{4}\pi]$, respectively, whereas the amplitude and phase of the test tasks are uniformly drawn from $[3.0,5.0]$ and $[\frac{3}{4}\pi,\pi]$, respectively. For training, five labeled datapoints are given for each task as $\mathcal{D}^{\text{tr}}$, and twenty labeled datapoints are sampled as $\mathcal{D}^{\text{ts}}$; both are uniformly sampled from $[-5,5]$. We use a neural network with two hidden layers of 40 nodes each as the feature encoder and mean-squared error (MSE) as the loss $\mathcal{L}$.
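A sketch of the cross-domain sine-wave task sampler described above; the function name and signature are illustrative, not from the paper.

```python
import math
import torch

def sample_sine_task(train=True, k_support=5, k_query=20):
    """Sample one sine regression task; train=True uses the source ranges,
    train=False the disjoint unseen-domain ranges."""
    if train:
        amp = torch.empty(1).uniform_(0.1, 3.0)
        phase = torch.empty(1).uniform_(0.0, 0.75 * math.pi)
    else:  # unseen domain
        amp = torch.empty(1).uniform_(3.0, 5.0)
        phase = torch.empty(1).uniform_(0.75 * math.pi, math.pi)
    x = torch.empty(k_support + k_query, 1).uniform_(-5.0, 5.0)
    y = amp * torch.sin(x + phase)
    return (x[:k_support], y[:k_support]), (x[k_support:], y[k_support:])
```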

Our method is applied to two gradient-based methods: MAML [10] and ANIL [27]. Because our method is trained with a multi-task mechanism, for fairness the task batch size for training MAML and ANIL is set to 2. All models are trained for 20,000 iterations with Adam and a learning rate of 0.001. MAML and ANIL each use one inner gradient step, with learning rates of 0.01 and 0.1, respectively. At test time, we present the model with 2,000 newly sampled tasks from the disjoint domain and measure the mean squared error over 100 test points on each task. $\gamma$ and $\eta$ in our method are 0.5 and 0.01, respectively.

Results. Table 1 shows the results of different models on cross-domain few-shot regression. Our method effectively improves the generalization of different gradient-based models on the unseen domain, and the results verify that our method works well on regression problems. Figure 2 shows results of these models on new tasks from the unseen domain.

Methods      FiSL   5-shot          10-shot
ANIL [27]    -      4.256 ± 0.127   3.080 ± 0.075
             ✓      1.889 ± 0.075   0.961 ± 0.034
MAML [10]    -      3.558 ± 0.087   2.168 ± 0.060
             ✓      1.712 ± 0.075   0.935 ± 0.035
Table 1: Mean squared error (MSE) of cross-domain few-shot regression; lower is better. FiSL indicates that the shift layer with the proposed adversarial mechanism is applied when training the model.

5.2 Cross-domain Classification

5-way 1-shot  FiSL   CUB            Car            ISIC           ChestX
ProNet        -      40.03 ± 0.58   30.60 ± 0.48   30.63 ± 0.47   22.20 ± 0.33
              ✓      40.94 ± 0.58   31.11 ± 0.48   30.52 ± 0.47   22.43 ± 0.35
RelationNet   -      39.30 ± 0.56   28.34 ± 0.43   29.64 ± 0.46   22.12 ± 0.33
              ✓      40.29 ± 0.57   29.00 ± 0.46   28.75 ± 0.44   21.92 ± 0.32
ANIL          -      32.11 ± 0.55   26.58 ± 0.41   24.45 ± 0.34   21.19 ± 0.25
              ✓      39.49 ± 0.59   30.21 ± 0.50   31.09 ± 0.47   21.88 ± 0.32
MetaOpt       -      42.33 ± 0.60   32.04 ± 0.47   29.61 ± 0.45   22.10 ± 0.31
              ✓      45.53 ± 0.61   34.67 ± 0.51   33.82 ± 0.51   22.91 ± 0.33
R2D2          -      42.81 ± 0.62   33.15 ± 0.49   30.04 ± 0.46   22.35 ± 0.32
              ✓      44.14 ± 0.60   34.18 ± 0.51   32.47 ± 0.49   23.02 ± 0.34

5-way 5-shot  FiSL   CUB            Car            ISIC           ChestX
ProNet        -      57.26 ± 0.57   42.83 ± 0.53   39.89 ± 0.43   25.03 ± 0.34
              ✓      57.84 ± 0.56   43.11 ± 0.54   41.47 ± 0.43   25.40 ± 0.35
RelationNet   -      55.71 ± 0.56   37.86 ± 0.51   38.10 ± 0.43   24.11 ± 0.32
              ✓      56.11 ± 0.53   39.66 ± 0.55   38.39 ± 0.45   23.73 ± 0.32
ANIL          -      37.24 ± 0.57   28.79 ± 0.40   27.90 ± 0.38   20.93 ± 0.17
              ✓      57.56 ± 0.58   44.34 ± 0.57   41.82 ± 0.45   25.00 ± 0.34
MetaOpt       -      61.66 ± 0.60   50.55 ± 0.56   44.10 ± 0.47   26.36 ± 0.35
              ✓      63.24 ± 0.57   50.82 ± 0.58   46.39 ± 0.46   26.46 ± 0.34
R2D2          -      62.31 ± 0.59   49.49 ± 0.57   42.81 ± 0.44   25.92 ± 0.34
              ✓      64.62 ± 0.58   51.62 ± 0.60   47.62 ± 0.47   26.48 ± 0.36
Table 2: Few-shot classification results on unseen domains. We train the model on the mini-ImageNet domain and evaluate it on the other domains. FiSL is our method.

In cross-domain few-shot classification, we validate the efficacy of the proposed method with two categories of meta-learning frameworks, i.e., metric-based and gradient-based frameworks. For metric-based methods, we choose ProNet [35] and RelationNet [37]; for gradient-based methods, ANIL, R2D2 [4], and MetaOptNet [18] are chosen. To evaluate performance on unseen domains, we train the few-shot classification model on the mini-ImageNet [41] domain and evaluate it on four different domains: CUB [43], Cars [17], ISIC [39], and ChestX [45]. CUB and Cars are two well-established benchmarks for cross-domain few-shot classification, so evaluating on them provides a fair comparison to previous methods. However, the images of these two datasets are natural images that still retain a high degree of visual similarity to the source. Moreover, according to [12], some previous state-of-the-art methods, e.g., [40], are not robust to large domain shifts. Therefore, following [12], we adopt ISIC and ChestX as the other two benchmarks. Some images from these datasets are shown in Figure 3.

Datasets. We conduct experiments using five datasets: mini-ImageNet, CUB, Cars, ISIC, and ChestX, following the same dataset processing as [12, 40]. Compared with the natural images in CUB and Cars, ISIC and ChestX contain dermoscopic images of skin lesions and X-ray images, respectively, which differ greatly from mini-ImageNet. Similar to [40], we select the training iteration with the best accuracy on the mini-ImageNet validation set for evaluation.

Figure 3: Images from different benchmarks. mini-ImageNet is used as the source domain, and domains of varying dissimilarity from natural images are used for target evaluation.
Method  Pre-trained  FiSL   CUB            Car            ISIC           ChestX
ProNet  -            -      51.82 ± 0.58   42.12 ± 0.56   39.41 ± 0.43   25.11 ± 0.35
        ✓            -      57.26 ± 0.57   42.83 ± 0.53   39.89 ± 0.43   25.03 ± 0.34
        ✓            ✓      57.84 ± 0.56   43.11 ± 0.54   41.47 ± 0.43   25.40 ± 0.35
R2D2    -            -      61.19 ± 0.60   43.91 ± 0.59   40.57 ± 0.43   25.10 ± 0.32
        ✓            -      62.31 ± 0.59   49.49 ± 0.57   42.81 ± 0.44   25.92 ± 0.34
        ✓            ✓      64.62 ± 0.58   51.62 ± 0.60   47.62 ± 0.47   26.48 ± 0.36
Table 3: 5-way 5-shot classification accuracy on different unseen domains, showing the influence of pre-training on the generalization of different meta-learning methods.

Implementation details. All experiments use ResNet-10 [13] for a fair comparison. Following [12, 40], we first pre-train ResNet-10 by minimizing the standard cross-entropy classification loss on the 64 training categories of mini-ImageNet. All models are then trained for 30,000 iterations with Adam [16] and a learning rate of 0.001. Because our method uses multi-task training, the batch size is set to 2 for training ProNet, RelationNet, ANIL, R2D2, and MetaOpt. The inner learning rate and number of update steps are 0.1 and 5 for ANIL. $\gamma$ and $\eta$ in our method are 0.5 and 0.1, respectively.

We report the average accuracy and the 95% confidence interval over 1,000 trials for all experiments. In each trial, the query set $\mathcal{D}^{\text{ts}}$ contains 15 images.

Method         Shot   CUB            Car
GNN† [32]      1      45.69 ± 0.68   31.79 ± 0.51
GNN-FWT [40]   1      47.47 ± 0.75   31.61 ± 0.53
LRP-CAN [36]   1      46.23 ± 0.42   32.66 ± 0.46
R2D2-FiSL      1      44.14 ± 0.60   34.18 ± 0.51
MetaOpt-FiSL   1      45.53 ± 0.60   34.67 ± 0.51
GNN† [32]      5      62.25 ± 0.65   44.28 ± 0.63
GNN-FWT [40]   5      66.98 ± 0.68   44.90 ± 0.64
LRP-CAN [36]   5      66.58 ± 0.39   43.86 ± 0.38
MetaOpt-FiSL   5      63.24 ± 0.57   50.82 ± 0.58
R2D2-FiSL      5      64.62 ± 0.58   51.62 ± 0.60
Table 4: 5-way K-shot classification accuracy on CUB and Car. † denotes results reported in [12].

Generalization with FiSL. Table 2 shows the results of five meta-learning models, with and without FiSL, on 5-way 1-shot and 5-way 5-shot cross-domain few-shot classification. Both the metric-based and gradient-based models trained with our method perform favorably against their individual baselines. This observation demonstrates that our method is model-agnostic and robust to unseen domains of varying dissimilarity from the source domain. We attribute the improvement in generalization to the use of FiSL, which makes the meta-learner truly learn domain-invariant meta-knowledge.

We also observe that gradient-based methods achieve better generalization than metric-based methods. This might be because gradient-based methods learn a base-learner for the new task from the provided labeled data, whereas the metric space learned in the source domain by metric-based methods is less flexible than adapting a learner on unseen domains.

Method             Shot   ISIC           ChestX
ProNet†            5      39.57 ± 0.57   24.05 ± 1.01
ProNet-FWT† [40]   5      38.87 ± 0.52   23.77 ± 0.42
RN†                5      39.41 ± 0.58   22.96 ± 0.88
RN-FWT† [40]       5      35.54 ± 0.55   22.74 ± 0.40
MAML†              5      40.13 ± 0.58   23.48 ± 0.96
CHEF [1]           5      41.26 ± 0.34   24.72 ± 0.14
Fixed† [12]        5      43.56 ± 0.60   25.35 ± 0.96
MetaOpt-FiSL       5      46.39 ± 0.46   26.46 ± 0.34
R2D2-FiSL          5      47.62 ± 0.47   26.48 ± 0.36
ProNet-FWT† [40]   20     43.78 ± 0.47   26.87 ± 0.43
RN-FWT† [40]       20     43.31 ± 0.51   26.75 ± 0.41
CHEF [1]           20     54.30 ± 0.34   29.71 ± 0.27
Fixed† [12]        20     52.78 ± 0.39   30.83 ± 1.05
MetaOpt-FiSL       20     55.34 ± 0.44   30.59 ± 0.35
R2D2-FiSL          20     58.74 ± 0.46   31.51 ± 0.36
Table 5: 5-way K-shot classification accuracy on ISIC and ChestX. RN indicates RelationNet. Fixed (fixed feature extractor) [12] is a strong baseline that leverages the pre-trained model as a fixed feature extractor with a linear classifier; many meta-learning models cannot outperform it when a large domain shift exists. † denotes results reported in [12].

Comparison to previous state of the art. Table 4 and Table 5 show the results; all models are trained on mini-ImageNet and evaluated on the other four benchmarks. First, we observe that GNN-FWT achieves the best performance on CUB but degrades on Car. Meanwhile, our method achieves competitive performance on CUB and outperforms GNN-FWT by 3.06% and 6.72% on 1-shot and 5-shot Car, respectively. We believe this is because CUB is a fine-grained bird dataset with the highest similarity to mini-ImageNet in semantic content and color style among the four datasets, as shown in Figure 3; some recent few-shot learning methods [2] can even handle this situation with a small domain shift. From this standpoint, learning well enough on mini-ImageNet alone can provide satisfying performance on CUB.

On ISIC and ChestX, our method achieves state-of-the-art performance and largely outperforms the previous methods. We also observe that FWT cannot improve the performance of ProNet and RelationNet; some similar declines for our method can be found in Table 2, although they are smaller than those of FWT. In this regard, it remains challenging to improve the generalization of metric-based meta-learning on domains with a large shift.

Moreover, some recent works [38, 5] point out that meta-learning based few-shot learning algorithms underperform a traditionally pre-trained model when a large shift exists between the base and novel class domains. Comparing the performance of MAML and ProNet with Fixed in Table 5 confirms this. With our method, however, several meta-learning methods can largely outperform Fixed.

5.3 Influence of Pre-training on the Generalization

According to several recent methods [49, 30], pre-training can significantly improve the performance of meta-learning frameworks in the typical few-shot learning scenario. In this section, we investigate the influence of pre-training on generalization to unseen domains.

As shown in Table 3, pre-training the feature encoder substantially improves the performance of ProNet and R2D2 on the four unseen benchmarks. However, the influence of pre-training on ProNet is less pronounced when there is a large domain shift in the target domains.

6 Conclusion

We propose a model-agnostic method that effectively enhances the generalization of different kinds of meta-learning frameworks under domain shift and can be applied to many learning problems. The core idea of our method is to use a feature-wise shift layer to simulate the various distributions found in unseen domains. To learn how to simulate these distributions and to learn domain-invariant knowledge, we develop a learning-to-learn approach that jointly optimizes the proposed feature-wise shift layer and the meta-learning model. Extensive experiments demonstrate that our method is applicable to different meta-learning frameworks, shows consistent improvement over the baselines, and is robust to different unseen domains.

References

  • [1] Thomas Adler, Johannes Brandstetter, Michael Widrich, Andreas Mayr, David Kreil, Michael Kopp, Günter Klambauer, and Sepp Hochreiter. Cross-domain few-shot learning by representation fusion. arXiv preprint arXiv:2010.06498, 2020.
  • [2] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In ECCV, pages 18–35, 2020.
  • [3] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, pages 1006–1016, 2018.
  • [4] Luca Bertinetto, João F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In ICLR, 2019.
  • [5] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
  • [6] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007.
  • [7] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [8] Miroslav Dudík, Robert E. Schapire, and Steven J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In NIPS, pages 323–330, 2005.
  • [9] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In ICLR, 2017.
  • [10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
  • [11] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
  • [12] Yunhui Guo, Noel Codella, Leonid Karlinsky, James V. Codella, John R. Smith, Kate Saenko, Tajana Rosing, and Rogério Feris. A broader study of cross-domain few-shot learning. In ECCV, pages 124–141, 2020.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [14] Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J. Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
  • [15] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In CVPR, 2019.
  • [16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [17] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshops, pages 554–561, 2013.
  • [18] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In CVPR, pages 10657–10665, 2019.
  • [19] Seungmin Lee, Dongwan Kim, Namil Kim, and Seong-Gyun Jeong. Drop to adapt: Learning discriminative features for unsupervised domain adaptation. In ICCV, pages 91–100, 2019.
  • [20] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M Hospedales. Episodic training for domain generalization. In ICCV, pages 1446–1455, 2019.
  • [21] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, pages 2208–2217, 2017.
  • [22] Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: task dependent adaptive metric for improved few-shot learning. In NeurIPS, pages 719–729, 2018.
  • [23] Prashant Pandey, Mrigank Raman, Sumanth Varambally, and Prathosh AP. Domain generalization via inference-time label-preserving target projections. arXiv preprint arXiv:2103.01134, 2021.
  • [24] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, pages 3942–3951, 2018.
  • [25] Cheng Perng Phoo and Bharath Hariharan. Self-training for few-shot transfer across extreme task differences. arXiv preprint arXiv:2010.07734, 2020.
  • [26] Kun Qian and Zhou Yu. Domain adaptive dialog generation via meta learning. In ACL, pages 2639–2649, 2019.
  • [27] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? towards understanding the effectiveness of MAML. In ICLR, 2020.
  • [28] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In ICML, pages 5331–5340, 2019.
  • [29] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [30] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
  • [31] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850, 2016.
  • [32] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In ICLR, 2018.
  • [33] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In ICLR, 2018.
  • [34] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
  • [35] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
  • [36] Jiamei Sun, Sebastian Lapuschkin, Wojciech Samek, Yunqing Zhao, Ngai-Man Cheung, and Alexander Binder. Explain and improve: Cross-domain few-shot-learning using explanations. arXiv preprint arXiv:2007.08790, 2020.
  • [37] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, pages 1199–1208, 2018.
  • [38] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
  • [39] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1–9, 2018.
  • [40] Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, and Ming-Hsuan Yang. Cross-domain few-shot classification via learned feature-wise transformation. In ICLR, 2020.
  • [41] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
  • [42] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C. Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, pages 5339–5349, 2018.
  • [43] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [44] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
  • [45] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, pages 2097–2106, 2017.
  • [46] Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv., 53(3):63:1–63:34, 2020.
  • [47] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In ICCV, pages 1426–1435, 2019.
  • [48] Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive svms. In ACM MM, pages 188–197, 2007.
  • [49] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, pages 8805–8814, 2020.
  • [50] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
  • [51] Long Zhao, Ting Liu, Xi Peng, and Dimitris N. Metaxas. Maximum-entropy adversarial data augmentation for improved generalization and robustness. In NeurIPS, 2020.