
When does MAML Work the Best? An Empirical Study on Model-Agnostic Meta-Learning in NLP Applications

Zequn Liu, Ruiyi Zhang, Yiping Song, Wei Ju, Ming Zhang
Department of Computer Science, School of EECS, Peking University, Beijing, China
{zequnliu, zhangruiyi, songyiping, juwei, mzhang_cs}@pku.edu.cn
Corresponding author
Abstract

Model-Agnostic Meta-Learning (MAML) has been successfully employed in Natural Language Processing (NLP) applications, including few-shot text classification and multi-domain low-resource language generation. Many impacting factors, including data quantity and data distribution, can affect the performance of MAML in NLP, but few works have studied them thoroughly. In this paper, we conduct an empirical study to investigate these impacting factors and, based on the experimental results, conclude when MAML works the best.

1 Introduction

In the field of Natural Language Processing (NLP), the abundance of training data plays a crucial role in the performance of deep learning models [Dodge et al., 2021]. However, numerous NLP applications face a substantial challenge due to the scarcity of annotated data [Schick and Schütze, 2021]. For example, in personalized dialogue generation, each user’s annotated data is limited, making it difficult to train a well-performing response generation model for each individual [Madotto et al., 2019, Song et al., 2020]. Many techniques have been employed to address the issue of data scarcity, including self-supervised pre-training [Achiam et al., 2023, OpenAI, 2022], transfer learning [Gero et al., 2018, Kumar et al., 2022], and meta-learning [Madotto et al., 2019, Song et al., 2020, Zhao et al., 2022]. Compared to other approaches, meta-learning focuses on designing models that learn to learn from small data sets, reducing the dependency on extensive pre-training data [Finn et al., 2017, Vilalta and Drissi, 2002]. Therefore, meta-learning has been widely applied in low-resource NLP tasks.

Model-Agnostic Meta-Learning (MAML) [Finn et al., 2017] is one of the most popular meta-learning methods. It is trained on many tasks (i.e., small data sets) to obtain a parameter initialization that can be adapted to target tasks with only a few samples. As a model-agnostic framework, MAML has been successfully employed in different NLP applications. Some works use MAML for few-shot text classification, such as relation classification [Obamuyide and Vlachos, 2019] and topic classification [Bao et al., 2020]. Other works use MAML for multi-domain and low-resource language generation, such as few-shot dialogue systems [Mi et al., 2019, Madotto et al., 2019, Qian and Yu, 2019, Song et al., 2020] and low-resource machine translation [Gu et al., 2018].

When applying MAML to NLP, several factors can influence the training strategy and performance of the model. Firstly, the data quantity within the datasets used as "tasks" varies across applications, which can affect the effectiveness of MAML [Serban et al., 2015, Song et al., 2020]. Secondly, while vanilla MAML assumes that the data distribution is the same across tasks, in real-world NLP tasks the data distributions can differ significantly [Li et al., 2018, Balaji et al., 2018]. For example, PAML [Madotto et al., 2019] regards each person's dialogues as a task for MAML, and different people have different personal profiles. This variation manifests both between training tasks and between training and testing tasks, and it likewise affects the performance of MAML. Few works have thoroughly studied these impacting factors.

In this paper, we take an empirical approach to systematically investigating these impacting factors and finding when MAML works the best. We conduct extensive experiments over 4 datasets. We first study the effects of data quantity and distribution on the training strategy. RQ1: Since the parameter initialization learned by MAML can be seen as a general language model over the training tasks, how does training this general language model affect the model's task-specific adaptation ability when the training and testing tasks have different data distributions? RQ2: How do the data distribution and data quantity affect the choice of the number of fine-tuning epochs? We then study the effects of these factors on model performance. RQ3: How do data quantity and data distribution affect the performance of MAML?

The experimental results provide insights into MAML in NLP applications. Our conclusions can help researchers better develop meta-learning methods for NLP.

2 Preliminaries

2.1 Meta-learning Problem Definition

In meta-learning, we have multiple tasks $T_i$ sampled from a distribution $p(\mathcal{T})$ [Ravi and Larochelle, 2017, Andrychowicz et al., 2016, Santoro et al., 2016]. For each task $T_i$, we train a base model $f_i^{\theta}$ with parameters $\theta_i$ on its training corpus $D_i^{train}$, which has only a few samples, and evaluate the model on its held-out corpus $D_i^{valid}$. We divide the tasks into meta-training, meta-validation, and meta-testing sets. The goal of meta-learning is that, after training on the meta-training set, we can quickly find $f_i^{\theta}$ via fine-tuning (adaptation) on $D_i^{train}$ for each task $T_i$ in the meta-testing set.

2.2 Model-Agnostic Meta-Learning

Model-Agnostic Meta-Learning (MAML) [Finn et al., 2017] pre-trains a parameter initialization $\theta$ shared among tasks. At each training epoch, MAML samples a set of tasks $T_i \sim p(\mathcal{T})$. For each task $T_i$, MAML trains from the initialization $\theta$ on $D_i^{train}$ to get $\theta_i$, then evaluates each $\theta_i$ on $D_i^{valid}$ and updates $\theta$, that is,

$$\theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{D_i^{train}}(\theta), \qquad \theta = \theta - \beta \nabla_{\theta} \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{D_i^{valid}}(\theta_i)$$ (1)

where $\mathcal{L}_{D_i^{train}}(\theta)$ and $\mathcal{L}_{D_i^{valid}}(\theta_i)$ are the losses of $\theta$ on $D_i^{train}$ and of $\theta_i$ on $D_i^{valid}$, and $\alpha$ and $\beta$ are the learning rates. In the fine-tuning stage, each task fine-tunes from the pre-trained initialization $\theta$ on its own $D_i^{train}$.
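
To make the update in Eq. (1) concrete, the following is a minimal sketch in PyTorch of one meta-training step, using the common first-order approximation (the outer gradient is taken at the adapted parameters $\theta_i$ rather than differentiating through the inner update). The helper name fomaml_step and the assumption that each task supplies a single (inputs, targets) batch for $D_i^{train}$ and $D_i^{valid}$ are ours for illustration, not the authors' implementation.

import copy
import torch

def fomaml_step(model, tasks, loss_fn, alpha=1e-2, beta=1e-3, inner_steps=1):
    """One meta-training step in the spirit of Eq. (1), first-order variant.
    `tasks` is a list of (d_train, d_valid) pairs; each is an (x, y) batch."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]

    for d_train, d_valid in tasks:
        # Inner loop: adapt a copy of the shared initialization theta on D_i^train.
        adapted = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
        for _ in range(inner_steps):
            x, y = d_train
            inner_opt.zero_grad()
            loss_fn(adapted(x), y).backward()
            inner_opt.step()

        # Outer loss: evaluate theta_i on D_i^valid and accumulate its gradient.
        x, y = d_valid
        adapted.zero_grad()
        loss_fn(adapted(x), y).backward()
        for g, p in zip(meta_grads, adapted.parameters()):
            if p.grad is not None:
                g += p.grad

    # Outer update of the shared initialization theta (sum over sampled tasks, as in Eq. (1)).
    with torch.no_grad():
        for p, g in zip(model.parameters(), meta_grads):
            p -= beta * g

Exact MAML would back-propagate through the inner SGD step (e.g., with torch.func), but the first-order variant keeps the sketch short and is a widely used approximation.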

3 Experimental Setup

3.1 Datasets

In Experiment I: Text Classification, we use FewRel [Han et al., 2018] and Amazon [He and McAuley, 2016]. Both are used for 5-way 5-shot classification, meaning that each task is built by randomly sampling 5 classes from the corresponding split of the dataset, with 5 samples per class. FewRel is a relation classification dataset with 65/5/10 classes for meta-training/meta-validation/meta-testing. Amazon is a large customer review dataset with 24 product categories. Following [Bao et al., 2020], we sample 1000 reviews from each category and use 10/5/9 classes for meta-training/meta-validation/meta-testing.
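
As an illustration of how such episodes can be constructed, below is a minimal sketch of N-way K-shot task sampling from a list of (text, label) pairs. The query-set size k_query and the helper name sample_task are assumptions for illustration; the exact episode construction follows [Bao et al., 2020].

import random
from collections import defaultdict

def sample_task(dataset, n_way=5, k_shot=5, k_query=5, seed=None):
    """Sample one N-way K-shot task (a small classification data set)
    from a list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)

    classes = rng.sample(sorted(by_label), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        texts = rng.sample(by_label[cls], k_shot + k_query)
        support += [(t, episode_label) for t in texts[:k_shot]]
        query += [(t, episode_label) for t in texts[k_shot:]]
    # support and query play the roles of D_i^train and D_i^valid for this task.
    return support, query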

In Experiment II: Dialogue Generation, we use Persona [Zhang et al., 2018] and Weibo, regarding building a dialogue model for a user as a task. Persona is a personalized dialogue dataset with 1137/99/100 users for meta-training/meta-validation/meta-testing. Each user has 121 utterances on average. Weibo is a personalized dialogue dataset collected from Weibo conversations with 371/40/38 users for meta-training/meta-validation/meta-testing. Each user has 1200 utterances on average.

3.2 Base Models and Settings

In each experiment, we use the same base model for both datasets. In the text classification experiments, we use the CNN proposed in [Bao et al., 2020] as the base model and follow its hyperparameter settings. In the dialogue generation experiments, we use Transformer [Vaswani et al., 2017] as the base model. In Persona, we use pre-trained GloVe embeddings [Pennington et al., 2014]. In Weibo, we use embeddings trained with Gensim [Rehurek and Sojka, 2010]. We follow the other hyperparameter settings in [Madotto et al., 2019].

3.3 Evaluation Metrics

In the text classification experiments, we use accuracy (Acc) to evaluate classification performance. In the dialogue generation experiments, we evaluate the performance of MAML in terms of quality and personality. We use PPL and BLEU [Papineni et al., 2002] to measure the similarity between the reference and the generated response, and C Score [Madotto et al., 2019] to measure personality. In Persona, the C Score uses a pre-trained natural language inference model to measure the consistency of the generated response with the persona description. In Weibo, users do not have persona descriptions, so we pre-train a user classifier to classify the generated responses and use its accuracy as the C Score.
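
For reference, a rough sketch of how these metrics could be computed is given below. The exact BLEU variant and the form of the user classifier are not specified here, so the smoothed sentence-level BLEU (via NLTK), the perplexity from per-token negative log-likelihoods, and the classify_user callable are assumptions.

import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def avg_bleu(pairs):
    """Corpus-average smoothed sentence BLEU over
    (reference_tokens, hypothesis_tokens) pairs."""
    smooth = SmoothingFunction().method1
    return sum(sentence_bleu([ref], hyp, smoothing_function=smooth)
               for ref, hyp in pairs) / len(pairs)

def perplexity(token_nlls):
    """PPL = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def c_score_weibo(responses, user_ids, classify_user):
    """Weibo-style C Score: accuracy of a pre-trained user classifier
    (an arbitrary callable here) on the generated responses."""
    hits = sum(classify_user(r) == u for r, u in zip(responses, user_ids))
    return hits / len(responses)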

4 Results and Analysis

4.1 Trade-off Problem between General Language Model and Task-specific Adaptation

To answer RQ1, we compare how the general language model and the task-specific adaptation ability change during the training of MAML, to determine whether there is a trade-off between them (Figure 1). We take the parameter initializations trained for different numbers of MAML epochs and evaluate them directly on the meta-testing set before fine-tuning, using the quality performance (accuracy for classification and BLEU for generation) to evaluate the general language model. We then use the personality performance (accuracy for classification and C Score for generation) after fine-tuning to measure each initialization's ability to adapt to specific tasks.
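
The evaluation protocol can be summarized by the following sketch, in which checkpoints, eval_quality, eval_personality, and fine_tune are hypothetical callables standing in for the components described above.

from statistics import mean

def trace_tradeoff(checkpoints, meta_test_tasks,
                   eval_quality, eval_personality, fine_tune):
    """For each saved MAML initialization (epoch, init), measure quality on the
    meta-testing tasks before fine-tuning (general language model) and the
    personality/accuracy after fine-tuning (task-specific adaptation)."""
    curve = []
    for epoch, init in checkpoints:
        quality_before = mean(eval_quality(init, task) for task in meta_test_tasks)
        personality_after = mean(
            eval_personality(fine_tune(init, task), task) for task in meta_test_tasks
        )
        curve.append((epoch, quality_before, personality_after))
    return curve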

[Figure 1: The changing trend of the performance (accuracy for text classification; personality (C Score) and quality (BLEU) for dialogue generation) before and after fine-tuning as the training epochs increase. Panels: (a) C Score and BLEU on Persona; (b) C Score and BLEU on Weibo; (c) Acc on Amazon and FewRel.]

As the training epochs increase, the model's task-specific adaptation ability reaches its peak earlier than the quality of its general language model does. In Persona, BLEU before fine-tuning peaks at epoch 3000, but the final C Score drops after epoch 1500. In Weibo, the changing trends are consistent with the results on Persona. In FewRel and Amazon, the general language model first improves and then over-fits to the training data, but the final accuracy decreases after epoch 1.

This finding suggests that the parameter initialization at the late training stage has a strong general language model but performs comparatively poorly in task-specific adaptation. In the early training stage, performance improves thanks to the pre-trained general language model; however, if the language model becomes too "general", it loses the ability to adapt to specific tasks. It is noteworthy that the "too general" problem is not the same as over-fitting: the "too general" model performs well before fine-tuning, which means it does not over-fit to the training data.

4.2 Impact of Data Quantity and Task Profile on Fine-tuning

To answer RQ2, for each task in Persona we find the fine-tuning epoch at which its BLEU and its C Score reach their best values, respectively, in order to study the impact of data quantity and the task profile (persona description) on fine-tuning (Table 1). We cluster the tasks with similar best fine-tuning epoch numbers and calculate the average dialogue quantity and task profile similarity for each cluster. To evaluate task profile similarity, we define the Jac Score of the persona descriptions as $\mathit{Jac\ Score} = \sum_{i=1}^{N} Jac_i / (N \cdot Jac_{whole})$, where $Jac_i$ is the Jaccard similarity of the persona descriptions in cluster $i$, $Jac_{whole}$ is the Jaccard similarity of all the persona descriptions, and $N$ is the number of clusters. If the Jac Score is larger than 1, the persona descriptions within each cluster are similar to each other; a larger Jac Score indicates higher within-cluster similarity.
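
A concrete reading of this definition is sketched below, assuming that $Jac_i$ (and $Jac_{whole}$) is the mean pairwise Jaccard similarity of the token sets of the persona descriptions within cluster $i$ (over all descriptions); the exact pairing scheme is an assumption, not spelled out above.

from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_pairwise_jaccard(profiles):
    """Average Jaccard similarity over all pairs of persona descriptions
    (each description given as an iterable of tokens)."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def jac_score(clusters, all_profiles):
    """Jac Score = sum_i Jac_i / (N * Jac_whole): within-cluster similarity
    relative to the similarity over all persona descriptions."""
    jac_whole = mean_pairwise_jaccard(all_profiles)
    n = len(clusters)
    return sum(mean_pairwise_jaccard(c) for c in clusters) / (n * jac_whole)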

For both BLEU and C Score, the Jac Score is around 1 for each cluster, which means the persona descriptions within a cluster are not especially similar. The dialogue quantity is also similar across clusters. We therefore conclude that data quantity and task profile do not have a major impact on the fine-tuning process.

Cluster | BLEU: Mean Fine-tuning Epochs | BLEU: Task Similarity | BLEU: Dialogue Quantity | C Score: Mean Fine-tuning Epochs | C Score: Task Similarity | C Score: Dialogue Quantity
1 | 0.22 | 1.05 | 9.31 | 2.50 | 1.00 | 9.05
2 | 1.16 | 1.04 | 9.63 | 5.83 | 1.06 | 9.89
3 | 2.07 | 1.01 | 9.94 | 8.86 | 1.09 | 10.05
4 | 3.68 | 1.00 | 9.63 | 10.25 | 1.05 | 9.16
5 | 5.58 | 1.00 | 9.85 | 12.86 | 1.00 | 10.24
Table 1: The average dialogue quantity and task profile similarity of the task clusters in Persona, grouped by each task's best fine-tuning epoch for BLEU and for C Score (the epoch at which its BLEU or C Score is highest).

4.3 Impact of Data Quantity and Task Similarity on the Performance of MAML

To answer RQ3, we conduct experiments under different data quantity and task similarity settings. We compare MAML with two baselines: Transformer/CNN, which pre-trains the base model (Transformer/CNN) on the meta-training set and evaluates it directly on the meta-testing set, and Transformer/CNN-F, which additionally fine-tunes Transformer/CNN on each meta-testing task.

Data Quantity. In Persona, we evaluate Transformer/CNN, Transformer/CNN-F, and MAML under 3 data quantity settings: 50/100/120-shot (each task has 50, 100, or 120 utterances on average). In Weibo, FewRel, and Amazon, the settings are 500/1000/1200-shot, 3/4/5-shot, and 3/4/5-shot, respectively (Table 2). When the data quantity is small, the advantage of MAML is more significant. In Persona, the C Score and BLEU of MAML outperform the baselines in the 50-shot and 100-shot settings, but in the 120-shot setting, the BLEU of MAML is lower than that of Transformer-F. In Weibo, FewRel, and Amazon, the margins by which MAML outperforms the baselines also shrink as the data quantity increases. This finding is in line with the mechanism of MAML: it finds a sensitive parameter initialization that can adapt with few data samples [Finn et al., 2017]. When there are enough data samples, plain fine-tuning also performs well, which is why the BLEU of Transformer-F in Persona is even better in the 120-shot setting.

Method | Persona (PPL / C Score / BLEU) | Weibo (PPL / C Score / BLEU) | FewRel (Acc) | Amazon (Acc)

50-shot (Persona) / 500-shot (Weibo) / 3-shot (FewRel, Amazon)
Transformer/CNN | 96.56 / -0.07 / 0.11 | 36.92 / 0.09 / 1.94 | 0.17 | 0.18
Transformer/CNN-F | 48.46 / -0.04 / 0.46 | 34.21 / 0.15 / 2.50 | 0.54 | 0.50
MAML | 55.45 / 0.10 / 0.55 | 53.01 / 0.21 / 6.09 | 0.61 | 0.51

100-shot / 1000-shot / 4-shot / 4-shot
Transformer/CNN | 36.83 / -0.08 / 0.75 | 61.81 / 0.10 / 2.92 | 0.18 | 0.21
Transformer/CNN-F | 31.21 / -0.01 / 0.68 | 52.88 / 0.17 / 3.42 | 0.59 | 0.54
MAML | 43.57 / 0.19 / 0.82 | 47.62 / 0.22 / 6.75 | 0.65 | 0.55

120-shot / 1200-shot / 5-shot / 5-shot
Transformer/CNN | 36.54 / -0.03 / 0.80 | 36.95 / 0.10 / 2.87 | 0.20 | 0.20
Transformer/CNN-F | 34.28 / 0.08 / 0.89 | 35.02 / 0.20 / 5.17 | 0.66 | 0.55
MAML | 56.21 / 0.27 / 0.84 | 41.88 / 0.22 / 8.39 | 0.68 | 0.55

Similar-task setting (Persona, Weibo only)
Transformer/CNN-F | 35.93 / -0.03 / 0.70 | 90.93 / 0.05 / 0.33 | - | -
MAML | 42.50 / -0.02 / 0.63 | 92.79 / 0.04 / 0.38 | - | -

Table 2: The performance on different data quantity and task similarity settings.

Task similarity. In Persona and Weibo, each task is the set of dialogues of one user, so tasks differ from each other. To construct a setting in which tasks are similar to each other, we shuffle the samples and randomly divide them into tasks. For a fair comparison, each task in this setting also has 120 and 1200 utterances on average in Persona and Weibo, respectively. We train and evaluate Transformer-F and MAML in this setting (Table 2). When tasks are similar to each other, MAML performs comparatively poorly: in Persona and Weibo, its performance is similar to that of Transformer-F, whereas MAML performs significantly better than Transformer-F when tasks are different. A possible explanation is that if there is no clear distinction between tasks, the meta-learning setting reduces to a transfer learning setting with only a source domain and a target domain, where fine-tuning already performs well. So if the tasks are similar to each other, we can simply use Transformer-F rather than MAML.
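
A minimal sketch of how the similar-task setting can be constructed is given below; the equal-size split and the helper name make_similar_tasks are assumptions, as the construction above only states that samples are shuffled and randomly divided into tasks.

import random

def make_similar_tasks(all_dialogues, n_tasks, per_task, seed=0):
    """Construct the 'similar tasks' setting: pool the dialogues of all users,
    shuffle them, and split them into random pseudo-tasks of equal size, so
    every task follows roughly the same data distribution."""
    rng = random.Random(seed)
    dialogues = list(all_dialogues)
    rng.shuffle(dialogues)
    return [dialogues[i * per_task:(i + 1) * per_task] for i in range(n_tasks)]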

5 Conclusion

In this paper, we conduct an empirical study to investigate the impacting factors on the performance of MAML in NLP applications. We show that MAML works the best when the general language model is not fully trained by MAML, the data quantity of each task is small, and the tasks are dissimilar from each other. We also point out that it is unnecessary to customize the number of fine-tuning epochs for each task according to the task profile or data quantity. Our work sheds light on the application of MAML in NLP.

References

  • [Achiam et al., 2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • [Andrychowicz et al., 2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In NIPS, pages 3981–3989.
  • [Balaji et al., 2018] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. 2018. Metareg: Towards domain generalization using meta-regularization. Advances in neural information processing systems, 31.
  • [Bao et al., 2020] Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2020. Few-shot text classification with distributional signatures. ICLR.
  • [Dodge et al., 2021] Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.
  • [Finn et al., 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, pages 1126–1135.
  • [Gero et al., 2018] Katy Ilonka Gero, Giannis Karamanolakis, and Lydia Chilton. 2018. Transfer learning for style-specific text generation. In Proceedings of 32nd Conference on Neural Information Processing Systems, NIPS, volume 2018.
  • [Gu et al., 2018] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. 2018. Meta-learning for low-resource neural machine translation. EMNLP.
  • [Han et al., 2018] Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In EMNLP, pages 4803–4809.
  • [He and McAuley, 2016] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517.
  • [Kumar et al., 2022] Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah, and Dan Roth. 2022. Unsupervised neural stylistic text generation using transfer learning and adapters. arXiv preprint arXiv:2210.03264.
  • [Li et al., 2018] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. 2018. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
  • [Madotto et al., 2019] Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In ACL, pages 5454–5459.
  • [Mi et al., 2019] Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. 2019. Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In IJCAI, pages 3151–3157.
  • [Obamuyide and Vlachos, 2019] Abiola Obamuyide and Andreas Vlachos. 2019. Model-agnostic meta-learning for relation classification with limited supervision. In ACL, pages 5873–5879.
  • [OpenAI, 2022] OpenAI. 2022. Chatgpt: Optimizing language models for dialogue. Technical blog.
  • [Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
  • [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543.
  • [Qian and Yu, 2019] Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. In ACL, pages 2639–2649.
  • [Ravi and Larochelle, 2017] Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In ICLR.
  • [Rehurek and Sojka, 2010] Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In LREC. Citeseer.
  • [Santoro et al., 2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850.
  • [Schick and Schütze, 2021] Timo Schick and Hinrich Schütze. 2021. Few-shot text generation with natural language instructions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 390–402.
  • [Serban et al., 2015] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.
  • [Song et al., 2020] Yiping Song, Zequn Liu, Wei Bi, Rui Yan, and Ming Zhang. 2020. Learning to customize model structures for few-shot dialogue generation tasks. In ACL, pages 5832–5841.
  • [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
  • [Vilalta and Drissi, 2002] Ricardo Vilalta and Youssef Drissi. 2002. A perspective view and survey of meta-learning. Artificial intelligence review, 18:77–95.
  • [Zhang et al., 2018] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, pages 2204–2213.
  • [Zhao et al., 2022] Yingxiu Zhao, Zhiliang Tian, Huaxiu Yao, Yinhe Zheng, Dongkyu Lee, Yiping Song, Jian Sun, and Nevin Zhang. 2022. Improving meta-learning for low-resource text classification and generation via memory imitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 583–595.