Language Models as Few-Shot Learner for
Task-Oriented Dialogue Systems
Abstract
Task-oriented dialogue systems use four connected modules: Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG). A research challenge is to learn each module with the least amount of samples (i.e., few-shot), given the high cost of data collection. The most common and effective technique for this problem is transfer learning, where large language models, pre-trained either on generic text or on task-specific data, are fine-tuned on the few available samples. These methods require fine-tuning steps and a separate set of parameters for each task. Differently, language models such as GPT-2 Radford et al. (2019) and GPT-3 Brown et al. (2020) allow few-shot learning by priming the model with a few examples. In this paper, we evaluate the few-shot priming ability of language models on the NLU, DST, DP and NLG tasks. Importantly, we highlight the current limitations of this approach, and we discuss the possible implications for future work.
Acknowledgments
I would like to thank Jason Wu for providing easy-to-use code in ToD-BERT and for clarifications about the code and tasks, Baolin Peng for the easy-to-use FewShotNLG repository and for help with the scorer, and Sumanth Dathathri for the discussion and insights about the limitations of LM priming for few-shot learning.
1 Introduction
Modularized task-oriented dialogue systems are at the core of the current generation of smart speakers (e.g., Alexa, Siri, etc.). The main modules of such systems are Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP) and Natural Language Generation (NLG), each of which is trained separately using supervised and/or reinforcement learning. Thus a data collection process is required, which for some of the tasks can be laborious and expensive. For example, dialogue policy annotation has to be done by an expert, ideally a professional linguist. Therefore, having a model that requires only a few samples to perform well on these tasks is essential.

The most successful approach to few-shot learning for task-oriented dialogue systems is notably transfer learning, where a large model is first pre-trained on a large corpus and then fine-tuned on specific tasks. For task-oriented dialogue systems, Wu et al. (2020) proposed TOD-BERT, a large pre-trained model which achieves better performance than BERT Devlin et al. (2019) in few-shot NLU, DST and DP. Liu et al. (2020) proposed a two-step classification for few-shot slot-filling, a key task for the NLU module. Similarly, Peng et al. (2020b) introduced a benchmark for few-shot NLG and a pre-trained language model (SC-GPT) specialized for the task. Further, a template rewriting schema based on T5 Raffel et al. (2019) was developed by Kale and Rastogi (2020) for few-shot NLG on two well-known datasets. Peng et al. (2020a) proposed a pre-trained language model (LM) for end-to-end pipelined task-oriented dialogue systems. In their experiments, they showed promising few-shot learning performance on MWoZ Budzianowski et al. (2018). Finally, several meta-learning approaches have been proposed for DP Xu et al. (2020), NLG/ACT Mi et al. (2019), pipelined end-to-end models Qian and Yu (2019) and personalized dialogue systems Madotto et al. (2019).
To perform few-shot learning, existing methods require a set of task-specific parameters, since the model is fine-tuned on the few samples. Differently, in this paper, we perform few-shot learning by priming LMs with a few examples Radford et al. (2019); Brown et al. (2020). In this setting, no parameters are updated, thus allowing a single model to perform multiple tasks at the same time. We evaluate the few-shot ability of LM priming on the four task-oriented dialogue tasks previously mentioned (i.e., NLU, DST, DP, and NLG). Currently, GPT-3 Brown et al. (2020) is not available to the public; thus we experiment with GPT-2 Radford et al. (2019) models of different sizes: SMALL (117M), LARGE (762M), and XL (1.54B). All the experiments are run on a single NVIDIA 1080Ti GPU.
2 Basic Notation and Tasks
Let us define a dialogue as the alternation of utterances between two speakers, denoted by $U$ (user) and $S$ (system) respectively. An utterance is a sequence of words, and the concatenation of utterances $D_t = \{U_1, S_1, \dots, U_t, S_t\}$ denotes a dialogue with $t$ turns. In this paper, we focus on four task-oriented dialogue system tasks, and we briefly introduce the input-output of each task.
NLU
This task aims to extract slot-value pairs (SLOT-FILLING) and the intent (INTENT) from a user utterance $U_t$. In the literature, the most common approach for NLU is to learn a BIO tagger for the slot-value pairs and a multi-class classifier for the intent. SLOT-FILLING takes as input a user utterance $U_t$ and produces a dictionary $B = \{(s_1, v_1), \dots, (s_k, v_k)\}$, where $s_i$ is a slot and $v_i$ its possible value. Note that $v_i$ can also be None, since some slots may not be mentioned in the utterance. The INTENT task takes a user utterance $U_t$ and classifies it into an intent class $c \in I$. Sometimes, intent classification is merged with domain classification.
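As a concrete illustration, a minimal sketch of the NLU input-output format in Python; the slot names and intent label follow SNIPS-style annotations but are only illustrative here:

```python
# Illustrative NLU input-output (slot names and intent label are examples, not a specific annotation).
utterance = "add sabrina salerno to the grime instrumentals playlist"

slot_filling_output = {
    "artist": "sabrina salerno",        # value extracted from the utterance
    "playlist": "grime instrumentals",  # value extracted from the utterance
    "entity_name": None,                # slot not mentioned in the utterance -> None
}

intent_output = "AddToPlaylist"         # one class from the intent set I
```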
DST
This task extracts slot-value pairs for a whole dialogue, and can thus be considered a dialogue-level version of the NLU. Given a dialogue with $t$ turns, represented as the sequence of utterances $D_t$, a DST model predicts a dictionary $B_t$ of slot-value pairs, as in the NLU. Note that most existing DST models use the previously generated $B_{t-1}$ and only update the required slots using an NLU tagger.
ACT
This task predicts the next speech-act (e.g., INFORM, REQUEST, etc.) given the current dialogue state, in the form of a dialogue or a dictionary of slot-value pairs. It is usually cast as a reinforcement learning task in both online and offline settings. In this paper, we simplify the task and, instead of learning a dialogue policy, we perform dialogue act classification. This is a multi-label classification task, since more than one speech-act can be used in an utterance. The task takes as input a system utterance $S_t$ and classifies it into a set of possible speech-acts in $A$.
NLG
This task maps a dialogue act, which is made of a speech-act plus a dictionary of slot-value pairs, into natural language. The model takes as input a speech-act concatenated with a slot-value dictionary, overall denoted as $\mathrm{DA}$, and it generates an utterance $S_t$ as output.
In the few-shot setting, a small number of input-output pairs is provided to the model, expecting a high degree of generalization.
3 Priming the LM for few-shot learning
Differently from fine-tuning, few-shot learning with LMs only requires designing task-specific prefixes. In our four tasks, we use three categories of prefixes: binary, value-based and generative. In the following notation, we use $X$ to represent a generic input and $X_i$ for the $i$-th shot sample, thus implying that the shot part of the prefix remains fixed during inference, while $X$ can be any new input. These prefixes are provided to the LM, and the generated tokens become the actual prediction; Figure 1 shows an example of intent recognition.
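A minimal sketch of this priming loop with the Hugging Face Transformers library; the model size, the greedy decoding settings, and the prime helper name are choices of this sketch, not prescribed here:

```python
# Minimal sketch of LM priming: the prefix ends right before the answer,
# and the greedily generated continuation is taken as the prediction.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

def prime(prefix, max_new_tokens=5):
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    # Only the tokens generated after the prefix constitute the prediction.
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens).split("\n")[0].strip()
```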
Binary
prefixes are used for classification (namely, for intent classification and speech-act detection). We treat every classification task as binary, even multi-class ones. To perform the few-shot priming for a class $c$, we use the following prefix:
$X_1 \;\; c = \texttt{true} \qquad \bar{X}_1 \;\; c = \texttt{false} \qquad \cdots \qquad X \;\; c =$   (1)
where $X_1$ is one of the few-shot samples of class $c$, and $\bar{X}_1$ is a sample from another class, or from the false class if it exists. To predict among $n$ classes, a set of $n$ prefixes is used, and thus $n$ forwards are required.
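Building on the prime helper above, a sketch of a binary prefix in the format of Eq. (1), using the INTENT example from the appendix; the helper name and the two-token generation limit are assumptions of this sketch:

```python
def binary_prefix(positive_shot, negative_shot, query, class_name):
    # One positive and one negative shot for the class, followed by the query with a blank label.
    return (f"{positive_shot} {class_name} = true\n"
            f"{negative_shot} {class_name} = false\n"
            f"{query} {class_name} =")

prefix = binary_prefix(
    positive_shot="listen to westbam alumb allergic on google music",
    negative_shot="rate this novel 4 points out of 6",
    query="add sabrina salerno to the grime instrumentals playlist",
    class_name="playmusic",
)
print(prime(prefix, max_new_tokens=2))  # expected to start with "true" or "false"
# For n classes, n such prefixes are built and the model is forwarded n times.
```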
Value-based
prefixes are used to assign the value of a certain slot given an utterance, or None if no value is provided. We define a prefix for each slot, similarly to TRADE Wu et al. (2019), which requires forwarding the model $k$ times for decoding $k$ slots. To perform the few-shot priming of one slot $s$, we use the following prefix:
$X_1 \;\; s = v_1 \qquad X_2 \;\; s = v_2 \qquad \cdots \qquad X \;\; s =$   (2)
where $v_i$ is the value assigned in the $i$-th few-shot training sample (possibly None). This process is repeated for each slot to generate the dictionary $B$.
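A corresponding sketch for the value-based prefix of Eq. (2), reusing prime from the earlier sketch; the helper names are illustrative:

```python
def slot_prefix(shots, query, slot):
    # shots: list of (utterance, value-or-"None") pairs for this slot, as in the appendix example.
    lines = [f"{u} {slot} = {v}" for u, v in shots]
    lines.append(f"{query} {slot} =")
    return "\n".join(lines)

def predict_slots(shots_per_slot, query):
    # One forward pass per slot; the decoded values form the dictionary B.
    return {slot: prime(slot_prefix(shots, query, slot), max_new_tokens=8)
            for slot, shots in shots_per_slot.items()}
```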
Generative
prefixes are used to instruct the model to generate natural language given some source information (e.g., NLG). The prefix is the following:
$X_1 \; Y_1 \qquad X_2 \; Y_2 \qquad \cdots \qquad X$   (3)
where $X$ and $Y$ are generic sequences of words.
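A sketch of the generative prefix of Eq. (3); the newline separator between shots and the example dialogue-act string are assumptions of this sketch:

```python
def generative_prefix(shots, query_da):
    # shots: list of (dialogue_act, sentence) pairs; query_da is the dialogue act to verbalize.
    lines = [f"{da} {sent}" for da, sent in shots]
    lines.append(query_da)
    return "\n".join(lines)

# Example usage (hypothetical dialogue-act string, formatted as speech-act plus slot-value pairs):
# prime(generative_prefix(shots, "inform ( name = alimentum ; food = italian )"), max_new_tokens=40)
```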
4 Experiments and Results
We use different prefix styles depending on the task, and we compare the results of LM few-shot priming with those of existing finetuning-based models. In all the experiments, we use a different number of shots per task, since different tasks fit more or fewer samples into the 1024-token maximum input size of GPT-2.
NLU
We use the SNIPS Coucke et al. (2018) dataset to evaluate the SLOT-FILLING and INTENT recognition tasks. For the SLOT-FILLING task, we follow the few-shot setting of Liu et al. (2020), and we use the official CoNLL F1 scorer as the evaluation metric. For the INTENT classification, as a baseline we fine-tune RoBERTa Liu et al. (2019) with 10 samples, and we use accuracy as the evaluation metric. We use a value-based LM prefix for the SLOT-FILLING task with a maximum of 15 shots, and a binary LM prefix for the INTENT classification task with a maximum of 10 shots. An example of a prefix for the SLOT-FILLING task and the few-shot performance evaluation are shown in Figure 2. Tables 1 and 2 and Figure 5 show more detailed results.
DST
We use the MultiWoZ Budzianowski et al. (2018); Eric et al. (2019); Zang et al. (2020) dataset to evaluate the DST task. Differently from other works, we use only the last user utterance as input to the model, and we update the predicted DST through the turns. For the few-shot evaluation, we follow the setting of Wu et al. (2020), and we report the joint and slot accuracy. As baselines, we use TOD-BERT Wu et al. (2020) and BERT Devlin et al. (2019) fine-tuned with 10% of the training data, which is equivalent to 500 examples. We use a value-based LM prefix, as for the SLOT-FILLING task, with a maximum of 15 shots due to the limited context. Table 4 and Figure 6 show more detailed results.
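A minimal sketch of this turn-level update, assuming the predict_slots helper from the earlier sketch returns the per-turn slot values:

```python
def track_dialogue(user_turns, shots_per_slot):
    # Run the value-based prefixes on the last user utterance only,
    # then merge the non-empty predictions into the running dialogue state.
    state = {}
    for utterance in user_turns:
        turn_prediction = predict_slots(shots_per_slot, utterance)
        for slot, value in turn_prediction.items():
            if value and value != "None":  # keep the previous value when the slot is not mentioned
                state[slot] = value
    return state
```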
ACT
We use the MultiWoZ dataset to evaluate the speech-act (ACT) identification task. Differently from other works, only the system utterance is used as input to the model, instead of including the dialogue history and the user utterance as in Wu et al. (2020). For the few-shot evaluation, we follow the setting of Wu et al. (2020) and report the F1-score. As baselines, we use TOD-BERT Wu et al. (2020) and BERT Devlin et al. (2019), fine-tuned with 10% of the training data, which is equivalent to 500 examples. We use a binary LM prefix, as for the intent classification task, with a maximum of 15 shots due to the limited context. An example of a prefix for the ACT task and the few-shot performance evaluation are shown in Figure 3. Table 3 and Figure 7 show more detailed results.
NLG
We use the FewShotWOZ Peng et al. (2020b) dataset to evaluate the NLG task. For the few-shot evaluation, we follow the setting of Peng et al. (2020b) and use BLEU and the slot error rate (SLR) as metrics. We use SC-LSTM, GPT-2, and SC-GPT Peng et al. (2020b) as baselines, all fine-tuned with 50 examples from the training data. We use a generative LM prefix with a maximum of 20 shots due to the limited context. An example of a prefix for the NLG task and the few-shot performance evaluation are shown in Figure 4. Tables 5 and 6, and Figure 8, show more detailed results.
5 Analysis and Limitation
From the experimental results, we observe that:
- The larger the model, the better the performance in both the NLU and NLG tasks; instead, in the DST and ACT tasks, GPT-2 LARGE (762M) performs better than the XL (1.54B) version. This is quite counterintuitive given the results reported for GPT-3. Further investigation is required to understand whether changing the prefix can help improve the performance of larger models;
- In NLU, ACT and NLG, LM priming few-shot learning shows promising results, achieving similar or better performance than the weakest finetuning-based baseline, even though that baseline uses a larger number of shots. On the other hand, in DST the gap with the existing baselines is still large.
We also observe two limitations of the LM priming:
- Using binary and value-based generation requires as many forwards as the number of classes or slots. Although these forward passes are independent, achieving few-shot learning this way is not as effective as directly generating the class or the tag (e.g., in NLU). In early experiments, we tried to convert all the tasks into a generative format, thus making the model directly generate the sequence of tags or the class label. Unfortunately, the results with the generative format were poor, but we are unsure whether larger LMs such as GPT-3 can perform better.
- The current maximum input length of GPT-2 (1024 tokens) greatly limits the number of shots that can be provided to the model. Indeed, in most of the tasks, no more than 15 shots can be provided, making it incomparable with existing models that use a larger number of shots (see the sketch after this list).
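A rough way to check how many shots fit in the context window, reusing the tokenizer from the earlier sketch; the 16-token reserve for the answer is an arbitrary choice:

```python
def max_shots(shot_lines, query_line, budget=1024, reserve=16):
    # Count how many shot lines fit before the prefix plus the query exceeds the context window.
    used = len(tokenizer.encode(query_line)) + reserve  # keep room for the generated answer
    n_fit = 0
    for line in shot_lines:
        used += len(tokenizer.encode(line + "\n"))
        if used > budget:
            break
        n_fit += 1
    return n_fit
```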
6 Conclusion
In this paper, we demonstrate the potential of LM priming few-shot learning on the most common task-oriented dialogue system tasks (NLU, DST, ACT and NLG). Our experiments show that in most of the tasks larger LMs are better few-shot learners, confirming the hypothesis in Brown et al. (2020), and, in some cases, they can also achieve similar or better results than the weakest finetuning-based baseline. Finally, we unveil two limitations of the current LM priming few-shot learning: the computational cost and the limited context size. In future work, we plan to benchmark dialogue-specific models (e.g., DialoGPT) and LMs with longer context sizes (e.g., Transformer-XL Dai et al. (2019), Longformer Beltagy et al. (2020), and BigBird Zaheer et al. (2020)). We also plan to investigate adversarial triggers Wallace et al. (2019) for improving the few-shot ability of LMs, and to benchmark end-to-end dialogue tasks.
References
- Beltagy et al. (2020) Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.
- Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Eric et al. (2019) Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. Multiwoz 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
- Kale and Rastogi (2020) Mihir Kale and Abhinav Rastogi. 2020. Few-shot natural language generation by rewriting templates. arXiv preprint arXiv:2004.15006.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Liu et al. (2020) Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung. 2020. Coach: A coarse-to-fine approach for cross-domain slot filling. arXiv preprint arXiv:2004.11727.
- Madotto et al. (2019) Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459.
- Mi et al. (2019) Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. 2019. Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3151–3157. AAAI Press.
- Peng et al. (2020a) Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020a. Soloist: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv preprint arXiv:2005.05298.
- Peng et al. (2020b) Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020b. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
- Qian and Yu (2019) Kun Qian and Zhou Yu. 2019. Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162.
- Wu et al. (2020) Chien-Sheng Wu, Steven Hoi, Richard Socher, and Caiming Xiong. 2020. Tod-bert: Pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871.
- Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743.
- Xu et al. (2020) Yumo Xu, Chenguang Zhu, Baolin Peng, and Michael Zeng. 2020. Meta dialogue policy learning. arXiv preprint arXiv:2006.02588.
- Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
- Zang et al. (2020) Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. Multiwoz 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. arXiv preprint arXiv:2007.12720.
Appendix A Appendices
SLOT-FILLING
add tune to my hype playlist entity_name = None
add to playlist confidence boost here comes entity_name = here comes
add the track bg knocc out to the rapcaviar playlist entity_name =
INTENT
listen to westbam alumb allergic on google music playmusic = true
rate this novel 4 points out of 6 playmusic = false
add sabrina salerno to the grime instrumentals playlist playmusic =
Table 1: SLOT-FILLING few-shot performance (CoNLL F1, %) on SNIPS, per domain.
Model | Shots | PlayL | Rest. | Weather | PlayM. | RateBook | SearchC. | Find. | Avg
gpt2 | 1 | 31.9008 | 8.0350 | 16.4160 | 32.8150 | 43.6023 | 20.4974 | 24.4994 | 25.3951 |
gpt2 | 10 | 46.2546 | 21.6707 | 19.1909 | 21.4724 | 56.2280 | 38.0345 | 41.7234 | 34.9392 |
gpt2 | 15 | 54.7410 | 26.4663 | 17.7377 | 28.3369 | 63.8482 | 41.3968 | 47.0525 | 39.9399 |
gpt2-large | 1 | 54.7548 | 39.4418 | 23.5223 | 20.8827 | 38.3591 | 26.6576 | 43.0562 | 35.2392 |
gpt2-large | 10 | 71.6635 | 39.2936 | 27.7395 | 48.1905 | 61.4562 | 44.4720 | 53.8340 | 49.5213 |
gpt2-large | 15 | 71.6569 | 45.5142 | 30.7992 | 46.3439 | 61.7858 | 42.8394 | 61.1420 | 51.4402 |
gpt2-xl | 1 | 53.8250 | 26.2185 | 23.1651 | 28.7647 | 37.1651 | 37.4536 | 31.0224 | 33.9449 |
gpt2-xl | 10 | 70.4698 | 40.5039 | 34.7138 | 40.4731 | 74.3899 | 52.0532 | 64.4166 | 53.8600 |
gpt2-xl | 15 | 67.9448 | 46.9853 | 30.8481 | 44.4646 | 77.1531 | 51.8732 | 67.0917 | 55.1944 |
Table 2: INTENT classification on SNIPS.
Model | Shots | Micro-F1 | Macro-F1 | Acc (%)
gpt2 | 1 | 0.1600 | 0.1553 | 16.0000 |
gpt2 | 2 | 0.1671 | 0.1034 | 16.7143 |
gpt2 | 5 | 0.3443 | 0.3223 | 34.4286 |
gpt2 | 10 | 0.3600 | 0.3715 | 36.0000 |
gpt2-large | 1 | 0.1343 | 0.1188 | 13.4286 |
gpt2-large | 2 | 0.3786 | 0.3946 | 37.8571 |
gpt2-large | 5 | 0.3300 | 0.3175 | 33.0000 |
gpt2-large | 10 | 0.5514 | 0.5871 | 55.1429 |
gpt2-xl | 1 | 0.1700 | 0.1346 | 17.0000 |
gpt2-xl | 2 | 0.3586 | 0.3166 | 35.8571 |
gpt2-xl | 5 | 0.4671 | 0.4371 | 46.7143 |
gpt2-xl | 10 | 0.7300 | 0.7450 | 73.0000 |
Table 3: ACT identification on MultiWoZ.
Model | Shots | Micro-F1 (%) | Macro-F1 (%) | Acc
gpt2 | 5 | 73.9364 | 54.7965 | 0.7394 |
gpt2 | 10 | 78.5699 | 59.6442 | 0.7857 |
gpt2 | 15 | 78.0943 | 59.8866 | 0.7809 |
gpt2-large | 5 | 78.7105 | 62.2181 | 0.7871 |
gpt2-large | 10 | 83.5762 | 68.6824 | 0.8358 |
gpt2-large | 15 | 83.5102 | 68.2287 | 0.8351 |
gpt2-xl | 5 | 63.1241 | 52.8427 | 0.6312 |
gpt2-xl | 10 | 75.4120 | 62.2672 | 0.7541 |
gpt2-xl | 15 | 77.7434 | 63.0193 | 0.7774 |
Table 4: DST on MultiWoZ.
Model | Shots | Joint Acc (%) | Slot Acc (%)
gpt2 | 5 | 0.7 | 79.8 |
gpt2 | 10 | 0.8 | 78.7 |
gpt2 | 15 | 0.6 | 79.7 |
gpt2-large | 5 | 2.5 | 82.7 |
gpt2-large | 10 | 2.6 | 83.2 |
gpt2-large | 15 | 3.5 | 83.5 |
gpt2-xl | 5 | 2.2 | 81.4 |
gpt2-xl | 10 | 2.1 | 80.4 |
gpt2-xl | 15 | 2.0 | 81.8 |
Table 5: NLG on FewShotWOZ, BLEU per domain.
Model | Shots | restaurant | laptop | hotel | tv | attraction | train | taxi | Avg
SC-LSTM | 50 | 15.90 | 21.98 | 31.30 | 22.39 | 7.76 | 6.08 | 11.61 | 16.71 |
GPT-2 | 50 | 29.48 | 27.43 | 35.75 | 28.47 | 16.11 | 13.72 | 16.27 | 23.89 |
SC-GPT | 50 | 38.08 | 32.73 | 38.25 | 32.95 | 20.69 | 17.21 | 19.70 | 28.51 |
gpt2 | 5 | 9.93 | 17.75 | 14.85 | 16.29 | 5.50 | 0.26 | 5.01 | 9.94 |
gpt2 | 10 | 8.10 | 17.75 | 16.85 | 16.29 | 5.84 | 1.30 | 4.71 | 10.12 |
gpt2 | 20 | 10.68 | 17.75 | 19.15 | 16.29 | 4.89 | 3.24 | 7.28 | 11.32 |
gpt2-large | 5 | 10.60 | 24.42 | 13.92 | 24.58 | 7.38 | 0.73 | 7.86 | 12.78 |
gpt2-large | 10 | 13.10 | 24.42 | 20.68 | 24.58 | 6.68 | 3.18 | 6.25 | 14.13 |
gpt2-large | 20 | 11.47 | 24.42 | 16.13 | 24.58 | 7.97 | 5.30 | 9.36 | 14.18 |
gpt2-xl | 5 | 13.65 | 23.39 | 14.26 | 26.61 | 6.96 | 0.74 | 6.59 | 13.17 |
gpt2-xl | 10 | 14.51 | 23.39 | 19.42 | 26.61 | 8.21 | 4.00 | 6.40 | 14.65 |
gpt2-xl | 20 | 17.02 | 23.39 | 21.30 | 26.61 | 6.43 | 5.68 | 9.06 | 15.64 |
Table 6: NLG on FewShotWOZ, slot error rate (SLR, %) per domain; lower is better.
Model | Shots | restaurant | laptop | hotel | tv | attraction | train | taxi | Avg
SC-LSTM | 50 | 48.02 | 80.48 | 31.54 | 64.62 | 367.12 | 189.88 | 61.45 | 120.44 |
GPT-2 | 50 | 13.47 | 11.26 | 11.54 | 9.44 | 21.10 | 19.26 | 9.52 | 13.65 |
SC-GPT | 50 | 3.89 | 3.39 | 2.75 | 3.38 | 12.72 | 7.74 | 3.57 | 5.35 |
gpt2 | 5 | 60.48 | 60.84 | 73.63 | 72.66 | 81.79 | 60.54 | 66.67 | 68.09 |
gpt2 | 10 | 72.75 | 60.84 | 78.02 | 72.66 | 80.49 | 88.75 | 59.52 | 73.29 |
gpt2 | 20 | 70.36 | 60.84 | 74.18 | 72.66 | 67.20 | 68.96 | 55.95 | 67.16 |
gpt2-large | 5 | 55.39 | 36.33 | 84.62 | 44.02 | 64.31 | 58.11 | 44.05 | 55.26 |
gpt2-large | 10 | 57.49 | 36.33 | 62.09 | 44.02 | 52.31 | 73.27 | 25.00 | 50.07 |
gpt2-large | 20 | 48.20 | 36.33 | 85.71 | 44.02 | 56.07 | 61.35 | 32.14 | 51.98 |
gpt2-xl | 5 | 44.61 | 29.99 | 67.03 | 37.92 | 67.63 | 55.82 | 44.05 | 49.58 |
gpt2-xl | 10 | 46.41 | 29.99 | 47.80 | 37.92 | 50.87 | 62.36 | 22.62 | 42.57 |
gpt2-xl | 20 | 44.61 | 29.99 | 68.68 | 37.92 | 56.50 | 52.93 | 30.95 | 45.94 |