
Peking University
{xinyi.shen,linzuoquan}@pku.edu.cn

An Empirical Study on Context Length for Open-Domain Dialog Generation

Xinyi Shen    Zuoquan Lin
Abstract

Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of the dialog history. However, there is no criterion for deciding how many utterances should be kept in the context. We investigate how the choice of context length affects the model. We experiment on three questions, from coarse to fine: (i) Does a longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models. Code is available at https://github.com/PKUAI-LINGroup/context-study.

1 Introduction

Since the advent of the Transformer [10], language models trained on large-scale corpora have dominated machine translation and other NLP tasks, including open-domain dialog generation [11, 14]. Despite their success, Transformer-based dialog models are often criticized for not understanding the dialog context [9, 8], which can lead to generic responses [4] or self-contradictions [3]. For Transformer-based dialog models, the context is usually represented as a concatenation of historical utterances. However, there is no uniform standard for deciding how many utterances to keep in the context. For example, Meena [1] limited the context to no more than seven utterances, while PLATO [2] limited the total length of the context sequence to no more than 256 tokens. It is unclear whether the context lengths these models chose are optimal, or how changing the context length would affect model performance.

In this paper, we focus on the setting of context length in Transformer-based dialog models. We pose three questions about the possible impact of context length on the model: (i) Does a longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Regarding model selection, since we care about the impact of context length on the model rather than absolute performance, we adopt the two most basic practices for implementing a dialog model: training a Transformer from scratch and fine-tuning a pre-trained GPT2 [7] model. Although the performance of these two models is not comparable with current state-of-the-art chatbots, such as ChatGPT (https://chat.openai.com/), we believe that studying these classic paradigms can help us better understand and leverage context when designing Transformer-based dialog models.

Our experimental results are summarized by the following three findings:

  • Considering both performance and efficiency, a longer context is not necessarily better for Transformer-based dialog models.

  • The best-performing models on the entire set perform well on dialogs with varying history lengths, so there is no need to train separate models for dialogs of different lengths.

  • For different dialog samples, the optimal context length at test time is different. Choosing a specific context length for each sample during the testing phase further improves model performance.

2 Experimental Setup

We treat the response generation problem as conditional language modeling. We denote a multi-turn dialog as $(u_1, u_2, \cdots, u_T)$, where $\{u_{2k}\}_{k=1}^{\lfloor T/2 \rfloor}$ are utterances from one speaker and $\{u_{2k-1}\}_{k=1}^{\lceil T/2 \rceil}$ are those from the other. The model is trained to maximize the conditional probability $P(u_T \mid C; \theta)$, where $C = (u_{T-N}, \ldots, u_{T-1})$ is the context (dialog history), $N$ is the context window size, and $\theta$ denotes the model parameters. We investigate the impact of context length on the model by controlling the size of $N$ during training and testing.
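The following minimal sketch illustrates this setup; the helper name and the string-based representation of utterances are illustrative simplifications rather than our exact preprocessing code.

```python
from typing import List, Tuple

def build_sample(dialog: List[str], t: int, n: int) -> Tuple[List[str], str]:
    """Return (context, response) for predicting utterance u_t (1-indexed),
    keeping at most the last n utterances as context."""
    assert t >= 2, "the first utterance of a dialog has no context"
    context = dialog[max(0, t - 1 - n): t - 1]  # C = (u_{t-n}, ..., u_{t-1})
    response = dialog[t - 1]                    # u_t
    return context, response

# Example: predict the 4th utterance of a 4-turn dialog with N = 2.
dialog = ["Hi!", "Hello, how are you?", "Fine, thanks. You?", "Great."]
context, response = build_sample(dialog, t=4, n=2)
# context  -> ["Hello, how are you?", "Fine, thanks. You?"]
# response -> "Great."
```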

Experiments are conducted on two widely used open-domain dialog datasets: DailyDialog [5] and PersonaChat [13]. For each multi-turn dialog, we train (or test) the model on every utterance except the first one. We study the effect of context length on dialog models built on the Transformer and GPT2. Specifically, we implement a Transformer model with three encoder layers, three decoder layers, two attention heads, and 256 hidden dimensions and train it from scratch on our experimental datasets. For GPT2, we choose the small version with 12 layers, 12 attention heads, and 768 hidden dimensions and initialize the model with the pre-trained parameters released by HuggingFace [12] (https://github.com/huggingface/transformers). Models are optimized with AdamW [6]. The model checkpoints that perform best on the validation set are selected for testing. We choose perplexity as the metric because of its strong correlation with human judgment [1] and its wide use in dialog model evaluation [9, 3, 11].
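For reference, a minimal sketch of the two model setups in the HuggingFace/PyTorch ecosystem is shown below; embedding layers and the output head of the from-scratch Transformer are omitted, and hyperparameters such as the learning rate are assumptions, so this is illustrative rather than our exact training code.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

# From-scratch setup: a small encoder-decoder Transformer
# (3 encoder layers, 3 decoder layers, 2 attention heads, 256 hidden dims).
# Token embeddings and the output projection are omitted for brevity.
scratch_model = nn.Transformer(
    d_model=256, nhead=2,
    num_encoder_layers=3, num_decoder_layers=3,
)

# Fine-tuning setup: pre-trained GPT2-small
# (12 layers, 12 heads, 768 hidden dims) from HuggingFace Transformers.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

# Both models are optimized with AdamW; the learning rate is a placeholder.
optimizer = torch.optim.AdamW(gpt2.parameters(), lr=5e-5)
```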

3 Results and Discussion

3.1 Does longer context help model training?

Figure 1: Perplexity of models trained under different context length settings on the (a) DailyDialog and (b) PersonaChat test sets. The x-axis represents the maximum number of dialog turns allowed in the context when training the model. ‘x’ marks a context length whose perplexity gain is less than 0.1.

We first focus on the effect of context length on model training. Due to computational constraints, it is often impossible to feed the entire dialog history into the model. Intuitively, giving the model as much history as possible during training should help it learn to generate responses, since more information is available. But is this the case for Transformer-based dialog models? To figure this out, we compare models trained with contexts of different lengths. As shown in Fig. 1, although GPT2 outperforms the Transformer on all context length settings, we observe similar trends for both models: initially, increasing the number of history utterances in the context improves performance, but after the context reaches a certain length, continuing to grow the context length is no longer effective. To quantify the effect of increasing the context length, we define the perplexity gain $G_i$ as the gain brought by increasing the training context length to $i$:

$G_i = \min_{1 \leq j < i} p_j - p_i$,   (1)

where $p_j$ is the test perplexity of the model trained with context length $j$. A positive $G_i$ means that increasing the training context length to $i$ improves performance, and a larger $G_i$ means a more significant improvement. As shown in Fig. 1, when the training context length exceeds 5 on DailyDialog and 9 on PersonaChat, increasing the context length either makes the model performance worse or brings minimal gain. This result suggests that, for Transformer-based dialog models, whether trained from scratch or fine-tuned from pre-trained models, the context length limit at training time must be chosen carefully. Although a longer training context does not necessarily lead to worse model performance, it does incur unnecessary computational costs.
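A short sketch of Eq. (1) follows; the perplexity values in the example are placeholders rather than numbers from our experiments.

```python
def perplexity_gain(ppl_by_length, i):
    """ppl_by_length[j] is the test perplexity of the model trained with
    context length j; returns G_i = min_{1 <= j < i} p_j - p_i."""
    return min(ppl_by_length[j] for j in range(1, i)) - ppl_by_length[i]

# Example with made-up perplexities for training context lengths 1..4.
ppl = {1: 20.0, 2: 18.5, 3: 18.2, 4: 18.3}
print(perplexity_gain(ppl, 3))  # 18.5 - 18.2 = 0.3   -> positive gain
print(perplexity_gain(ppl, 4))  # 18.2 - 18.3 ~= -0.1 -> no gain
```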

3.2 Is it necessary to change the training context length when dealing with dialogs of different context lengths?

Table 1: Perplexity gap between the overall-optimal and group-optimal models. The numbers in parentheses are the maximum context length for samples in each group. ‘-’ means that the overall best-performing model is also the best in this group.
Model       | DailyDialog: short(3) | medium(6) | long(25) | PersonaChat: short(4) | medium(8) | long(25)
Transformer | 0.10 | 0.13 | - | 0.10 | - | -
GPT2        | 0.09 | - | - | 0.20 | 0.13 | -

Previous results concern the overall effect of training context length on the model. But if we take a closer look at the datasets, we find that the context length of the samples varies a lot, ranging from 1 to 25 in both test sets. So here we raise a new question: do dialogs of different lengths have the same preference for models? To answer this question, we group the test data according to context length and compare the performance of models trained with different context lengths on each group separately. We denote the model that achieves the lowest perplexity on the entire set as $\mathcal{M}$ and the model that achieves the lowest perplexity on group $g$ as $\mathcal{M}_g$. For each $g \in \{\text{short}, \text{medium}, \text{long}\}$, we measure the gap between $\mathcal{M}$ and $\mathcal{M}_g$ as

$P_{\mathcal{M}}(g) - P_{\mathcal{M}_g}(g)$,   (2)

where $P_{\mathcal{M}}(g)$ is the perplexity of $\mathcal{M}$ on group $g$. As shown in Table 1, $\mathcal{M}$ is optimal on half of all groups. On the remaining groups, the gap between $\mathcal{M}$ and the group-optimal model is quite small. This result suggests that dialogs of different lengths do not have a clear preference for context length in the training phase. The model that performs best on the entire set is a proper choice for dialogs with varying history lengths.
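A sketch of how the gap in Eq. (2) can be computed from a table of per-group perplexities is given below; the dictionary layout and keys are assumptions for illustration, not our evaluation code.

```python
def group_gap(ppl, g):
    """ppl[m][g] is the perplexity on length group g of the model trained
    with context length m; ppl[m]["all"] is its perplexity on the full set."""
    overall_best = min(ppl, key=lambda m: ppl[m]["all"])  # model M
    group_best = min(ppl, key=lambda m: ppl[m][g])        # model M_g
    return ppl[overall_best][g] - ppl[group_best][g]
```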

3.3 Do different samples have the same preference for context length?

Figure 2: The proportion of test samples that achieve optimal perplexity under different test context lengths. Panels show (a) $\mathcal{D}_2$, (b) $\mathcal{D}_5$, and (c) $\mathcal{D}_{\geq 10}$ ($=\bigcup_{i\geq 10}\mathcal{D}_i$), as representatives of samples with short, medium, and long contexts. The Transformer and GPT2 models trained with context length 10 are used as test models.
Table 2: Perplexity reduction on the DailyDialog test set when using the optimal context length.
Model       | $\mathcal{D}_1$ | $\mathcal{D}_2$ | $\mathcal{D}_3$ | $\mathcal{D}_4$ | $\mathcal{D}_5$ | $\mathcal{D}_6$ | $\mathcal{D}_7$ | $\mathcal{D}_8$ | $\mathcal{D}_9$ | $\mathcal{D}_{\geq 10}$ | all
Transformer | 0 | 0.84 | 1.05 | 1.12 | 1.58 | 1.46 | 1.44 | 1.58 | 1.26 | 1.75 | 1.09
GPT2        | 0 | 0.28 | 0.49 | 0.56 | 0.66 | 0.71 | 0.76 | 0.78 | 0.76 | 0.82 | 0.51

Previous experiments reflect the average performance on the test set, but not all dialog samples benefit from a long context. To illustrate this, we split the test set according to context length, where $\mathcal{D}_i$ consists of all samples with context length $i$. For each sample in $\mathcal{D}_i$, we use a trained model to compute its perplexity under all available test context length settings. Then, for each test context length, we count the proportion of samples in each group that achieve their optimal perplexity at that length. Fig. 2 shows the results on DailyDialog. No matter which test model is used, a non-negligible proportion of samples achieves optimal perplexity at each test context length. Although most samples achieve optimal perplexity with the longest test context, this ratio shrinks as the dialog history length increases, which indicates that setting a uniform test history length for all dialogs may not be the best practice.

Furthermore, we show to what extent setting a different context length for each sample during the testing phase can improve the model's performance. For each sample, we define its optimal context length as the test context length that yields the lowest perplexity. We compare the gap between testing with the maximum context length and with the optimal context length on each group and on the whole test set. As shown in Table 2, using the optimal context length improves the performance of the model in every group, especially on dialogs with longer histories. The improvement is particularly noticeable for the Transformer, where we observe gains of more than 1 point in most groups. It is surprising that removing part of the history information during the test phase can improve the test performance of the model so much. However, the optimal context length is unavailable in practice because we cannot compute the perplexity without the real responses. We have to determine the context length from the context itself, which is left to future work.
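The per-sample selection can be sketched as follows; `eval_perplexity` is a hypothetical helper (not an actual API) standing in for a forward pass that scores the reference response under the truncated context.

```python
def optimal_context_length(model, history, response, max_len, eval_perplexity):
    """Evaluate a sample under every available test context length and
    return the length that yields the lowest perplexity."""
    best_len, best_ppl = None, float("inf")
    for n in range(1, min(max_len, len(history)) + 1):
        context = history[-n:]  # keep only the last n utterances
        ppl = eval_perplexity(model, context, response)  # hypothetical helper
        if ppl < best_ppl:
            best_len, best_ppl = n, ppl
    return best_len, best_ppl
```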

4 Conclusion

We conducted an empirical study on the context length of Transformer-based open-domain dialog models. We found that a carefully chosen context length balances performance and efficiency and that the overall best-performing model performs equally well on conversation data of different lengths. We pointed out that choosing the context length individually for each sample during the testing phase significantly improves the performance of the model.

For a dialog model to perform well, the context length used in the training phase needs to be chosen carefully. To further improve performance, a promising direction is to learn the appropriate context length within the model itself.

References

  • [1] Adiwardana, D., et al.: Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977 (2020)
  • [2] Bao, S., He, H., Wang, F., Wu, H., Wang, H.: PLATO: Pre-trained dialogue generation model with discrete latent variable. In: Proc. of ACL (2020)
  • [3] Kim, H., Kim, B., Kim, G.: Will I sound like me? An improvement of persona consistency in dialogues through pragmatic self-consciousness. In: Proc. of EMNLP (2020)
  • [4] Li, J., Galley, M., Brockett, C., Gao, J., Dolan, B.: A diversity-promoting objective function for neural conversation models. In: Proc. of NAACL (2016)
  • [5] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: A manually labelled multi-turn dialogue dataset. In: Proc. of IJCNLP (2017)
  • [6] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Proc. of ICLR (2019)
  • [7] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog (2019)
  • [8] Saleh, A., Deutsch, T., Casper, S., Belinkov, Y., Shieber, S.: Probing neural dialog models for conversational understanding. In: Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI (2020)
  • [9] Sankar, C., Subramanian, S., Pal, C., Chandar, S., Bengio, Y.: Do neural dialog systems use the conversation history effectively? An empirical study. In: Proc. of ACL (2019)
  • [10] Vaswani, A., et al.: Attention is all you need. In: Proc. of NeurIPS (2017)
  • [11] Wolf, T., Sanh, V., Chaumond, J., Delangue, C.: TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149 (2019)
  • [12] Wolf, T., et al.: Transformers: State-of-the-art natural language processing. In: Proc. of EMNLP (2020)
  • [13] Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., Weston, J.: Personalizing dialogue agents: I have a dog, do you have pets too? In: Proc. of ACL (2018)
  • [14] Zhang, Y., et al.: DialoGPT: Large-scale generative pre-training for conversational response generation. In: Proc. of ACL (2020)