
Emotion Eliciting Machine: Emotion Eliciting Conversation Generation
based on Dual Generator

Hao Jiang¹, Yutao Zhu², Xinyu Zhang¹, Zhicheng Dou³, Pan Du², Te Pi¹, Yantao Jia¹
¹Huawei Poisson Lab., Hangzhou, Zhejiang, China
²Université de Montréal, Montréal, Québec, Canada
³Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
{jianghao66,zhangxinyu35,pite,jiayantao}@huawei.com, {yutao.zhu,pan.du}@umontreal.ca, dou@ruc.edu.cn

Abstract

Recent years have witnessed great progress on building emotional chatbots. Numerous methods have been proposed for chatbots to generate responses with given emotions. However, the emotion changes of the user during the conversation have not been fully explored. In this work, we study the problem of positive emotion elicitation, which aims to generate responses that can elicit positive emotion of the user, in human-machine conversation. We propose a weakly supervised Emotion Eliciting Machine (EEM) to address this problem. Specifically, we first collect weak labels of user emotion status changes in a conversation based on a pre-trained emotion classifier. Then we propose a dual encoder-decoder structure to model the generation of responses on both the positive and negative sides based on the changes of the user’s emotion status in the conversation. An emotion eliciting factor is introduced on top of the dual structure to balance the positive and negative emotional impacts on the generated response during emotion elicitation. The factor also provides fine-grained control over emotion elicitation. Experimental results on a large real-world dataset show that EEM outperforms the existing models in generating responses with positive emotion elicitation.

1 Introduction

Building an emotionally intelligent agent, which can perceive, integrate, understand, and regulate emotions, is one of the ultimate goals of artificial intelligence Salovey and Mayer (1990); Picard (2000). In particular, emotional chatbots are attracting increasing research interest due to their wide application in intelligent assistants, emotional companionship, and online consultation Zhou et al. (2018); Huang et al. (2018); Shantala et al. (2018); Zhou and Wang (2018); Colombo et al. (2019); Shen and Feng (2020).

Figure 1: An example of positive emotion elicitation in conversation. R1 is a response that neglects the user’s emotion. R2-R6 are several responses with different granularities (from most positive to less positive) of positive emotion elicitation.

Existing studies on emotional chatbots can be roughly categorized into two groups. The first group of methods usually requires chatbots to generate responses with a predefined emotion Zhou et al. (2018); Huang et al. (2018); Zhou and Wang (2018); Shen and Feng (2020), while the current emotion status of the user that the machine is chatting with is neglected during response generation. As a result, if a user worries about a problem they have encountered, such as the example shown in Figure 1, a chatbot with the given emotion label “happy” may generate a response like “I’m happy to hear that!”, which may make the user feel uncomfortable. This example shows that a response with positive emotion (such as R1) may fail to make the user happier. To solve this problem, the other group of approaches focuses on modeling the emotion change (reaction) of the user rather than merely controlling the emotion of responses generated by the chatbots Hasegawa et al. (2013); Lubis et al. (2018, 2019). These methods aim at eliciting positive emotion of the user through the conversation.¹ For example, in Figure 1, R2-R4 are some typical generated responses that can bring the user positive emotions to some extent. We study the latter problem in this work: how to generate responses that elicit positive emotions of the user.

¹ The model can also be used to elicit the negative emotion of the user, but it is less applied in practice.

Hasegawa et al. (2013) leverage a statistical translation model for emotion eliciting in human-machine conversation. The model is initially trained on a general-purpose conversation corpus, then fine-tuned on datasets with emotion labels. Lubis et al. (2018, 2019) trained a hierarchical recurrent neural network to generate emotional responses based on a positive-emotion eliciting corpus. With manually collected labels about emotions, these methods have successfully given chatbots the capability of emotion elicitation. However, there are two major drawbacks in the existing methods. First, most existing methods rely on positive emotion labels. Negative samples are usually neglected based on an assumption that they are useless for training an emotion eliciting model Hasegawa et al. (2013); Lubis et al. (2018, 2019). We consider that the emotions expressed by human utterances are often too subtle to be captured with positive emotion labels alone, as the complicated expression “I’m so sad, but I can’t cry.” in Figure 1 demonstrates. Correspondingly, an ideal emotion eliciting bot should be able to adapt to subtle expressions by balancing the positive and negative feelings to avoid extreme mistakes, such as responding “I’m happy to hear that!” to the above expression. Second, the emotion labels are often collected manually in the existing methods, which is quite labor-intensive and thus makes them less practical, especially when the training corpus is large.

To address these issues, in this paper, we propose a weakly supervised Emotion Eliciting Machine (EEM) to generate responses that adaptively elicit positive emotions. Specifically, we first leverage an emotion change classifier to generate weak supervision signals based on the predicted emotion scores of the utterances in the dialogue session. We leverage the difference of the emotion score between two consecutive utterances of a user to measure the effect of emotion elicitation of the utterance made by the other user. More specifically, supposing $(U_1, R_1, U_2)$ is a triplet of consecutive utterances, where $U_1$ and $U_2$ are utterances of a user and $R_1$ is the response made by another user to $U_1$, the difference of the emotion scores between $U_1$ and $U_2$ is used to indicate the effect of emotion elicitation of $R_1$. With this method, we can generate a large number of weakly labeled samples for our model training² and save a great deal of human labor. With these generated labels, we further introduce an emotion eliciting factor on top of a dual-generator structure to generate responses that can balance positive and negative feelings according to the emotion changes. The positive generator aims to emphasize positive emotions during the response generation process, while the negative generator focuses on the negative ones. Our method learns to dynamically balance the influences of the positive and negative generators by an emotion eliciting factor during the training procedure. In this way, we can fully utilize all utterances for training a chatbot, and at the same time, we can learn a fine-grained emotion elicitation model. In the inference stage, the emotion eliciting factor can be arbitrarily set to control the positivity of emotion elicitation in the generated responses.
Experimental results on a large-scale real human conversation corpus (Reddit) show that our method can be well-trained with the obtained weak labels and outperforms existing state-of-the-art methods on the task of emotion elicitation.

² Our experiments demonstrate that these weak labels have high consistency with human labeling.

The main contributions of this work are as follows:

(1) We propose EEM for emotion eliciting conversation generation, which can make use of not only positive emotion samples as in the literature, but also the negative ones, so as to avoid generating positive responses in inappropriate contexts.

(2) We propose two novel mechanisms to improve the adaptability and controllability of the model: 1) an emotion eliciting factor that balances the positive and negative emotional impacts of a response; 2) a dual structure that models the generation of responses on both the positive and negative sides to obtain dynamic control over the positivity of emotion elicitation.

(3) The results on the large scale Reddit conversation corpus show that EEM outperforms the existing models in both automatic and human evaluation measures.

2 Related Work

Open-domain Chatbots and Response Generation. Existing methods on building open-domain chatbots can be categorized into two groups: retrieval-based and generation-based. Retrieval-based methods try to find the most reasonable response from a large repository of conversational data Lowe et al. (2015); Wu et al. (2017); Yuan et al. (2019). On the other hand, generation-based methods are mainly based on the sequence-to-sequence (Seq2seq) architecture with attention mechanism and aim at generating a new response for the conversation context Sordoni et al. (2015b); Li et al. (2015); Niu and Bansal (2018); Baheti et al. (2018). In this study, we focus on generation-based chatbot modeling.

Emotional Chatbots. An emotional chatting model is a special type of controllable conversation model that aims to control the emotion expressed in responses. Existing methods can be divided into two categories according to their purposes: (1) Some methods focus on controlling the emotion of chatbots. For example, Zhou et al. (2018) proposed an emotional chatting model with neural networks, which takes the target emotion category of responses as an additional input of the decoder, and then generates emotional responses with the proposed internal and external emotion memory. Shen and Feng (2020) designed a dual task to generate emotional responses and emotional queries alternately and improved the content consistency in emotion-controllable response generation. These methods can control the chatbot to generate responses with a pre-defined emotion, but they neglect the influence of the generated response on the user’s emotion. (2) Other methods pay less attention to the emotion of the chatbot but consider the emotional influence of the generated responses on the user; these are usually called emotion eliciting models. Hasegawa et al. (2013) first studied the task of eliciting a specific emotion in conversation. They trained a machine translation model on utterances exhibiting emotion elicitation for each emotion category. Lubis et al. (2018, 2019) proposed an emotion-sensitive dialogue model which associates the emotion label of the target sentence for decoding. This model is more effective than the traditional hierarchical recurrent encoder-decoder Serban et al. (2016) when trained on a manually-collected corpus containing responses of positive emotion elicitation. However, as data with human annotation is hard to collect, their method is less applicable in practice.

Figure 2: Overview of the proposed Emotion Eliciting Machine (EEM).

Our method differs from existing emotion eliciting models in two ways. First, we propose a distant labeling method to obtain weak labels that indicate the emotion changes of the user. Our model can be trained with such data and is thus free of human annotation. Second, we compute an eliciting factor based on the value of the weak label. The eliciting factor reflects the degree of emotion change, which helps our model achieve fine-grained emotion elicitation.

3 Proposed Method

3.1 Problem Formulation and the Overall Model

Emotion eliciting chatbots aim at generating a response to improve the emotion status of the user (positive emotion elicitation). Specifically, we consider a dialogue with $N$ pairs of utterances $(U_1, R_1, \cdots, U_i, R_i, \cdots, U_N, R_N)$, where $U_i$ is the input message given by the user at the $i$-th turn, and $R_i$ is the corresponding response generated by the chatbot. We denote by $S=\{s_i \mid i\in[1,N]\}$ the emotional status sequence of the user in the dialogue, where $s_i\in[0,1]$ denotes the positivity of the user’s emotion at the $i$-th turn.

Then, the task of positive-emotion eliciting response generation can be formulated as: generating the response $R_i$ to improve the user’s emotional positivity $s_{i+1}$ based on the input message $U_i$ and the previous user status $s_i$. To simplify the notation, we use the triplet $\{U_1, R_1, U_2\}$ to denote three consecutive utterances in the dialogue, where $U_1$ is the input message from a user, $R_1$ is the corresponding response from the chatbot, and $U_2$ is the next input message/reaction from the user.
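As an illustration of the triplet notation, the following minimal sketch slices an alternating dialogue into $(U_1, R_1, U_2)$ triplets. The function name `extract_triplets` and the flat-list representation of a dialogue are assumptions made for this example; they are not taken from the original pipeline.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]


def extract_triplets(utterances: List[str]) -> List[Triplet]:
    """Slice an alternating dialogue [U_1, R_1, U_2, R_2, ...] into triplets.

    Assumption: even positions hold the user's messages and odd positions
    hold the other speaker's responses.
    """
    triplets = []
    # Step by 2 so that every triplet starts at a user message.
    for i in range(0, len(utterances) - 2, 2):
        triplets.append((utterances[i], utterances[i + 1], utterances[i + 2]))
    return triplets


dialogue = [
    "i failed my exam",
    "that is rough, what happened?",
    "i did not sleep enough",
    "rest up, you will get it next time",
    "thanks, that helps",
]
triplets = extract_triplets(dialogue)
```

In this sketch the user's reaction in one triplet is reused as the input message of the next one, so a dialogue with $k$ user turns yields $k-1$ triplets.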

In this paper, we propose leveraging all triplets (not just triplets with $s_2>s_1$) for training the response generation model. We introduce an emotion eliciting factor on top of a dual-generator structure to generate responses that can balance positive and negative feelings according to the emotion changes. The positive generator aims to emphasize positive emotions during the response generation process, while the negative generator focuses on the negative ones. Our method learns to dynamically balance the influences of the positive and negative generators by an emotion eliciting factor during the training procedure.

The detailed structure of the proposed method is shown in Figure 2. First, we compute an emotion eliciting factor $\lambda$ to capture the emotion changes after the user receives the response $R_1$ (Section 3.2). In the training stage, this factor is computed based on the emotion scores of $U_1$ and $U_2$. In the test stage, as $U_2$ is unavailable, $\lambda$ can be set arbitrarily to achieve positive emotion elicitation in different degrees. After capturing the emotion changes, we design a dual structure to model the response generation process on both the positive and negative sides. This structure contains an encoder with dual attentions and a dual decoder. The influence of both attention and decoder is determined by $\lambda$ so as to provide fine-grained emotion elicitation (Section 3.3). In the remaining part of the section, we will introduce each component in detail.

3.2 Emotion Eliciting Factor $\lambda$

The emotion eliciting factor $\lambda\in[0,1]$ is a continuous variable that represents the degree of positive emotion elicitation, where 1 indicates the most positive, and 0 is the most negative. Intuitively, the emotion eliciting factor is relevant to: (1) the user’s emotion status before receiving $R_1$, namely $s_1$ (the status of $U_1$); and (2) the change of the user’s emotion status after receiving $R_1$, namely ($s_2-s_1$).

The computation of the factor $\lambda$ relies on the change of the user’s emotion status, which is represented by the difference between the emotion scores of the user’s utterances before and after the chatbot’s response. Specifically, we apply a pre-trained emotion polarity classifier to automatically obtain the emotional positiveness scores $s_1, s_2\in[0,1]$ of $U_1$ and $U_2$ to represent the emotion status, respectively. Then $\lambda$ can be computed as:

$$\lambda = \mu\cdot s_2 + (1-\mu)\cdot \Delta s', \tag{1}$$
$$\Delta s' = \frac{s_2 - s_1 + 1}{2} \in [0,1], \tag{2}$$

where the weight $\mu$ is learned by a feed-forward neural network on $s_2$ and $\Delta s'$ to model the implicit interactions between the emotional characteristics of $U_1$ and $U_2$ as:

$$\mu = \sigma(w_1 s_2 + w_2 \Delta s' + b), \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function, and $w_1$, $w_2$, and $b$ are parameters. The computation process of $\lambda$ is shown in the left side of Figure 2. In the training stage, it is determined by the emotional statuses $s_1$ and $s_2$, which can be obtained by the pre-trained emotion classifier. In this way, we save considerable human labeling effort. Our experiments (Section 4.1) will show that these weak labels are highly consistent with human labels, which supports the performance of our model. In the inference stage, $\lambda$ can be arbitrarily given to control the polarity and degree of emotion elicitation. For example, to achieve the most effective positive-emotion elicitation in practical applications, we can set $\lambda=1$.
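The computation in Eqs. (1)-(3) can be sketched in a few lines. This is a minimal illustration only: the parameter values below are hypothetical, whereas in the model $w_1$, $w_2$, and $b$ are learned, and the function names are chosen for this example.

```python
import math


def normalized_delta(s1: float, s2: float) -> float:
    """Eq. (2): map the change s2 - s1 from [-1, 1] to [0, 1]."""
    return (s2 - s1 + 1.0) / 2.0


def eliciting_factor(s1: float, s2: float, w1: float, w2: float, b: float) -> float:
    """Eqs. (1) and (3): the gate mu mixes s2 and the normalized change."""
    delta = normalized_delta(s1, s2)
    mu = 1.0 / (1.0 + math.exp(-(w1 * s2 + w2 * delta + b)))  # sigmoid
    return mu * s2 + (1.0 - mu) * delta


# Hypothetical parameter values, for illustration only.
lam_up = eliciting_factor(s1=0.2, s2=0.9, w1=0.5, w2=-0.3, b=0.1)
lam_down = eliciting_factor(s1=0.9, s2=0.2, w1=0.5, w2=-0.3, b=0.1)
```

Because $\mu\in(0,1)$, $\lambda$ is a convex combination of $s_2$ and $\Delta s'$ and therefore always lies between the two, so it stays in $[0,1]$ whatever the learned parameters are.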

3.3 Dual Generator

The value of $\lambda$ reflects the degree of positive emotion elicitation in $R_1$. We design a dual structure to learn the generation of $R_1$ from both the positive and negative sides with the help of $\lambda$. For example, a more positive emotion eliciting sample (with larger $\lambda$) will contribute more to optimizing the positive generator, while the less positive ones have more influence on the negative generator. In such a dual structure, all samples with different values of $\lambda$ can be used for training, which makes full use of the whole dataset. On the other hand, by tuning the generators on both sides, our model can achieve fine-grained emotion elicitation that is adaptive to the user’s input message.

Encoder with Dual Attention

During chatting, people usually pay attention to different context when constructing their messages. Therefore, we propose a two-headed attention mechanism to capture the different attention patterns from both positive and negative elicitation.

We adopt the attention function from Transformer Vaswani et al. (2017) for its computational efficiency:

$$e^{(\text{mode})}_{t,j} = \frac{1}{\sqrt{d_h}}\left(\mathbf{W}^{(\text{mode})}_K \mathbf{h}_j\right)^{\top}\mathbf{W}^{(\text{mode})}_Q \mathbf{z}_{t-1}, \tag{4}$$
$$\alpha^{(\text{mode})}_{t,j} = \text{softmax}\left(\left[\cdots, e^{(\text{mode})}_{t,k}, \cdots\right]\right)_{k=j}, \tag{5}$$
$$\mathbf{c}^{(\text{mode})}_t = \sum_{j=1}^{n} \alpha^{(\text{mode})}_{t,j}\mathbf{W}^{(\text{mode})}_V \mathbf{h}_j, \tag{6}$$

where $\mathbf{h}$ and $\mathbf{z}$ are the outputs of the encoder and the decoder respectively; $\mathbf{W}^{(\text{mode})}_{\{K,V\}}\in\mathbb{R}^{d_h\times d_h}$ and $\mathbf{W}^{(\text{mode})}_Q\in\mathbb{R}^{d_h\times d_z}$ are parameters; $d_h$ and $d_z$ are the dimensions of $\mathbf{h}$ and $\mathbf{z}$; $\text{mode}\in\{\text{pos},\text{neg}\}$; $\mathbf{c}_t^{(\text{pos})}$ and $\mathbf{c}_t^{(\text{neg})}$ are the context vectors acquired by positive attention and negative attention at time step $t$; and $n$ is the number of tokens in the input message.

Then the two context vectors are combined based on the emotion eliciting factor ($\lambda$) to get the integrated vector:

$$\mathbf{c}_t = \lambda\,\mathbf{c}_t^{(\text{pos})} + (1-\lambda)\,\mathbf{c}_t^{(\text{neg})}. \tag{7}$$
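To make Eqs. (4)-(7) concrete, here is a minimal pure-Python sketch of one attention head per mode followed by the $\lambda$-weighted mixture. The toy weights (identity matrices, with a negated query projection for the negative head) are assumptions picked so that the two heads attend differently; they do not come from the trained model.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs: Vector) -> Vector:
    top = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h: List[Vector], z_prev: Vector,
           w_k: Matrix, w_q: Matrix, w_v: Matrix) -> Vector:
    """Eqs. (4)-(6) for one mode: scaled dot-product attention over encoder states."""
    d_h = len(h[0])
    query = matvec(w_q, z_prev)
    scores = [
        sum(k * q for k, q in zip(matvec(w_k, h_j), query)) / math.sqrt(d_h)
        for h_j in h
    ]
    alphas = softmax(scores)
    values = [matvec(w_v, h_j) for h_j in h]
    return [sum(a * v[i] for a, v in zip(alphas, values)) for i in range(d_h)]


def dual_context(lam: float, c_pos: Vector, c_neg: Vector) -> Vector:
    """Eq. (7): mix the two context vectors with the eliciting factor."""
    return [lam * p + (1.0 - lam) * n for p, n in zip(c_pos, c_neg)]


identity = [[1.0, 0.0], [0.0, 1.0]]
neg_identity = [[-1.0, 0.0], [0.0, -1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0]]  # two input tokens, d_h = 2
z_prev = [1.0, 0.0]                        # previous decoder state, d_z = 2

c_pos = attend(encoder_states, z_prev, identity, identity, identity)
c_neg = attend(encoder_states, z_prev, identity, neg_identity, identity)
c_mixed = dual_context(0.5, c_pos, c_neg)
```

With these toy weights the positive head favors the first token and the negative head favors the second, so setting $\lambda=1$ recovers the positive context exactly and $\lambda=0.5$ averages the two.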

Dual Decoder

In addition to different attentions on the context, the emotional effect of an utterance is dependent on the interaction of both the emotionally-positive pattern and the emotionally-negative pattern in the user’s mind. Activations of the positive and the negative thoughts of people are diverse due to the variety of their characteristics.

To mimic this mechanism, we also apply a dual structure on the decoder side. Our dual decoder consists of two parallel RNN-based decoders, one of which (referred to as the positive decoder) models the emotionally improving pattern, while the other (referred to as the negative decoder) models the emotionally suppressing pattern when generating a response. Concretely, the two decoders share the same network structure and the same inputs:

$$\mathbf{z}_t^{\text{pos}} = \text{GRU}_{\text{de}}^{\text{pos}}\left(\mathbf{z}_{t-1}, [\hat{\mathbf{y}}_{t-1}; \mathbf{c}_t]\right), \tag{8}$$
$$\mathbf{z}_t^{\text{neg}} = \text{GRU}_{\text{de}}^{\text{neg}}\left(\mathbf{z}_{t-1}, [\hat{\mathbf{y}}_{t-1}; \mathbf{c}_t]\right), \tag{9}$$

where $[;]$ is the concatenation operation, $\mathbf{c}_t$ is obtained from Eq. (7), and $\hat{\mathbf{y}}_{t-1}$ is the embedding vector of the target word at time step $t-1$. Finally, we use the combination of outputs from both decoders to predict the word according to the eliciting factor $\lambda$:

$$\mathbf{z}_t = \lambda\,\mathbf{z}_t^{\text{pos}} + (1-\lambda)\,\mathbf{z}_t^{\text{neg}}, \tag{10}$$
$$y_t \sim \mathbf{o}_t = \text{softmax}(\mathbf{W}_o \mathbf{z}_t), \tag{11}$$

where $\mathbf{W}_o$ is the output projection matrix.

As seen in Eqs. (8)-(9), the integrated hidden state of the dual decoder $\mathbf{z}_{t-1}$ is fed into both the positive and the negative decoders for the next time-step computation. Since the two decoders share the same inputs at each time step, the difference between their outputs originates only from the different values of their trainable parameters. In this way, the two decoders are forced to capture the distinct emotional patterns imposed by the controlling condition $\lambda$.
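One decoding step of Eqs. (8)-(10) can be sketched as follows. The GRU cell below is one standard formulation with biases omitted for brevity, and the random toy parameters and dimensions are assumptions for illustration; the output projection and softmax of Eq. (11) are left out.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_gru(input_size: int, hidden_size: int, rng: random.Random) -> Dict[str, Matrix]:
    """Random toy parameters: w_* act on the input, u_* on the previous state."""
    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {name: mat(hidden_size, input_size if name[0] == "w" else hidden_size)
            for name in ("w_r", "u_r", "w_u", "u_u", "w_n", "u_n")}


def gru_step(p: Dict[str, Matrix], h_prev: Vector, x: Vector) -> Vector:
    """One step of a GRU cell: reset gate r, update gate u, candidate n."""
    r = [sigmoid(a + b) for a, b in zip(matvec(p["w_r"], x), matvec(p["u_r"], h_prev))]
    u = [sigmoid(a + b) for a, b in zip(matvec(p["w_u"], x), matvec(p["u_u"], h_prev))]
    n = [math.tanh(a + ri * b)
         for a, ri, b in zip(matvec(p["w_n"], x), r, matvec(p["u_n"], h_prev))]
    return [(1.0 - ui) * ni + ui * hi for ui, ni, hi in zip(u, n, h_prev)]


def dual_decoder_step(lam: float, pos: Dict[str, Matrix], neg: Dict[str, Matrix],
                      z_prev: Vector, y_prev_emb: Vector,
                      c_t: Vector) -> Tuple[Vector, Vector, Vector]:
    """Eqs. (8)-(10): both decoders read the same inputs; outputs are mixed by lambda."""
    x = y_prev_emb + c_t  # list concatenation implements [y; c]
    z_pos = gru_step(pos, z_prev, x)
    z_neg = gru_step(neg, z_prev, x)
    z_t = [lam * a + (1.0 - lam) * b for a, b in zip(z_pos, z_neg)]
    return z_t, z_pos, z_neg


rng = random.Random(0)
pos_params = init_gru(input_size=4, hidden_size=3, rng=rng)
neg_params = init_gru(input_size=4, hidden_size=3, rng=rng)
z0 = [0.0, 0.0, 0.0]
z1, z1_pos, z1_neg = dual_decoder_step(
    0.8, pos_params, neg_params, z0, [0.1, -0.2], [0.3, 0.4])
```

The mixed state `z1` is what would be fed back as $\mathbf{z}_{t-1}$ at the next step, so the two cells never see different histories and differ only through their parameters.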

4 Experiments

4.1 Dataset and Preprocessing

We conduct experiments on the large scale Reddit dataset,³ which contains 3.7 billion conversation comments with dialogue context Henderson et al. (2019). We extract and preprocess 55.5 million $(U_1, R_1, U_2)$ dialogue triplets from the dataset after discarding the triplets that have sentences containing more than 20 words or non-ASCII characters. Then we filter out web symbols, normalize the punctuation, and convert all letters to lowercase.

³ https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit
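A minimal sketch of this filtering and normalization step is given below. The text does not specify what counts as a web symbol or how punctuation is normalized, so the two regular expressions (URL removal and collapsing repeated punctuation) are assumptions, as are all function names.

```python
import re
from typing import Optional, Tuple

MAX_WORDS = 20
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")  # assumed meaning of "web symbols"
REPEATED_PUNCT = re.compile(r"([!?.,])\1+")         # assumed punctuation normalization


def keep(sentence: str) -> bool:
    """Filter stated in the text: at most 20 words and ASCII characters only."""
    return sentence.isascii() and len(sentence.split()) <= MAX_WORDS


def clean(sentence: str) -> str:
    sentence = URL_PATTERN.sub(" ", sentence)
    sentence = REPEATED_PUNCT.sub(r"\1", sentence)
    return " ".join(sentence.lower().split())  # lowercase and squeeze whitespace


def preprocess_triplet(triplet: Tuple[str, str, str]) -> Optional[Tuple[str, ...]]:
    """Return the cleaned triplet, or None if any sentence is discarded."""
    if not all(keep(s) for s in triplet):
        return None
    return tuple(clean(s) for s in triplet)


cleaned = preprocess_triplet(("Hello!!! see https://example.com", "OK...", "Thanks"))
```

The discard test runs on the raw sentences before cleaning, matching the order in which the two steps are described above.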

Emotion Labeling

Following previous approaches Zhou et al. (2018); Song et al. (2019), we train a classifier to get the emotion scores of the utterances in all triplets of the dataset. We leverage the pre-trained DeepMoji model as the classifier and fine-tune it on the SS-Twitter dataset, which is manually labeled with emotion categories Felbo et al. (2017).

To evaluate the performance of the classifier, we conduct a manual evaluation to estimate its accuracy on our conversation dataset, and the classifier achieves an accuracy of 0.71 (see more details in the Appendix). Then, the obtained classifier is used to score each utterance in all triplets. All the triplets are randomly split into training, validation, and test sets with a ratio of 8:1:1.
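The 8:1:1 random split can be sketched as below. The fixed seed and the function name are assumptions added for reproducibility of the example; the paper does not state how the shuffling was seeded.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_811(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into training/validation/test sets with an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    n_valid = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


train_set, valid_set, test_set = split_811(range(1000))
```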

4.2 Evaluation Metrics

We adopt both automatic and human evaluation metrics to verify the effectiveness of our proposed method.

(1) Automatic Evaluation

Quality of Response: At the content level, we adopt perplexity (PPL) for evaluating the quality of generated responses. This metric is recommended and widely used to evaluate generative dialogue systems Pietquin and Hastie (2013); Zhou et al. (2018); Lubis et al. (2018). It is worth noting that reference-based evaluation metrics such as BLEU or embedding similarity are not applicable for our experiments because we seek to generate the response with the most positive elicitation, which may be different from the original one.
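For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from a model; the function name is chosen for this example.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the average negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to each of 5 tokens has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 5)
```

Lower values mean the model finds the responses less surprising, which is why a lower PPL is read as better in Table 1.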

Emotion Eliciting Impact: As there is no existing automatic metric for evaluating emotion elicitation, we propose an indirect approach for the evaluation, which is shown in Figure 3. The basic idea is to measure the user’s emotion changes after receiving the response given by the chatbot. To achieve this, we first train a standard Hierarchical Recurrent Encoder-Decoder (STD-HRED) model Sordoni et al. (2015a); Serban et al. (2016) on message-response pairs in reverse (namely, using the original response as the input and the original message as the target).⁴ After collecting the responses $\hat{R}_1$ generated by all models, we generate the next utterance $\hat{U}_2$ with STD-HRED, taking $U_1$ and $\hat{R}_1$ as inputs. Then we regard $\hat{U}_2$ as a probable reaction of the user after receiving the response $\hat{R}_1$, and use the pre-trained emotion classifier to compute the emotion score $\hat{s}_2$ of $\hat{U}_2$. Finally, the normalized increment $\Delta\hat{s}'$ between the emotion scores of $\hat{U}_2$ and $U_1$ can be calculated as well. Since $\hat{s}_2$ and $\Delta\hat{s}'$ are estimates of $s_2$ and $\Delta s'$, they can be used as the performance indicators of emotion elicitation.

⁴ The model is trained on the test set of dialogue triplets to avoid information leakage during evaluation.
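The evaluation loop described above can be sketched as follows. The three callables are hypothetical stand-ins for the model under test, the STD-HRED user simulator, and the emotion classifier, and the increment is assumed to use the same normalization as Eq. (2).

```python
from statistics import mean
from typing import Callable, List, Tuple


def evaluate_elicitation(
    messages: List[str],
    respond: Callable[[str], str],             # model under test: U1 -> R1_hat
    simulate_user: Callable[[str, str], str],  # stand-in for STD-HRED: (U1, R1_hat) -> U2_hat
    score: Callable[[str], float],             # stand-in for the classifier, output in [0, 1]
) -> Tuple[float, float]:
    """Return the mean s2_hat and the mean normalized increment over all messages."""
    s2_hats, deltas = [], []
    for u1 in messages:
        r1_hat = respond(u1)
        u2_hat = simulate_user(u1, r1_hat)
        s1, s2_hat = score(u1), score(u2_hat)
        s2_hats.append(s2_hat)
        deltas.append((s2_hat - s1 + 1.0) / 2.0)  # same normalization as Eq. (2)
    return mean(s2_hats), mean(deltas)


# Toy stand-ins, purely for illustration.
def toy_respond(u1: str) -> str:
    return "i hope you feel better soon"


def toy_user(u1: str, r1_hat: str) -> str:
    return "thanks"


def toy_score(text: str) -> float:
    return 0.9 if "thanks" in text else 0.2


avg_s2_hat, avg_delta = evaluate_elicitation(["i am sad"], toy_respond, toy_user, toy_score)
```

Keeping the simulator and the classifier as injected callables makes it clear that the metric measures the simulated reaction, not the response itself.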

(2) Human Evaluation

Due to the complexity and diversity of human language, following existing works Lubis et al. (2018, 2019), we also conduct a human evaluation. Specifically, we randomly sample 200 utterances from the test set. For each utterance, we collect the generated responses from all models. Then we hire five experienced annotators to label them. The annotators are asked to score the generated responses by stating their agreement with two statements: 1) Content score ($s_c$): the response is appropriate and natural to the post and is likely to be produced by a human. 2) Emotional eliciting score ($s_e$): the response elicits a positive emotion. The rating scale for both criteria is 0, 1, and 2, which indicate disagreement, partial agreement, and agreement, respectively.⁵

⁵ The Fleiss’ kappa coefficients for the content score and emotional impact score are 0.473 and 0.631, indicating “Moderate agreement” and “Substantial agreement”, respectively.
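Inter-annotator agreement of this kind is commonly measured with Fleiss' kappa, which can be computed as in the sketch below. The input layout (one row per rated item, one count per category) and the toy ratings are assumptions for illustration; they are not the annotation data of this study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][k] = number of raters who assigned item i to category k.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[k] for row in counts) / (n_items * n_raters)) ** 2
        for k in range(n_cats)
    )
    return (p_bar - p_e) / (1.0 - p_e)


# Five raters, three categories (scores 0/1/2), perfect agreement on every item.
perfect = fleiss_kappa([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.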

Figure 3: Automatic evaluation of emotion elicitation

4.3 Baselines

We compare our method with the following baselines:

(1) EnDec Sutskever et al. (2014): A traditional Encoder-Decoder model with attention mechanism. No extra emotional elicitation module is injected.

(2) Emo-HRED Lubis et al. (2018, 2019): The state-of-the-art model for eliciting positive emotion in conversation. It is based on the structure of the HRED, with a latent emotion variable to be emotion-sensitive.

(3) Emb_$s_2$ and (4) Emb_$\Delta s'$: Naive baselines combining the EnDec model with the embedded representations of $s_2$ and $\Delta s'$, respectively. The embeddings serve as an extra input of the decoder to influence the generation process.

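| Model | PPL | $\hat{s}_2$ | $\Delta\hat{s}'$ | $s_c$ | $s_e$ |
|---|---|---|---|---|---|
| EnDec | 48.94 | 0.383 | -0.027 | 1.839 | 1.057 |
| EnDec (on positive) | 51.19 | 0.452 | 0.042 | 1.612 | 1.324 |
| Emo-HRED | 47.00 | 0.387 | -0.023 | 1.738 | 1.132 |
| Emo-HRED (on positive) | 47.66 | 0.458 | 0.047 | 1.673 | 1.348 |
| Emb_$s_2$ | 48.79 | 0.481 | 0.071 | 1.797 | 1.359 |
| Emb_$\Delta s'$ | 48.87 | 0.460 | 0.050 | 1.823 | 1.365 |
| EEM ($\lambda=1$) | 46.78 | 0.574 | 0.164 | 1.840 | 1.615 |
| $\lambda=s_2$ | 48.04 | 0.500 | 0.090 | 1.745 | 1.447 |
| $\lambda=\Delta s'$ | 46.72 | 0.543 | 0.133 | 1.765 | 1.515 |
| w/o Dual Attn. | 46.77 | 0.541 | 0.131 | 1.633 | 1.474 |
| w/o Dual Dec. | 48.81 | 0.479 | 0.069 | 1.763 | 1.415 |

Table 1: Evaluation results of all the models. PPL, $\hat{s}_2$, and $\Delta\hat{s}'$ are automatic metrics; $s_c$ and $s_e$ are human evaluation scores. The rows below EEM ($\lambda=1$) are ablations of EEM.
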
Figure 4: Curves of $\Delta\hat{s}'$ with different emotion eliciting factors.

| Source | $\lambda$ | Text |
|---|---|---|
| User Input | | I’m so sad, but I can’t cry. |
| EnDec | | I don’t think you know what the word means. |
| Emo-HRED | | Me too, but I don’t want to cry. |
| EEM | 1.0 | I hope you feel better soon! |
| EEM | 0.8 | I hope you get to see the rest of the world. |
| EEM | 0.6 | I feel bad for you. I hope you feel better soon. |
| EEM | 0.4 | I don’t think you know what you’re talking about. |
| EEM | 0.2 | I don’t think you understand what you’re talking about. |
| EEM | 0.0 | Why do you hate yourself so much? |

Table 2: Sample responses generated by EnDec, Emo-HRED, and EEM with different emotion eliciting factors.

4.4 Overall Results

To maximize the effect of positive elicitation in the generated responses, the value of $\lambda$ in Emb_$s_2$, Emb_$\Delta s'$, and all ablation models is set to 1.0 in the test stage.

The experimental results are shown in the upper part of Table 1. We can see: (1) EEM outperforms all the baselines significantly in terms of positive-emotion elicitation in both automatic and human evaluations. Specifically, EEM outperforms the previous best baseline by 0.114 on both $\hat{s}_2$ and $\Delta\hat{s}'$. This indicates that EEM can improve the user’s emotion status by generating responses with the most positive effect of elicitation. (2) The values of $\Delta\hat{s}'$ obtained by both EnDec and Emo-HRED are negative. The reason is that both methods are originally designed for datasets with only positive eliciting samples. Such datasets rely heavily on human annotation, which is often unavailable in real scenarios. (3) To mimic a similar training process, we also train EnDec and Emo-HRED on the positive subset of the data (i.e., $\Delta s'>0$ and $s_2>0.5$). The corresponding experimental results are shown in Table 1 with an additional mark “on positive”. We can observe that the models can also achieve positive emotion elicitation ($\Delta\hat{s}'>0$). However, there is still a large gap between these models and our EEM. This suggests that: a) our proposed emotion eliciting factor can reflect the different degrees of emotion elicitation, and thus helps the model achieve fine-grained elicitation and make full use of the dataset; b) training only on a positive eliciting dataset is insufficient for building a positive emotion eliciting chatbot, and a well-designed structure is necessary for this task. (4) The effect of positive emotion elicitation by Emb_$s_2$ and Emb_$\Delta s'$ is very limited. This demonstrates that simply integrating the emotion information into the model is insufficient for generating desirable responses. Similar problems are also reported by Zhou et al. (2018). (5) According to the human evaluation results, EEM is also superior in positive emotion elicitation. Interestingly, the general quality of the generated responses (measured by $s_c$) is also improved. This suggests that considering the user’s emotion changes is an effective way to improve chatbots.

4.5 Ablation Study

We investigate the effect of each module in our model in this section. The results are shown in the lower side of Table 1.

First, we simplify $\lambda$ in EEM to $s_2$ or $\Delta s'$, respectively. The performance degradation demonstrates that the emotion eliciting effect is related to both the status of $U_2$ and the emotion change $\Delta s$, which is consistent with our assumption. In addition, compared with the models using $s_2$ or $\Delta s'$ as additional input (i.e., Emb_$s_2$ and Emb_$\Delta s'$), EEM achieves better results. This confirms the superiority of our proposed dual structures in emotion elicitation.

Second, we replace the dual attention and the dual decoder by a traditional attention and a traditional decoder, respectively. These two models are denoted as “w/o Dual Attn.” and “w/o Dual Dec.”. Experimental results show that both of the dual structures are helpful in improving performance. Specifically, the dual decoder brings more than 14% improvement in terms of emotion elicitation under both automatic and human evaluation. This demonstrates the effectiveness of our proposed dual structure.
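The relative gain quoted for the dual decoder can be checked directly against Table 1. The short calculation below uses the $\hat{s}_2$ and $s_e$ values of EEM ($\lambda=1$) and of the “w/o Dual Dec.” row; the helper name is chosen for this example.

```python
def relative_gain(full: float, ablated: float) -> float:
    """Relative improvement of the full model over an ablated one, in percent."""
    return 100.0 * (full - ablated) / ablated


# Values copied from Table 1: EEM (lambda = 1) versus "w/o Dual Dec.".
gain_s2_hat = relative_gain(0.574, 0.479)  # automatic metric s2_hat
gain_s_e = relative_gain(1.615, 1.415)     # human emotional eliciting score
```

Under these numbers the gain is roughly 19.8% on $\hat{s}_2$ and 14.1% on $s_e$, both above the 14% mark.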

Moreover, we conduct additional experiments to investigate the emotion eliciting performance of these models when given different emotion eliciting factors (i.e., $s_2$ for Emb_$s_2$, $\Delta s'$ for Emb_$\Delta s'$, and $\lambda$ for EEM) for the generation. As shown in Figure 4, the curves of Emb_$s_2$, Emb_$\Delta s'$, and EEM show a positive correlation between $\Delta\hat{s}'$ and the factor. In general, when given a bigger factor, these models tend to generate more positive eliciting responses accordingly. Among all the models, EEM shows the strongest response of $\Delta\hat{s}'$ to the factor, which indicates that EEM is more controllable and flexible in fine-grained emotion elicitation.

4.6 Case Study

We also provide an example of generated responses in Table 2. As can be seen, compared with the responses generated by EnDec and Emo-HRED, the responses generated by EEM with positive settings (i.e., $\lambda > 0.5$) are more user-concerned and achieve more positive emotion elicitation. Though some generated responses are not full of positive emotion themselves, they express a degree of empathy that effectively improves the emotional status of users. In addition, we present responses generated with different $\lambda$ (from 0.0 to 1.0 with an interval of 0.2). As can be seen, EEM can generate responses with different categories of emotion elicitation, and all the responses are appropriate to the inputs. This implies the controllability and flexibility of EEM in emotion elicitation.

The implementation details, the distribution of emotion scores over the dataset, the visualization of the dual attention, and more cases are provided in the Appendix.

5 Conclusion and Future Work

Endowing chatbots with the ability of positive emotion elicitation is valuable for improving the users’ experience. However, manually collecting training labels for this task is labour-intensive, making existing solutions less practical. In this paper, we propose an Emotion Eliciting Machine (EEM) trained on weak labels generated by a pretrained emotion classifier. We compute an emotion eliciting factor based on the weak labels, and propose a dual encoder-decoder structure to model the generation of responses on both the positive and the negative side based on the factor. Both the automatic and human evaluations on a large-scale corpus show that EEM outperforms the existing models in emotion eliciting conversation generation, with better controllability and flexibility. In future work, we plan to refine EEM for multi-turn conversation and to apply it to other text generation tasks.

Acknowledgement

This work was supported by the National Natural Science Foundation of China No. 61872370 and No. 61832017, the Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, and the Shandong Provincial Natural Science Foundation under Grant ZR2019ZD06.

References

  • Baheti et al. (2018) A. Baheti, A. Ritter, J. Li, and B. Dolan. Generating more interesting responses in neural conversation models with distributional constraints. In EMNLP, 2018.
  • Colombo et al. (2019) P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia. Affect-driven dialog generation. In NAACL-HLT, 2019.
  • Felbo et al. (2017) B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In EMNLP, 2017.
  • Fleiss (1971) J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 1971.
  • Hasegawa et al. (2013) T. Hasegawa, N. Kaji, N. Yoshinaga, and M. Toyoda. Predicting and eliciting addressee’s emotion in online dialogue. In ACL, 2013.
  • Henderson et al. (2019) M. Henderson, P. Budzianowski, I. Casanueva, S. Coope, D. Gerz, G. Kumar, N. Mrkšić, G. Spithourakis, P.-H. Su, I. Vulić, and T.-H. Wen. A repository of conversational datasets. In Proceedings of the First Workshop on NLP for Conversational AI, August 2019.
  • Huang et al. (2018) C. Huang, O. Zaiane, A. Trabelsi, and N. Dziri. Automatic dialogue generation with expressed emotions. In NAACL-HLT (Short Papers), 2018.
  • K. and B. (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Li et al. (2015) J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
  • Lowe et al. (2015) R. Lowe, N. Pow, I. Serban, and J. Pineau. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGdial, 2015.
  • Lubis et al. (2018) N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura. Eliciting positive emotion through affect-sensitive dialogue response generation: A neural network approach. In AAAI, 2018.
  • Lubis et al. (2019) N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura. Positive emotion elicitation in chat-based dialogue systems. TASLP, 27(4), 2019.
  • Niu and Bansal (2018) T. Niu and M. Bansal. Polite dialogue generation without parallel data. TACL, 6, 2018.
  • P. et al. (2019) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  • Picard (2000) R. W. Picard. Affective computing. MIT Press, 2000.
  • Pietquin and Hastie (2013) O. Pietquin and H. Hastie. A survey on metrics for the evaluation of user simulations. The Knowledge Engineering Review, 28(1), 2013.
  • Salovey and Mayer (1990) P. Salovey and J. D. Mayer. Emotional intelligence. Imagination, Cognition and Personality, 9(3), 1990.
  • Serban et al. (2016) I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, 2016.
  • Shantala et al. (2018) R. Shantala, G. Kyselov, and A. Kyselova. Neural dialogue system with emotion embeddings. In SAIC, 2018.
  • Shen and Feng (2020) L. Shen and Y. Feng. CDL: Curriculum dual learning for emotion-controllable response generation. In ACL, 2020.
  • Song et al. (2019) Z. Song, X. Zheng, L. Liu, M. Xu, and X.-J. Huang. Generating responses with a specific emotion in dialog. In ACL, 2019.
  • Sordoni et al. (2015a) A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. G. Simonsen, and J.-Y. Nie. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In CIKM, 2015.
  • Sordoni et al. (2015b) A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. A neural network approach to context-sensitive generation of conversational responses. In NAACL-HLT, 2015.
  • Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NeurIPS, 2014.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Wu et al. (2017) Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In ACL, 2017.
  • Yuan et al. (2019) C. Yuan, W. Zhou, M. Li, S. Lv, F. Zhu, J. Han, and S. Hu. Multi-hop selector network for multi-turn response selection in retrieval-based chatbots. In EMNLP-IJCNLP, 2019.
  • Zhou and Wang (2018) X. Zhou and W. Y. Wang. MojiTalk: Generating emotional responses at scale. In ACL, 2018.
  • Zhou et al. (2018) H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu. Emotional chatting machine: Emotional conversation generation with internal and external memory. In AAAI, 2018.

6 Appendix

In this section, we provide the implementation of our method, the evaluation of the emotion classifier, the distribution of emotion scores over the dataset, the visualization of the dual attention mechanism, and more cases generated by each model.

6.1 Implementation Details

Our model is implemented in PyTorch P. et al. (2019). The encoders and decoders are both 2-layer GRUs with 600 hidden units in each layer. The word embedding dimension is set to 300, and the vocabulary size is set to 30,000. The trainable parameters are optimized by the Adam optimizer K. and B. (2015) with a batch size of 512 and a learning rate of 1e-4. All models are trained for five epochs. In the inference stage, beam search with a beam size of five is applied for decoding responses.

6.2 Evaluation of the Emotion Classifier

200 utterances are randomly sampled for manual evaluation, and their emotion distribution is consistent with that of the whole dataset (the KL-divergence is 0.023). Each utterance is labeled as “positive”, “neutral”, or “negative” according to its emotion polarity by 5 annotators. Since the emotion score predicted by the classifier is continuous, we discretize it into “negative” (score between 0 and 0.35), “neutral” (score between 0.35 and 0.65), and “positive” (score between 0.65 and 1). As a result, the emotion classifier achieves 0.71 accuracy on average. The Fleiss’ kappa coefficient Fleiss (1971) of the 5 annotations is 0.592, indicating “moderate agreement”.
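The evaluation procedure above can be illustrated with a minimal sketch. The thresholds (0.35 and 0.65) come from the text; everything else is an assumption made for illustration: the function names are hypothetical, scores that fall exactly on a boundary are assigned to the upper bucket, and “accuracy on average” is read here as the mean of the per-annotator accuracies. The Fleiss’ kappa computation follows the standard definition of the statistic.

```python
CATEGORIES = ("negative", "neutral", "positive")


def discretize(score):
    """Map a continuous emotion score in [0, 1] to a polarity label.

    Thresholds follow the text; sending boundary values to the upper
    bucket is an assumption of this sketch.
    """
    if score < 0.35:
        return "negative"
    if score < 0.65:
        return "neutral"
    return "positive"


def mean_annotator_accuracy(scores, annotations):
    """Average, over annotators, of the classifier's agreement with each one.

    annotations[i][k] is the label given to utterance i by annotator k.
    Averaging over annotators is one possible reading of "on average".
    """
    predictions = [discretize(s) for s in scores]
    n_raters = len(annotations[0])
    per_rater = []
    for k in range(n_raters):
        hits = sum(p == votes[k] for p, votes in zip(predictions, annotations))
        per_rater.append(hits / len(predictions))
    return sum(per_rater) / n_raters


def fleiss_kappa(annotations, categories=CATEGORIES):
    """Standard Fleiss' kappa for a fixed number of raters per item."""
    n_items = len(annotations)
    n_raters = len(annotations[0])
    # counts[i][j]: number of raters who assigned category j to item i
    counts = [[votes.count(c) for c in categories] for votes in annotations]
    # Observed agreement, averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)
```

With real data, `scores` would hold the classifier outputs for the 200 sampled utterances and `annotations` the five labels collected for each of them.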

Figure 5: The distribution of the emotion scores over the Reddit dataset.

6.3 Distribution of Emotion Scores

The distribution of the emotion scores is shown in Figure 5. The proportions of “negative”, “neutral”, and “positive” data are around 47%, 34%, and 19%, respectively. If we follow existing work Hasegawa et al. (2013); Lubis et al. (2018, 2019) and generate binary labels for the data, only 36.67% of the data are labeled with positive emotion (i.e., emotion score $> 0.5$). In other words, more than 60% of the data in the dataset cannot be used for training, which is a potential reason for the weak performance of these methods.
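The contrast between the three-way distribution and the binary labeling can be made concrete with a short sketch. It assumes the bucket thresholds of Section 6.2 (0.35 and 0.65, with boundary values assigned to the upper bucket) and the binary threshold of 0.5 stated above; the function names and the toy scores are hypothetical and not taken from the dataset.

```python
def bucket_shares(scores):
    """Fraction of scores in the negative / neutral / positive buckets."""
    n = len(scores)
    negative = sum(s < 0.35 for s in scores) / n
    neutral = sum(0.35 <= s < 0.65 for s in scores) / n
    positive = sum(s >= 0.65 for s in scores) / n
    return negative, neutral, positive


def binary_positive_share(scores, threshold=0.5):
    """Fraction of the data that a binary labeling keeps as positive."""
    return sum(s > threshold for s in scores) / len(scores)


# Toy scores for illustration only
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
shares = bucket_shares(toy_scores)
kept = binary_positive_share(toy_scores)
unused = 1.0 - kept  # data that a positive-only training scheme discards
```

Applied to the figures reported above, a kept share of 36.67% leaves 63.33% of the data unused.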

Figure 6: Visualization of Dual Attention

6.4 Visualization of Attention

To demonstrate the effectiveness and interpretability of the dual attention mechanism, we visualize the attention weights of the positive attention and the negative attention in the decoding process. The results are illustrated in Figure 6.

In this case, we input the message “Can you help me?” and set different values of the emotion eliciting factor $\lambda$ for EEM. As can be seen in subfigures (a) and (c), when $\lambda$ is set to 1.0, the positive attention is fully activated and mainly focuses on “help”, “you”, and “me” in the input. The combination of these words implies the user’s emotional demand (seeking help). Therefore, the activation of the positive attention contributes to generating a user-oriented response that elicits positive emotion: “Sure, I can help you out with that”.

On the other hand, when $\lambda$ is set to 0.0, the negative attention drives the decoder to focus on “can” and “me” (as shown in subfigures (b) and (d)), which implies rationality but more self-orientation and less relevance to the user’s emotional status. As a result, the model generates an emotionally suppressing response: “I don’t know if you can help me”.

Based on these cases, we can see that EEM can generate different responses according to the emotion eliciting factor $\lambda$ by activating different modules in the dual structure. Therefore, it is better at controlling the effect of positive emotion elicitation during the response generation process.
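The behaviour described in this subsection can be illustrated with a minimal sketch. It assumes that the two attention contexts are combined by a convex combination weighted by $\lambda$, which matches the two endpoints discussed above (only the positive attention is active at $\lambda=1.0$, only the negative attention at $\lambda=0.0$) but is not necessarily the exact formulation used by the model. All function names and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, encoder_states):
    """Attention context: weighted sum of the encoder states."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]


def dual_context(lam, pos_scores, neg_scores, encoder_states):
    """Mix the positive and negative contexts with the factor lam.

    Assumption of this sketch: context = lam * c_pos + (1 - lam) * c_neg.
    """
    c_pos = attend(pos_scores, encoder_states)
    c_neg = attend(neg_scores, encoder_states)
    return [lam * p + (1.0 - lam) * n for p, n in zip(c_pos, c_neg)]


# Toy example: two encoder states, each attention branch prefers a different one
states = [[1.0, 0.0], [0.0, 1.0]]
pos_scores = [10.0, 0.0]
neg_scores = [0.0, 10.0]
fully_positive = dual_context(1.0, pos_scores, neg_scores, states)
fully_negative = dual_context(0.0, pos_scores, neg_scores, states)
balanced = dual_context(0.5, pos_scores, neg_scores, states)
```

Under this assumption, intermediate values of $\lambda$ interpolate smoothly between the two contexts, which is one way to obtain the fine-grained control described in the text.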

6.5 Case Study

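User Input | Model | Response
Can you help me? | EnDec | I can help you out.
Can you help me? | Emo-HRED | I don’t know what you’re talking about.
Can you help me? | EEM ($\lambda=1.0$) | Sure, I can help you out with that.
Can you help me? | EEM ($\lambda=0.8$) | I love that game.
Can you help me? | EEM ($\lambda=0.6$) | I can help you out.
Can you help me? | EEM ($\lambda=0.4$) | I can help you out.
Can you help me? | EEM ($\lambda=0.2$) | I don’t know what to do.
Can you help me? | EEM ($\lambda=0.0$) | I don’t know if you can help me.
Let’s go to play soccer together. | EnDec | I would love to play soccer.
Let’s go to play soccer together. | Emo-HRED | What do you want?
Let’s go to play soccer together. | EEM ($\lambda=1.0$) | I like to play with you!
Let’s go to play soccer together. | EEM ($\lambda=0.8$) | Sure, I’ll add you now.
Let’s go to play soccer together. | EEM ($\lambda=0.6$) | Let’s do it.
Let’s go to play soccer together. | EEM ($\lambda=0.4$) | I’m not a fan of soccer.
Let’s go to play soccer together. | EEM ($\lambda=0.2$) | I don’t know what to do.
Let’s go to play soccer together. | EEM ($\lambda=0.0$) | I don’t know if you can help me.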
Table 3: Sample responses generated by EnDec, Emo-HRED, and EEM with different emotion eliciting factors.

In addition to the case shown in Section 4.6, we provide two more cases in Table 3. Similarly, EEM with $\lambda=1$ performs best on positive emotion elicitation.

In particular, in the second case, EEM with $\lambda=1$ generates the response “I like to play with you!”. Compared with the responses generated by other models, the personal pronoun “you” reduces the sense of distance between the chatbot and the user, and can thus improve the user’s emotion status directly. The responses generated by the baseline methods are fluent and natural, but they have less effect on the user’s emotion (for example, “I don’t know what you’re talking about” generated by Emo-HRED in the first case). On the other hand, as $\lambda$ decreases from 1.0 to 0.0, the emotion status of the generated response changes from positive to negative. This demonstrates that EEM can achieve fine-grained control of positive emotion elicitation.