
1 Institute of Information Engineering, Chinese Academy of Sciences, China
2 School of Cyber Security, University of Chinese Academy of Sciences, China

Do You Know My Emotion? Emotion-Aware Strategy Recognition towards a Persuasive Dialogue System

Wei Peng 1,2 (0000-0001-8179-1577), Yue Hu 1,2 (corresponding author), Luxi Xing 1,2, Yuqiang Xie 1,2, Yajing Sun 1,2
Abstract

The persuasive strategy recognition task requires the system to recognize the strategy adopted by the persuader according to the conversation. However, previous methods mainly focus on the contextual information, and little is known about incorporating the psychological feedback, i.e., the emotion of the persuadee, to predict the strategy. In this paper, we propose a Cross-channel Feedback memOry Network (CFO-Net) that leverages the emotional feedback to iteratively measure the potential benefits of strategies and incorporate them into the contextual-aware dialogue information. Specifically, CFO-Net designs a feedback memory module, including a strategy pool and a feedback pool, to obtain an emotion-aware strategy representation. The strategy pool stores historical strategies, and the feedback pool updates the strategy weights based on the emotional feedback. Furthermore, a cross-channel fusion predictor is developed to enable mutual interaction between the emotion-aware strategy representation and the contextual-aware dialogue information for strategy recognition. Experimental results on PersuasionForGood confirm that the proposed CFO-Net effectively improves the M-F1 score from 61.74 to 65.41.

Keywords:
Persuasive dialogue · Emotional feedback · Strategy recognition.

1 Introduction

Persuasive conversation is an essential area of dialogue systems and has seen a boom in recent NLP research [4, 32, 10, 35]. In a dyadic persuasive dialogue, one party, the persuader, tries to induce the other party, the persuadee, to believe something or to do something [14] through a series of persuasion strategies [34]. However, recognizing the persuasion strategy is challenging for natural language understanding, since it requires a deeper understanding of the conversation, its semantic information, and even the psychological feedback of the speakers [4, 27, 25]. Furthermore, dialogue systems can utilize the predicted historical strategy chains to guide the dialogue generation task.

Figure 1: Statistics from the dataset showing the relationships between emotion and strategy.

To predict persuasive strategies, mainstream studies [4, 7] focused on the conversational context to recognize strategies. Some work, such as [6] and [30], considered resistance strategies to model strategic conversations. However, analyzing and understanding the speaker's psychological emotion is essential [22, 23] for fully understanding the conversation and helping the persuader adopt appropriate strategies. Previous methods do not take the persuadee's emotional feedback into account and thereby fail to model emotion-aware persuasive dialogue systems. To illustrate the importance of emotional feedback, Fig. 1 presents statistics from the dataset on the relationship between emotion and strategy. The plane is divided into four quadrants. As shown in quadrant I, if the persuadee shows positive emotion after strategy $\mathcal{X}$ is used, the probability that strategy $\mathcal{X}$ continues to be used in the following conversation is 63%. Similarly, in quadrant III, when the persuadee shows negative emotion, the probability that the strategy is not used in the subsequent conversation is 75%.

Figure 2: An example comparing previous work (a), which utilizes the contextual information, with our work (b), which considers the emotional feedback of the persuadee to recognize the strategy. ⓝ indicates the order of processes.

The statistical results indicate that if a strategy obtains positive feedback, it can be given priority in the future; on the contrary, it should receive less attention [2, 28]. To present the discrepancy between the previous work (a) and ours (b), an example is shown in Fig. 2. Specifically, in the third turn, (a) outputs the wrong prediction personal-related inquiry, which received negative emotional feedback in the previous turn. Therefore, it would be more appropriate to give priority to a different strategy based on the emotional feedback. This leaves us with a question: how can emotional feedback be modeled and incorporated into the contextual dialogue information to achieve better strategy recognition?

In this paper, the proposed Cross-channel Feedback memOry Network (CFO-Net) leverages the persuadee's emotional feedback to iteratively measure the potential benefit of historical strategies, and the updated strategy representations are then used to guide strategy recognition. Specifically, the novel feedback memory module contains a strategy pool, which processes and stores the historical strategies, and a feedback pool, which updates the strategy weights based on the emotional feedback. Furthermore, the emotion-aware strategy representation and the contextual information interact through the designed cross-channel fusion predictor to produce the final strategy prediction.

The contributions can be summarized as follows:

  • We propose CFO-Net to leverage the persuadee's emotional feedback to measure the potential benefit of historical strategies, and to incorporate them into the context with a cross-channel fusion predictor for persuasive strategy recognition.

  • A novel feedback memory module is presented to keep track of the historical strategies and further to obtain the emotion-aware strategy representation in a dynamic and iterative manner.

  • Experiments on the dataset show that CFO-Net is highly competitive against strong baselines and significantly improves the performance of strategy recognition.

2 Related Work

2.1 Non-Collaborative Dialogue

In collaborative dialogue, systems collaborate and communicate with each other to achieve a common goal [8]. A large number of studies [3, 16, 32] have shown remarkable advances in the collaborative setting. However, these approaches are out of scope when applied to non-collaborative settings such as negotiation or persuasion. In the negotiation task, two agents have a conflict of interest but must strategically communicate to reach an agreement, as in a bargaining scenario [9]. In this paper, the main focus is the persuasive scenario, where the persuader tries to induce people to donate [34]. The persuasion strategies are identified as ten categories in [34], which can be divided into two types: 1) persuasive appeal and 2) persuasive inquiry. Specifically, persuasive appeal contains seven strategies (Logical appeal, Emotion appeal, Credibility appeal, Foot-in-the-door, Self-modeling, Personal story and Donation information). For example, personal story refers to the strategy of using narrative examples to state someone's donation experiences or the beneficiaries' positive outcomes, which can encourage others to follow the actions. The remaining three strategies (Task-related inquiry, Personal-related inquiry and Source-related inquiry) belong to persuasive inquiry, which builds better interpersonal relationships by asking questions. For example, source-related inquiry asks whether the persuadee knows about the organization (i.e., the source in our specific donation task).

2.2 Persuasive Dialogue Systems

Persuasive dialogue systems, which have attracted increasing attention, aim to change people's behaviors through persuasive strategies [1, 12, 21, 36]. For instance, [11] proposed a two-tiered annotation scheme to distinguish claims in an online persuasive forum. [10] proposed to predict persuasiveness by modeling argument sequences in social media. [35] designed a hierarchical neural network to identify persuasion strategies. Furthermore, some work focused on the contextual information and modeled the utterances to recognize the strategy. [7] explored and quantified the role of context in different aspects of dialogue for strategy prediction. [4] introduced a transformer-based approach coupled with a Conditional Random Field for strategy recognition. A few works, such as [30] and [6], considered resistance strategies to model strategic conversations. The Hybrid-RCNN [34] extracted sentiment embedding features (pos, neg, neu) but did not include the emotion in the history modeling and ignored the corresponding strategy. To overcome these defects, we present CFO-Net to leverage the emotional feedback to iteratively measure the potential benefits of strategies and incorporate them into the context.

3 Approach

As shown in Fig. 3, the proposed model consists of three components: (a) a hierarchical encoder, which encodes the contextual dependency with multi-head attention to capture the semantic information, (b) a feedback memory module, which models the interaction between the strategy pool and the feedback pool to obtain the emotion-aware strategy representation, and (c) a cross-channel fusion predictor, which fuses the emotional feedback with the contextual information and outputs the final result. Each component is described in the following.

Figure 3: The overview of CFO-Net, which consists of a hierarchical encoder, a feedback memory module and a cross-channel fusion predictor. Green and blue vertical bars denote the utterances of the persuader and the persuadee, respectively. The emotion-aware strategy representation is updated iteratively based on the strategy pool and the feedback pool.

3.1 Hierarchical Encoder

The hierarchical encoder uses a Bi-directional LSTM (BiLSTM) [13] or a BERT-style encoder [5, 18, 17], which captures the temporal features within the words. Then, multi-head attention explores the semantic information at different granularities.

3.1.1 Utterance Encoder with BiLSTM

The utterance encoder vectorizes an input utterance. Given a historical conversation $C=(u_1, u_2, \dots, u_N)$, a set of $N$ utterances, each utterance $u_i=(x_{i,1}, x_{i,2}, \dots, x_{i,T})$ consists of a sequence of $T$ words, and $u_N$ indicates the utterance of the persuader, which is used to predict the persuasion strategy. A BiLSTM is utilized to encode each word $x_{i,t}$ in the utterance $u_i \in C$, leading to a series of context-aware hidden states $(\textbf{h}_{i,1}, \textbf{h}_{i,2}, \dots, \textbf{h}_{i,T})$, where $\textbf{h}_{i,t}=\mathrm{concat}[\overrightarrow{\textbf{h}}_{i,t}; \overleftarrow{\textbf{h}}_{i,t}]$.

The last hidden state $\textbf{h}_{i,T}$ is taken as the utterance-level representation. (Note: the representation of the [CLS] token is used as the utterance-level representation in BERT-style encoders.) Therefore, the set of $N$ utterances in $C$ can be represented as $\textbf{H}=(\textbf{h}_{1,T}, \textbf{h}_{2,T}, \dots, \textbf{h}_{N,T})$.
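To make the encoder concrete, the following is a minimal PyTorch sketch of the BiLSTM utterance encoder described above; the vocabulary size and hidden dimensions are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Minimal sketch of the BiLSTM utterance encoder (sizes are illustrative)."""
    def __init__(self, vocab_size=30000, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, utterance_ids):
        # utterance_ids: (batch, T) word indices of one utterance u_i
        emb = self.embedding(utterance_ids)   # (batch, T, emb_dim)
        states, _ = self.bilstm(emb)          # (batch, T, 2*hidden_dim); h_{i,t} = [forward; backward]
        return states[:, -1, :]               # last hidden state h_{i,T} as the utterance-level vector
```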

3.1.2 Utterance-level Multi-head Attention

To explore the semantic information at different granularities, multi-head attention [31] is adopted, as shown in Eq. (1), where $\textbf{c}_i$ indicates the representation of the $i$-th utterance:

\textbf{c}_{i} = \mathrm{Multi\text{-}head\ Attention}(\textbf{h}_{i,T})    (1)
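As a rough sketch of Eq. (1), multi-head attention can be applied over the sequence of utterance-level vectors $\textbf{H}$; the snippet below assumes self-attention with nn.MultiheadAttention and uses illustrative dimensions.

```python
import torch
import torch.nn as nn

# Sketch of the utterance-level multi-head attention (Eq. 1), assuming
# self-attention over H = (h_{1,T}, ..., h_{N,T}); sizes are illustrative.
d_model, n_heads, N = 512, 8, 12
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

H = torch.randn(1, N, d_model)   # (batch, N utterances, d_model)
C, _ = mha(H, H, H)              # c_i attends over all utterance vectors
print(C.shape)                   # torch.Size([1, 12, 512])
```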

3.2 Feedback Memory Module

The proposed feedback memory module is composed of three novel factors. (i) Strategy Embedding represents the features of strategies and is continuously updated to capture persuasive strategy features. (ii) Strategy Pool temporarily processes and stores all the possible historical strategies for future reference. (iii) Feedback Pool considers the emotional feedback of the persuadee to measure the potential benefits of strategies and updates the strategy weight $\gamma$. Finally, the strategy pool and the feedback pool interact to obtain the emotion-aware strategy representation for later strategy recognition.

3.2.1 Strategy Embedding

In the feedback memory module, a randomly initialized strategy embedding $\textbf{S}\in\mathbb{R}^{L\times d}$ is defined to represent the strategy features, where $L$ is the number of strategy labels and $d$ indicates the dimension. The strategy embedding is continuously updated to capture persuasive strategy features. Specifically, CFO-Net selects the appropriate strategies (i.e., the top-$k$) from the strategy embedding based on the context and stores them into the strategy pool with the context-aware softmax function shown in Eq. (2).

Figure 4: The two-stream mask mechanisms defined in the feedback memory module.

3.2.2 Strategy Pool

The strategy pool aims to process and store the possible historical strategies for future reference. As shown in Fig. 4, to achieve the selection of strategies and prevent gradient truncation, two-stream mask mechanisms are defined as follows:

  • $mask_p$: The selected strategies (i.e., the top-$k$) are stored into the strategy pool to reserve the possible historical strategies (here, the pool size is set to 10).

  • $mask_f$: The best strategy of the current moment is stored into the feedback pool.

Specifically, the module first outputs a probability distribution $\alpha$ over the strategies based on the contextual information, as:

\alpha = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N}]))    (2)

Then, $mask_p$ is obtained from $\alpha$ with the top-$k$ operation, where $k$ is a hyper-parameter, and $mask_f$ is obtained when $k=1$. The strategies $\textbf{S}_p$, which contain multiple possible strategies, are stored into the strategy pool, as:

\textbf{S}_{p} = \textbf{S} \odot (mask_{p} \otimes \mathbf{e}_{d})    (3)

where $\odot$ is element-wise multiplication and $(\cdot\otimes\mathbf{e}_{d})$ produces a matrix by repeating the vector on the left $d$ times [33].

The strategies $\textbf{S}_m$ in the strategy pool are obtained by concatenating the stored strategies $\textbf{S}_p$. Similarly, the strategy $\textbf{S}_f$ stored into the feedback pool is formulated as:

\textbf{S}_{f} = \textbf{S} \odot (mask_{f} \otimes \mathbf{e}_{d})    (4)
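The following sketch illustrates how the two-stream masks and the pooled strategies of Eqs. (2)-(4) could be computed; the MLP logits, the number of labels $L$, and the selected indices are assumptions for illustration only, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of the two-stream masks and the strategy/feedback pools (Eqs. 2-4).
L, d, k = 11, 512, 2                     # illustrative sizes
S = torch.randn(L, d)                    # strategy embedding S
logits = torch.randn(L)                  # assumed output of MLP([c_1; ...; c_N])

alpha = F.softmax(logits, dim=-1)        # Eq. (2): context-based strategy distribution
mask_p = torch.zeros(L)
mask_p[alpha.topk(k).indices] = 1.0      # top-k strategies kept for the strategy pool
mask_f = torch.zeros(L)
mask_f[alpha.argmax()] = 1.0             # best strategy routed to the feedback pool

S_p = S * mask_p.unsqueeze(-1)           # Eq. (3): strategies stored in the strategy pool
S_f = S * mask_f.unsqueeze(-1)           # Eq. (4): strategy stored in the feedback pool
```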

3.2.3 Feedback Pool

The purpose of the feedback pool is to dynamically update the strategy weight $\gamma$ to record the emotional feedback of the persuadee towards the strategy. The tuple {strategy, emotion} stored in the pool is used to calculate the strategy weight $\gamma\in\mathbb{R}^{L}$, which in turn yields the subsequent emotion-aware strategy representation. Firstly, the utterance representations $\textbf{c}_i$ are used to predict the emotional label $\textbf{y}^{e}\in\{pos, neu, neg\}$ of the persuadee, as:

\textbf{y}^{e} = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N-1}]))    (5)

where $\textbf{c}_{N-1}$ indicates the $(N-1)^{th}$ utterance, spoken by the persuadee.

Figure 5: The three different cross-channel fusion mechanisms: (a) MLP [20], (b) double-head linear layer and (c) co-interactive attention layer.

Then, the weight $\gamma$ is assigned based on the score of the predicted emotion and the stream $mask_f$. To enhance the generalization of the model, the soft weight $\gamma\in\mathbb{R}^{L}$ (initialized as an all-one vector) is defined as:

\gamma_{i}=\begin{cases}\gamma_{i}+mask_{f}\cdot\mu\exp^{-\zeta} & \mathrm{if\ pos};\\ \gamma_{i} & \mathrm{if\ neu};\\ \gamma_{i}-mask_{f}\cdot\mu\exp^{-\zeta} & \mathrm{if\ neg};\end{cases}    (6)

where the scalar parameter $\mu$ controls the proportion of $\exp^{-\zeta}$, which is guaranteed to be greater than zero. In the first condition, the weight $\gamma$ increases more when $\zeta$ becomes smaller. To this end, we intuitively set the confidence factor $\zeta$ to depend on the score of the emotion $\textbf{y}^{e}$, as:

\zeta = (1 - y^{e}_{x})    (7)

where $y^{e}_{x}$ is a scalar indicating the score of the emotion $x\in\{pos, neu, neg\}$. Finally, the emotion-aware strategy representation $\textbf{S}^{\prime}$ is modeled as:

\textbf{S}^{\prime} = \gamma \cdot \textbf{S}_{m}    (8)
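A minimal sketch of the feedback-pool update in Eqs. (5)-(8) is given below; the emotion scores, the index stored in the feedback pool, and the value of $\mu$ are illustrative assumptions rather than details from the paper.

```python
import torch

# Sketch of the feedback-pool weight update (Eqs. 5-8); values are illustrative.
L, d, mu = 11, 512, 0.5
gamma = torch.ones(L)                               # soft weight, all-one at the start

emotion_probs = torch.tensor([0.7, 0.2, 0.1])       # assumed softmax scores for {pos, neu, neg}
label = ["pos", "neu", "neg"][emotion_probs.argmax().item()]
zeta = 1.0 - emotion_probs.max()                    # Eq. (7): confidence factor
delta = mu * torch.exp(-zeta)                       # update magnitude, always positive

mask_f = torch.zeros(L)
mask_f[3] = 1.0                                     # strategy currently in the feedback pool (assumed index)
if label == "pos":
    gamma = gamma + mask_f * delta                  # Eq. (6): reward the strategy
elif label == "neg":
    gamma = gamma - mask_f * delta                  # Eq. (6): penalise the strategy
# a neutral emotion leaves gamma unchanged

S_m = torch.randn(L, d)                             # strategies from the strategy pool (assumed)
S_prime = gamma.unsqueeze(-1) * S_m                 # Eq. (8): emotion-aware strategy representation
```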

3.3 Cross-channel Fusion Predictor

In this section, the predictor recognizes the strategy. Three main types of fusion mechanisms are designed for horizontal comparison in Fig. 5. The mechanisms are introduced to fully interact the psychological feedback with the contextual dialogue information, and the predictor outputs the fused distribution, which captures the profound relationships between the two sources.

3.3.1 Multi-layer Perceptron

An MLP can obtain the integrated representation automatically in a simple fashion, as:

\textbf{g} = \mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N};\textbf{s}_{1}^{\prime};\dots;\textbf{s}_{L}^{\prime}])    (9)

The predicted distribution of the strategy $\textbf{y}^{s}$ can be defined as follows:

\textbf{y}^{s} = \mathrm{softmax}(\textbf{W}^{s}\textbf{g}+\textbf{b}_{s})    (10)

where $\textbf{W}^{s}\in\mathbb{R}^{L\times 2d}$ is a transformation matrix, $\textbf{b}_{s}\in\mathbb{R}^{L}$ is the bias vector, and $L$ is the number of labels.
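A minimal sketch of this MLP fusion (Eqs. 9-10) follows; the dimensions and the single hidden layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the MLP fusion predictor (Eqs. 9-10); sizes are illustrative.
N, L, d = 12, 11, 512
C = torch.randn(N, d)                  # c_1, ..., c_N
S_prime = torch.randn(L, d)            # s'_1, ..., s'_L

fusion_mlp = nn.Sequential(nn.Linear((N + L) * d, 2 * d), nn.ReLU())
classifier = nn.Linear(2 * d, L)       # W^s in R^{L x 2d}, b_s in R^L

g = fusion_mlp(torch.cat([C.flatten(), S_prime.flatten()]))  # Eq. (9)
y_s = torch.softmax(classifier(g), dim=-1)                   # Eq. (10)
```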

3.3.2 Double-head Linear Layer

To fuse two probability distributions, a double-head linear layer is designed for prediction. Specifically, we introduce two MLPs to calculate the respective probabilities and then combine them, as:

\textbf{y}^{s}_{1} = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N}]))    (11)
\textbf{y}^{s}_{2} = \mathrm{softmax}(\mathrm{MLP}([\textbf{s}_{1}^{\prime};\dots;\textbf{s}_{L}^{\prime}]))    (12)
\textbf{y}^{s} = \mathrm{softmax}(\textbf{y}^{s}_{1}+\textbf{y}^{s}_{2})    (13)

where $\textbf{y}^{s}_{1}\in\mathbb{R}^{L}$ and $\textbf{y}^{s}_{2}\in\mathbb{R}^{L}$, and $\textbf{y}^{s}$ is the final predicted distribution of the strategy.
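A rough sketch of the double-head linear layer (Eqs. 11-13) is shown below; single linear layers stand in for the two MLPs and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the double-head linear layer (Eqs. 11-13).
N, L, d = 12, 11, 512
C = torch.randn(N, d)
S_prime = torch.randn(L, d)

context_head = nn.Linear(N * d, L)      # head over [c_1; ...; c_N]
strategy_head = nn.Linear(L * d, L)     # head over [s'_1; ...; s'_L]

y1 = torch.softmax(context_head(C.flatten()), dim=-1)         # Eq. (11)
y2 = torch.softmax(strategy_head(S_prime.flatten()), dim=-1)  # Eq. (12)
y_s = torch.softmax(y1 + y2, dim=-1)                          # Eq. (13): fused distribution
```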

3.3.3 Co-interactive Attention Layer

Motivated by the attention mechanism [19, 29, 26], the co-interactive attention layer is proposed to effectively model the mutual relational dependency. In this layer, attention is computed in two directions: from $\textbf{C}=(\textbf{c}_{1},\dots,\textbf{c}_{N})$ to $\textbf{S}^{\prime}=(\textbf{s}_{1}^{\prime},\dots,\textbf{s}_{L}^{\prime})$ as well as from $\textbf{S}^{\prime}$ to $\textbf{C}$.

Specifically, the layer first yields a shared similarity matrix $\textbf{A}\in\mathbb{R}^{N\times L}$ between $\textbf{C}$ and $\textbf{S}^{\prime}$, where $\textbf{A}_{ij}$ indicates the similarity between the $i$-th context-aware utterance and the $j$-th emotion-aware strategy, as:

\textbf{A}_{ij} = \mathcal{F}(\textbf{C}_{:i}, \textbf{S}^{\prime}_{:j})    (14)

where $\mathcal{F}$ is a dot-product function, $\textbf{C}_{:i}$ is the $i$-th row vector of $\textbf{C}$, and $\textbf{S}^{\prime}_{:j}$ is the $j$-th row vector of $\textbf{S}^{\prime}$.

The attention weights and the attended vectors can be obtained in both directions. Firstly, considering the direction from $\textbf{S}^{\prime}$ to $\textbf{C}$, the attention weight is computed by $\textbf{a}_{i}=\mathrm{softmax}(\textbf{A}_{i:})\in\mathbb{R}^{L}$, and subsequently the context-aware utterance vector is $\tilde{\textbf{C}}_{:i}=\sum_{j}\textbf{a}_{ij}\textbf{S}^{\prime}_{:j}$. Similarly, the attention weight is $\textbf{b}_{j}=\mathrm{softmax}(\textbf{A}_{:j})\in\mathbb{R}^{N}$, and the updated emotion-aware strategy vector is $\tilde{\textbf{S}}^{\prime}_{:j}=\sum_{i}\textbf{b}_{ij}\textbf{C}_{i:}$.

Finally, the context-aware utterance representation and the emotion-aware strategy representation are combined to yield $\textbf{g}$ and $\textbf{y}^{s}$, following Eq. (9) and Eq. (10):

\textbf{y}^{s} = \mathrm{softmax}(\textbf{W}^{s}\textbf{g}+\textbf{b}_{s})    (15)
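The co-interactive attention of Eq. (14) and the two attention directions reduce to a dot-product similarity matrix with row- and column-wise softmax; a minimal sketch with illustrative sizes follows.

```python
import torch

# Sketch of the co-interactive attention layer (Eq. 14 and both directions).
N, L, d = 12, 11, 512
C = torch.randn(N, d)                        # context-aware utterance vectors
S_prime = torch.randn(L, d)                  # emotion-aware strategy vectors

A = C @ S_prime.T                            # Eq. (14): dot-product similarity, (N, L)
C_tilde = torch.softmax(A, dim=1) @ S_prime  # S' -> C direction: attended utterances, (N, d)
S_tilde = torch.softmax(A, dim=0).T @ C      # C -> S' direction: attended strategies, (L, d)
# C_tilde and S_tilde are then combined as in Eqs. (9)-(10) to obtain y^s (Eq. 15).
```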

3.4 Training

The objective of strategy and emotion prediction can be formulated as:

\mathcal{L}_{s} = -\sum_{i=1}^{D}\hat{\textbf{y}}_{i}^{s}\log(\textbf{y}_{i}^{s})    (16)
\mathcal{L}_{e} = -\sum_{i=1}^{D}\hat{\textbf{y}}_{i}^{e}\log(\textbf{y}_{i}^{e})    (17)

where $D$ is the number of training examples, and $\hat{\textbf{y}}_{i}^{s}$ and $\hat{\textbf{y}}_{i}^{e}$ are the gold strategy label and sentiment label, respectively. The joint objective function $\mathcal{L}_{\theta}$ is formulated with the hyper-parameters $\beta$ as $\mathcal{L}_{\theta}=\beta_{1}\mathcal{L}_{s}+\beta_{2}\mathcal{L}_{e}$.
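Since both objectives are standard cross-entropy terms, the joint loss can be written as in the sketch below; the batch of logits and gold labels is synthetic and for illustration only.

```python
import torch
import torch.nn.functional as F

# Sketch of the joint objective L_theta = beta1 * L_s + beta2 * L_e (Eqs. 16-17).
beta1, beta2 = 0.5, 0.5
strategy_logits = torch.randn(8, 11, requires_grad=True)  # (batch, L strategy labels)
emotion_logits = torch.randn(8, 3, requires_grad=True)    # (batch, {pos, neu, neg})
strategy_gold = torch.randint(0, 11, (8,))
emotion_gold = torch.randint(0, 3, (8,))

loss_s = F.cross_entropy(strategy_logits, strategy_gold)  # Eq. (16)
loss_e = F.cross_entropy(emotion_logits, emotion_gold)    # Eq. (17)
loss = beta1 * loss_s + beta2 * loss_e
loss.backward()
```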

4 Experiments

4.1 Experimental Setting

Dataset & Evaluation Metric Considering that no other dataset provides emotional scores, we focus on PersuasionForGood [34] (the data are available at https://gitlab.com/ucdavisnlp/persuasionforgood), whose sentiment labels can be obtained from the manually annotated scores. The persuader strategies are identified as ten categories (detailed in Section 2) plus one none category. As for the evaluation metrics, Precision, Recall, and Macro F1 (M-F1) are used for strategy recognition and emotion prediction, as the dataset is highly imbalanced [4].

Implementation details The BERT-style baselines use the same hyper-parameters as given in their papers [5, 18]. The Adam optimizer [15] is used for training, with the initial learning rate chosen from {2e-5, 4e-5, 6e-5, 8e-5} and the mini-batch size from {32, 64}. The number of training epochs is chosen from {3, 5, 7, 9}. The scalar parameter $\mu$ is chosen from {0.2, 0.5}, and $k$ is set to 2 based on the parameter analysis. The historical strategies and emotions are preprocessed into the two pools. To coordinate the joint training of the two objectives, we set $\beta_{1}=\beta_{2}=0.5$. A Tesla V-100 GPU and PyTorch [24] are used to implement our experiments.
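For reference, the search space described above can be summarised as a simple configuration dictionary; the variable names are ours, while the values are taken from the text.

```python
# Hyper-parameter search space summarised from the text (names are ours).
search_space = {
    "learning_rate": [2e-5, 4e-5, 6e-5, 8e-5],
    "batch_size": [32, 64],
    "epochs": [3, 5, 7, 9],
    "mu": [0.2, 0.5],
    "top_k": 2,          # fixed based on the parameter analysis
    "beta1": 0.5,
    "beta2": 0.5,
}
```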

Model | Strategy Recognition: P↑ | R↑ | M-F1↑ | Emotion Prediction: P↑ | R↑ | M-F1↑
Hybrid RCNN + All features [34] | 62.17* | 59.80* | 58.76* | - | - | -
RoBERTa-large LogReg [4] | 64.88* | 68.32* | 63.15* | - | - | -
RoBERTa-large cLSTM [4] | / | / | 64.10 | - | - | -
RoBERTa-large DialogueRNN [4] | / | / | 64.30 | - | - | -
RoBERTa-base [18] | 59.58 | 64.39 | 58.35 | 53.21 | 72.05 | 60.41
CFO-Net-base | 63.29 | 67.74 | 62.41 | 53.08 | 75.22 | 61.94
RoBERTa-large [18] | 62.69 | 69.91 | 61.74 | 55.49 | 71.30 | 62.11
CFO-Net-large | 66.81 | 72.28 | 65.41 | 58.11 | 75.88 | 63.91
Table 1: Experiments on PersuasionForGood for strategy recognition and emotion prediction. "-" indicates that the baseline does not take emotional feedback into account, so the result is unavailable. "*" indicates experiments implemented by ourselves.

4.2 Experimental Results

4.2.1 Baselines

State-of-the-art models are used as baselines to test the performance. Considering the advantages of pre-trained language models (PLMs), we replace the BiLSTM with RoBERTa [18] to strengthen the baseline for a fair comparison, following [4]. Both base and large PLMs are used in the main experiments for a complete comparison; to increase training speed, the base PLMs are used in the other experiments. The other baselines are shown in Table 1: [34] used a hybrid RCNN model to extract textual features, and [4] combined PLMs with several state-of-the-art models to recognize the strategy of the persuader.

4.2.2 Main Results

As depicted in Table 1, compared with the state-of-the-art models and RoBERTa, our CFO-Net (with the double-head linear layer) improves considerably. CFO-Net achieves a 4.12% gain in Precision, a 2.37% gain in Recall and a 3.67% gain in M-F1 over RoBERTa-large, which demonstrates that the psychological feedback of the persuadee is beneficial for strategy recognition. The M-F1 also surpasses that of RoBERTa DialogueRNN, where four tasks are jointly trained, which shows that CFO-Net can achieve better performance with fewer tasks. As for the emotion prediction task, CFO-Net also improves the performance, which shows that jointly training the tasks provides mutual benefits. This phenomenon illustrates that the emotional feedback of the persuadee has the potential to help the strategy recognition task. Our code will be released at https://github.com/pengwei-iie/CFONETWORK.

4.3 Ablation Study

To get a better insight into the components of CFO-Net, an ablation study is performed in Table 2. The experiments demonstrate that each component is beneficial to the final results. Note that by removing the feedback memory module, configuration (1) reduces to the RoBERTa-base model.

4.3.1 w/o Feedback Memory Module

In this setting, the feedback memory module is abandoned for exploring the effectiveness of the psychological feedback. From the result, the performance has declined significantly in all metrics, which confirms our hypothesis that introducing the emotion of the persuadee to the strategy recognition is important.

Model | Strategy Recognition: P↑ | R↑ | M-F1↑ | Δ(M-F1) | Emotion Prediction: P↑ | R↑ | M-F1↑ | Δ(M-F1)
CFO-Net + RoBERTa-base | 63.29 | 67.74 | 62.41 | - | 53.08 | 75.22 | 61.94 | -
(1) w/o feedback memory module | 59.58 | 64.39 | 58.35 | -4.06 | 53.21 | 72.05 | 60.41 | -1.54
(2) w/o multi-task learning | 59.04 | 65.17 | 58.50 | -3.91 | 53.06 | 72.62 | 60.52 | -1.42
(3) w/o cross-channel fusion | 62.44 | 66.38 | 60.53 | -1.88 | 53.12 | 72.97 | 60.84 | -1.10
Table 2: The results of the ablation study on model components.

4.3.2 w/o Multi-task Learning

Multi-task learning considers the mutual connection between tasks by sharing latent representations. Here, the emotion prediction task is removed to observe the performance of strategy recognition. As shown in Table 2, multi-task learning, i.e., jointly training with $\mathcal{L}_{s}$ and $\mathcal{L}_{e}$, provides benefits, which shows that the two training objectives are closely related and boost each other.

4.3.3 w/o Cross-channel Fusion

The cross-channel fusion combines the persuader-aware contextual dependency with the persuadee-aware emotional dependency. In this setting, these representations are concatenated directly to make a prediction. The results indicate that the fusion mechanisms contribute to the overall performance.

4.4 Performances on the Fusion Mechanism

The fusion mechanism is adopted to exploit the two types of interaction, namely the persuader-aware contextual dependency and the persuadee-aware emotional dependency. To further investigate the effectiveness of these mechanisms, experiments are conducted from two perspectives, as shown in Fig. 6 and Fig. 7: one is the comparison between the three fusion methods and the baselines, and the other is the horizontal comparison among the fusion mechanisms.

Figure 6: The performances of the fusion mechanisms. (a), (b), (c) show the results of the baseline versus the MLP, the double-head linear layer and the co-interactive attention layer, respectively.
Figure 7: The performances and comparisons of the three different fusion mechanisms.

As shown in Fig. 6, the results show that the fusion mechanisms incorporating the persuadee-aware emotional dependency into the persuader-aware contextual dependency bring consistent improvements and surpass the baselines on all evaluation metrics. In addition, Fig. 7 presents the performances of the different fusion mechanisms, in which the double-head linear layer performs best, with the M-F1 score reaching 62.41%. Surprisingly, the co-interactive attention layer underperforms the double-head linear layer. This could be attributed to the fact that the strategy representation and the utterance-level dialogue information belong to different levels of abstract semantic information, so the co-attention operation introduces noise.

4.5 Parameter Analysis

In the feedback memory module, $k$ is a key hyper-parameter. As shown in Table 3, the model introduces more noise when $k$ is set too large, and the confidence score becomes lower, leading to worse performance. On the contrary, the enriched semantic representations of the strategies are ignored when $k$ is set to one: although the confidence score is higher, the performance is not the best. The analysis validates that an appropriate $k$ is crucial to the experimental results.

Top-$k$ | Top-1 | Top-2 | Top-3 | Top-4
M-F1 | 60.68 | 62.41 | 60.42 | 58.68
Confidence Score | 0.877 | 0.473 | 0.326 | 0.242
Table 3: Performance with respect to the hyper-parameter $k$. The confidence score indicates the $k^{th}$ average predicted probability.

4.6 Case Study

A case study is conducted with the example in Fig. 8 to demonstrate how CFO-Net works when recognizing a strategy. We list the possible strategies, the states of the strategy pool and the feedback pool, and the updated weights. In this case, two possible strategies are selected into the strategy pool at a time. Then, the predicted emotion and the strategy with the highest score are stored into the feedback pool as a tuple, such as <Personal-related inquiry, A>, where A represents positive, neutral or negative. Finally, the weights $\gamma$ are calculated with Eq. (6). During the conversation, the strategy recognition depends not only on the contextual dialogue information but also on the emotional feedback of the persuadee. The weights are utilized to compute the emotion-aware strategy representation for the final prediction. In the third turn, CFO-Net outputs the correct prediction emotion appeal rather than personal-related inquiry, which has the highest score calculated from the contextual dialogue information alone; this indicates that incorporating the emotion-aware strategy representation into the contextual dialogue information is of great importance.

Figure 8: An example illustrating the process of the novel feedback memory module. The red markers indicate the changes.

5 Conclusion

This paper concentrates on incorporating the psychological feedback (the emotion of the persuadee) into the recognition of strategies in persuasive dialogue. We propose a novel Cross-channel Feedback memOry Network (CFO-Net), with a feedback memory module and three different cross-channel fusion mechanisms, to model and explore the historical emotional feedback of the persuadee. Experimental results and analysis demonstrate that CFO-Net is highly competitive against strong baselines and significantly improves the performance of strategy recognition. In future work, other categories of psychological feedback, such as personal character and educational background, will be considered with a BiLSTM-CRF. These cognitive factors remain worth researching for persuasive dialogue systems. Furthermore, dialogue systems can utilize the predicted historical strategy chains to guide the dialogue generation task.

6 Acknowledgment

We thank all anonymous reviewers for their constructive comments, which we have incorporated into this revision. This work is supported by the National Natural Science Foundation of China (No. U21B2009).

References

  • [1] André, E., Rist, T., Van Mulken, S., Klesen, M., Baldes, S.: The automated design of believable dialogues for animated presentation teams. Embodied conversational agents pp. 220–255 (2000)
  • [2] Baron-Cohen, S., Wheelwright, S.: The empathy quotient: An investigation of adults with asperger syndrome or high functioning autism, and normal sex differences. Journal of Autism and Developmental Disorders 34, 163–175 (2004)
  • [3] Bowden, K.K., Oraby, S., Wu, J., Misra, A., Walker, M.A.: Combining search with structured data to create a more engaging user experience in open domain dialogue. CoRR abs/1709.05411 (2017), http://arxiv.org/abs/1709.05411
  • [4] Chen, H., Ghosal, D., Majumder, N., Hussain, A., Poria, S.: Persuasive dialogue understanding: The baselines and negative results. Neurocomputing 431, 47–56 (2021). https://doi.org/10.1016/j.neucom.2020.11.040, https://doi.org/10.1016/j.neucom.2020.11.040
  • [5] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. pp. 4171–4186 (2019)
  • [6] Dutt, R., Sinha, S., Joshi, R., Chakraborty, S.S., Riggs, M., Yan, X., Bao, H., Rosé, C.P.: RESPER: computationally modelling resisting strategies in persuasive conversations. CoRR abs/2101.10545 (2021), https://arxiv.org/abs/2101.10545
  • [7] Ghosal, D., Majumder, N., Mihalcea, R., Poria, S.: Utterance-level dialogue understanding: An empirical study. CoRR abs/2009.13902 (2020), https://arxiv.org/abs/2009.13902
  • [8] He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In: Barzilay, R., Kan, M. (eds.) ACL. pp. 1766–1776. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1162, https://doi.org/10.18653/v1/P17-1162
  • [9] He, H., Chen, D., Balakrishnan, A., Liang, P.: Decoupling strategy and generation in negotiation dialogues. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) EMNLP. pp. 2333–2343. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/d18-1256, https://doi.org/10.18653/v1/d18-1256
  • [10] Hidey, C., McKeown, K.R.: Persuasive influence detection: The role of argument sequencing. In: McIlraith, S.A., Weinberger, K.Q. (eds.) AAAI. pp. 5173–5180. AAAI Press (2018), https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17077
  • [11] Hidey, C., Musi, E., Hwang, A., Muresan, S., McKeown, K.: Analyzing the semantic types of claims and premises in an online persuasive forum. In: Habernal, I., Gurevych, I., Ashley, K.D., Cardie, C., Green, N., Litman, D.J., Petasis, G., Reed, C., Slonim, N., Walker, V.R. (eds.) Proceedings of the 4th Workshop on Argument Mining, ArgMining@EMNLP 2017, Copenhagen, Denmark, September 8, 2017. pp. 11–21. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/w17-5102, https://doi.org/10.18653/v1/w17-5102
  • [12] Hilf, B.: Book reviews - the persuasive power of computers (A review of persuasive technology: Using computers to change what we think and do by B.J. fogg). IEEE Distributed Syst. Online 4(6) (2003)
  • [13] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9, 1735–1780 (1997)
  • [14] Iyer, R.R., Sycara, K.P.: An unsupervised domain-independent framework for automated detection of persuasion tactics in text. CoRR abs/1912.06745 (2019), http://arxiv.org/abs/1912.06745
  • [15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [16] Larionov, G., Kaden, Z., Dureddy, H.V., Kalejaiye, G.B.T., Kale, M., Potharaju, S.P., Shah, A.P., Rudnicky, A.I.: Tartan: A retrieval-based socialbot powered by a dynamic finite-state machine architecture. arXiv preprint arXiv:1812.01260 (2018)
  • [17] Li, Y., Cao, J., Cong, X., Zhang, Z., Yu, B., Zhu, H., Liu, T.: Enhancing chinese pre-trained language model via heterogeneous linguistics graph. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. pp. 1986–1996. Association for Computational Linguistics (2022), https://aclanthology.org/2022.acl-long.140
  • [18] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)
  • [19] Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) EMNLP. pp. 1412–1421. The Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/d15-1166, https://doi.org/10.18653/v1/d15-1166
  • [20] Nguyen, D., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR. pp. 6087–6096. IEEE Computer Society (2018). https://doi.org/10.1109/CVPR.2018.00637
  • [21] Oinas-Kukkonen, H., Harjumaa, M.: Towards deeper understanding of persuasion in software and information systems. First International Conference on Advances in Computer-Human Interaction pp. 200–205 (2008)
  • [22] Pamungkas, E.W.: Emotionally-aware chatbots: A survey. CoRR abs/1906.09774 (2019), http://arxiv.org/abs/1906.09774
  • [23] Partala, T., Surakka, V.: The effects of affective interventions in human-computer interaction. Interact. Comput. 16(2), 295–309 (2004). https://doi.org/10.1016/j.intcom.2003.12.001, https://doi.org/10.1016/j.intcom.2003.12.001
  • [24] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., Devito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
  • [25] Peng, W., Hu, Y., Xing, L., Xie, Y., Sun, Y., Li, Y.: Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. CoRR abs/2204.12749 (2022). https://doi.org/10.48550/arXiv.2204.12749, https://doi.org/10.48550/arXiv.2204.12749
  • [26] Peng, W., Hu, Y., Yu, J., Xing, L., Xie, Y.: Aper: Adaptive evidence-driven reasoning network for machine reading comprehension with unanswerable questions. Knowledge-Based Systems 229, 107364 (2021)
  • [27] Prendinger, H., Ishizuka, M.: The empathic companion: A character-based interface that addresses users’ affective states. Appl. Artif. Intell. 19(3-4), 267–285 (2005). https://doi.org/10.1080/08839510590910174, https://doi.org/10.1080/08839510590910174
  • [28] Scott, J.: Understanding contemporary society: Theories of the present - rational choice theory- complexity theory (2000)
  • [29] Seo, M.J., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. In: ICLR (2017), https://openreview.net/forum?id=HJ0UKP9ge
  • [30] Tian, Y., Shi, W., Li, C., Yu, Z.: Understanding user resistance strategies in persuasive conversations. In: Cohn, T., He, Y., Liu, Y. (eds.) EMNLP. pp. 4794–4798. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.431, https://doi.org/10.18653/v1/2020.findings-emnlp.431
  • [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 5998–6008 (2017), https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  • [32] Wang, Q., Cao, Y., Jiang, J., Wang, Y., Tong, L., Guo, L.: Incorporating specific knowledge into end-to-end task-oriented dialogue systems. 2021 International Joint Conference on Neural Networks (IJCNN) pp. 1–8 (2021)
  • [33] Wang, S., Jiang, J.: Machine comprehension using match-lstm and answer pointer. In: ICLR (2017), https://openreview.net/forum?id=B1-q5Pqxl
  • [34] Wang, X., Shi, W., Kim, R., Oh, Y., Yang, S., Zhang, J., Yu, Z.: Persuasion for good: Towards a personalized persuasive dialogue system for social good. In: Korhonen, A., Traum, D.R., Màrquez, L. (eds.) ACL. pp. 5635–5649. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/p19-1566, https://doi.org/10.18653/v1/p19-1566
  • [35] Yang, D., Chen, J., Yang, Z., Jurafsky, D., Hovy, E.H.: Let’s make your request more persuasive: Modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms. In: Burstein, J., Doran, C., Solorio, T. (eds.) NAACL-HLT. pp. 3620–3630. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1364, https://doi.org/10.18653/v1/n19-1364
  • [36] Yuan, T., Moore, D.J., Grierson, A.: A human-computer dialogue system for educational debate: A computational dialectics approach. Int. J. Artif. Intell. Educ. 18(1), 3–26 (2008), http://content.iospress.com/articles/international-journal-of-artificial-intelligence-in-education/jai18-1-02