
1 Institute of Information Engineering, Chinese Academy of Sciences, China
2 School of Cyber Security, University of Chinese Academy of Sciences, China

Do You Know My Emotion? Emotion-Aware Strategy Recognition towards a Persuasive Dialogue System

Wei Peng 1,2 (0000-0001-8179-1577), Yue Hu 1,2 (corresponding author), Luxi Xing 1,2, Yuqiang Xie 1,2, Yajing Sun 1,2
Abstract

The persuasive strategy recognition task requires the system to recognize the strategy adopted by the persuader according to the conversation. However, previous methods mainly focus on the contextual information, and little is known about incorporating the psychological feedback, i.e., the emotion of the persuadee, to predict the strategy. In this paper, we propose a Cross-channel Feedback memOry Network (CFO-Net) that leverages the emotional feedback to iteratively measure the potential benefits of strategies and incorporate them into the contextual-aware dialogue information. Specifically, CFO-Net designs a feedback memory module, including a strategy pool and a feedback pool, to obtain an emotion-aware strategy representation. The strategy pool stores historical strategies, and the feedback pool updates the strategy weights based on the emotional feedback. Furthermore, a cross-channel fusion predictor is developed to enable mutual interaction between the emotion-aware strategy representation and the contextual-aware dialogue information for strategy recognition. Experimental results on PersuasionForGood confirm that the proposed CFO-Net effectively improves the M-F1 score from 61.74 to 65.41.

Keywords:
Persuasive dialogue · Emotional feedback · Strategy recognition.

1 Introduction

Persuasive conversation is an essential area of dialogue systems and has seen a boom in recent NLP research [4, 32, 10, 35]. In a dyadic persuasive dialogue, one party, the persuader, tries to induce the other party, the persuadee, to believe something or to do something [14] through a series of persuasion strategies [34]. However, recognizing the persuasion strategy is challenging for natural language understanding, since it requires a deeper understanding of the conversation, its semantic information, and even the psychological feedback of the speakers [4, 27, 25]. Furthermore, dialogue systems can utilize the predicted historical strategy chains to guide the dialogue generation task.

Figure 1: Statistics from the dataset showing the relationships between emotion and strategy.

To predict persuasive strategies, mainstream studies [4, 7] focused on the conversational context to recognize strategies. Some work, such as [6] and [30], considered resistance strategies to model strategic conversations. However, analyzing and understanding the speaker's psychological emotion is essential [22, 23] for fully understanding the conversation and helping the persuader adopt appropriate strategies. Previous methods do not take the persuadee's emotional feedback into account and thereby fail to model emotion-aware persuasive dialogue systems. To illustrate the importance of emotional feedback, Fig. 1 presents statistics from the dataset on the relationship between emotion and strategy. The plane is divided into four quadrants. As shown in quadrant I, if the persuadee shows positive emotion after strategy $\mathcal{X}$ is used, the probability that strategy $\mathcal{X}$ continues to be used in the following conversation is 63%. Similarly, in quadrant III, when the persuadee shows negative emotion, the probability that the strategy is not used in the subsequent conversation is 75%.

Figure 2: An example comparing previous work (a), which utilizes the contextual information, with our work (b), which considers the emotional feedback of the persuadee to recognize the strategy. ⓝ indicates the order of processes.

The statistical results indicate that if a strategy obtains positive feedback, it can be given priority in the future; on the contrary, it should receive less attention [2, 28]. To present the discrepancy between the previous work (a) and ours (b), an example is shown in Fig. 2. Specifically, in the third turn, (a) outputs the wrong prediction personal-related inquiry, which received negative emotional feedback in the previous turn. Therefore, it would be more appropriate to give priority to a different strategy based on the emotional feedback. This leaves us with a question: how can emotional feedback be modeled and incorporated into the contextual dialogue information to achieve better strategy recognition?

In this paper, the proposed Cross-channel Feedback memOry Network (CFO-Net) leverages the persuadee's emotional feedback to iteratively measure the potential benefit of historical strategies, and the updated strategy representations are then used to guide strategy recognition. Specifically, the novel feedback memory module contains a strategy pool, which processes and stores the historical strategies, and a feedback pool, which updates the strategy weights based on the emotional feedback. Furthermore, the emotion-aware strategy representation and the contextual information interact through the designed cross-channel fusion predictor to produce the final strategy prediction.

The contributions can be summarized as follows:

  • We propose CFO-Net to leverage the persuadee's emotional feedback to measure the potential benefit of historical strategies, and to incorporate them into the context with a cross-channel fusion predictor for persuasive strategy recognition.

  • A novel feedback memory module is presented to keep track of the historical strategies and further to obtain the emotion-aware strategy representation in a dynamic and iterative manner.

  • Experiments on the dataset show that CFO-Net is highly competitive against strong baselines and significantly improves the performance of strategy recognition.

2 Related Work

2.1 Non-Collaborative Dialogue

In collaborative dialogue, systems collaborate and communicate with each other to achieve a common goal [8]. A large number of studies [3, 16, 32] have shown remarkable advances in the collaborative setting. However, these approaches are out of scope when applied to non-collaborative settings such as negotiation or persuasion. In the negotiation task, two agents have a conflict of interest but must strategically communicate to reach an agreement, as in a bargaining scenario [9]. In this paper, the main focus is the persuasive scenario, where the persuader tries to induce people to donate [34]. The persuasion strategies are identified as ten categories in [34], which can be divided into two types: 1) persuasive appeal and 2) persuasive inquiry. Specifically, persuasive appeal contains seven strategies (Logical appeal, Emotion appeal, Credibility appeal, Foot-in-the-door, Self-modeling, Personal story and Donation information). For example, personal story refers to the strategy of using narrative examples to state someone's donation experiences or the beneficiaries' positive outcomes, which can encourage others to follow the actions. The remaining three strategies (Task-related inquiry, Personal-related inquiry and Source-related inquiry) belong to persuasive inquiry, which builds better interpersonal relationships by asking questions. For example, source-related inquiry asks whether the persuadee knows about the organization (i.e., the source in our specific donation task).

2.2 Persuasive Dialogue Systems

Persuasive dialogue systems, which have attracted increasing attention, aim to change people's behaviors through persuasive strategies [1, 12, 21, 36]. For instance, [11] proposed a two-tiered annotation scheme to distinguish claims in an online persuasive forum. [10] proposed to predict persuasiveness by modeling argument sequences in social media. [35] designed a hierarchical neural network to identify persuasion strategies. Furthermore, some work focused on the contextual information and modeled the utterances to recognize the strategy. [7] explored and quantified the role of context in different aspects of dialogue for strategy prediction. [4] introduced a transformer-based approach coupled with a Conditional Random Field for strategy recognition. A few works, such as [30] and [6], considered resistance strategies to model strategic conversations. The Hybrid-RCNN [34] extracted sentiment embedding features (pos, neg, neu) but did not include the emotion in the history modeling and ignored the corresponding strategy. To overcome these defects, we present CFO-Net to leverage the emotional feedback to iteratively measure the potential benefits of strategies and incorporate them into the context.

3 Approach

As shown in Fig. 3, the proposed model consists of three components: (a) a hierarchical encoder, which encodes the contextual dependency with multi-head attention to capture the semantic information, (b) a feedback memory module, which models the interaction between the strategy pool and the feedback pool to obtain the emotion-aware strategy representation, and (c) a cross-channel fusion predictor, which fuses the emotional feedback with the contextual information and outputs the final result. Each component is described in the following.

Figure 3: The overview of CFO-Net, which consists of a hierarchical encoder, a feedback memory module and a cross-channel fusion predictor. Green and blue vertical bars denote the utterances of the persuader and the persuadee, respectively. The emotion-aware strategy representation is updated iteratively based on the strategy pool and the feedback pool.

3.1 Hierarchical Encoder

The hierarchical encoder uses a Bi-directional LSTM (BiLSTM) [13] or a BERT-style encoder [5, 18, 17], which captures the temporal features within the words. Then, multi-head attention explores the semantic information at different granularities.

3.1.1 Utterance Encoder with BiLSTM

The utterance encoder vectorizes an input utterance. Given a historical conversation $C=(u_1, u_2, \dots, u_N)$, a set of $N$ utterances, each utterance $u_i=(x_{i,1}, x_{i,2}, \dots, x_{i,T})$ consists of a sequence of $T$ words, and $u_N$ indicates the utterance of the persuader, which is used to predict the persuasion strategy. A BiLSTM is utilized to encode each word $x_{i,t}$ in the utterance $u_i \in C$, leading to a series of context-aware hidden states $(\textbf{h}_{i,1}, \textbf{h}_{i,2}, \dots, \textbf{h}_{i,T})$, where $\textbf{h}_{i,t}=\mathrm{concat}[\overrightarrow{\textbf{h}}_{i,t}; \overleftarrow{\textbf{h}}_{i,t}]$.

The last hidden state $\textbf{h}_{i,T}$ is taken as the utterance-level representation. (Note: the representation of the [CLS] token is used as the utterance-level representation in BERT-style encoders.) Therefore, the set of $N$ utterances in $C$ can be represented as $\textbf{H}=(\textbf{h}_{1,T}, \textbf{h}_{2,T}, \dots, \textbf{h}_{N,T})$.
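To make the encoder concrete, the following is a minimal PyTorch sketch of the BiLSTM utterance encoder described above; the vocabulary size and hidden dimensions are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Minimal sketch of the BiLSTM utterance encoder (sizes are illustrative)."""
    def __init__(self, vocab_size=30000, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, utterance_ids):
        # utterance_ids: (batch, T) word indices of one utterance u_i
        emb = self.embedding(utterance_ids)   # (batch, T, emb_dim)
        states, _ = self.bilstm(emb)          # (batch, T, 2*hidden_dim); h_{i,t} = [forward; backward]
        return states[:, -1, :]               # last hidden state h_{i,T} as the utterance-level vector
```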

3.1.2 Utterance-level Multi-head Attention

To explore the semantic information at different granularities, multi-head attention [31] is adopted, as shown in Eq. (1), where $\textbf{c}_i$ indicates the representation of the $i$-th utterance:

\textbf{c}_{i} = \mathrm{Multi\text{-}head\ Attention}(\textbf{h}_{i,T})    (1)
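As a rough sketch of Eq. (1), multi-head attention can be applied over the sequence of utterance-level vectors $\textbf{H}$; the snippet below assumes self-attention with nn.MultiheadAttention and uses illustrative dimensions.

```python
import torch
import torch.nn as nn

# Sketch of the utterance-level multi-head attention (Eq. 1), assuming
# self-attention over H = (h_{1,T}, ..., h_{N,T}); sizes are illustrative.
d_model, n_heads, N = 512, 8, 12
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

H = torch.randn(1, N, d_model)   # (batch, N utterances, d_model)
C, _ = mha(H, H, H)              # c_i attends over all utterance vectors
print(C.shape)                   # torch.Size([1, 12, 512])
```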

3.2 Feedback Memory Module

The proposed feedback memory module is composed of three novel factors. (i) Strategy Embedding represents the features of strategies and is continuously updated to capture persuasive strategy features. (ii) Strategy Pool temporarily processes and stores all the possible historical strategies for future reference. (iii) Feedback Pool considers the emotional feedback of the persuadee to measure the potential benefits of strategies and updates the strategy weight $\gamma$. Finally, the strategy pool and the feedback pool interact to obtain the emotion-aware strategy representation for later strategy recognition.

3.2.1 Strategy Embedding

In the feedback memory module, a randomly initialized strategy embedding $\textbf{S}\in\mathbb{R}^{L\times d}$ is defined to represent the strategy features, where $L$ is the number of strategy labels and $d$ indicates the dimension. The strategy embedding is continuously updated to capture persuasive strategy features. Specifically, CFO-Net selects the appropriate strategies (i.e., the top-$k$) from the strategy embedding based on the context and stores them into the strategy pool with the context-aware softmax function shown in Eq. (2).

Figure 4: The two-stream mask mechanisms defined in the feedback memory module.

3.2.2 Strategy Pool

The strategy pool aims to process and store the possible historical strategies for future reference. As shown in Fig. 4, to achieve the selection of strategies and prevent gradient truncation, two-stream mask mechanisms are defined as follows:

  • $mask_p$: The selected strategies (i.e., the top-$k$) are stored into the strategy pool to reserve the possible historical strategies (here, the pool size is set to 10).

  • $mask_f$: The best strategy of the current moment is stored into the feedback pool.

Specifically, the module first outputs a probability distribution $\alpha$ over the strategies based on the contextual information, as:

\alpha = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N}]))    (2)

Then, $mask_p$ is obtained from $\alpha$ with the top-$k$ operation, where $k$ is a hyper-parameter, and $mask_f$ is obtained when $k=1$. The strategies $\textbf{S}_p$, which contain multiple possible strategies, are stored into the strategy pool, as:

\textbf{S}_{p} = \textbf{S} \odot (mask_{p} \otimes \mathbf{e}_{d})    (3)

where $\odot$ is element-wise multiplication and $(\cdot\otimes\mathbf{e}_{d})$ produces a matrix by repeating the vector on the left $d$ times [33].

The strategies $\textbf{S}_m$ in the strategy pool are obtained by concatenating the stored strategies $\textbf{S}_p$. Similarly, the strategy $\textbf{S}_f$ stored into the feedback pool is formulated as:

\textbf{S}_{f} = \textbf{S} \odot (mask_{f} \otimes \mathbf{e}_{d})    (4)
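The following sketch illustrates how the two-stream masks and the pooled strategies of Eqs. (2)-(4) could be computed; the MLP logits, the number of labels $L$, and the selected indices are assumptions for illustration only, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of the two-stream masks and the strategy/feedback pools (Eqs. 2-4).
L, d, k = 11, 512, 2                     # illustrative sizes
S = torch.randn(L, d)                    # strategy embedding S
logits = torch.randn(L)                  # assumed output of MLP([c_1; ...; c_N])

alpha = F.softmax(logits, dim=-1)        # Eq. (2): context-based strategy distribution
mask_p = torch.zeros(L)
mask_p[alpha.topk(k).indices] = 1.0      # top-k strategies kept for the strategy pool
mask_f = torch.zeros(L)
mask_f[alpha.argmax()] = 1.0             # best strategy routed to the feedback pool

S_p = S * mask_p.unsqueeze(-1)           # Eq. (3): strategies stored in the strategy pool
S_f = S * mask_f.unsqueeze(-1)           # Eq. (4): strategy stored in the feedback pool
```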

3.2.3 Feedback Pool

The purpose of the feedback pool is to dynamically update the strategy weight $\gamma$ to record the emotional feedback of the persuadee towards the strategy. The tuple {strategy, emotion} stored in the pool is used to calculate the strategy weight $\gamma\in\mathbb{R}^{L}$, which in turn yields the subsequent emotion-aware strategy representation. Firstly, the utterance representations $\textbf{c}_i$ are used to predict the emotional label $\textbf{y}^{e}\in\{pos, neu, neg\}$ of the persuadee, as:

\textbf{y}^{e} = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N-1}]))    (5)

where $\textbf{c}_{N-1}$ indicates the $(N-1)^{th}$ utterance, spoken by the persuadee.

Figure 5: The three different cross-channel fusion mechanisms: (a) MLP [20], (b) double-head linear layer and (c) co-interactive attention layer.

Then, the weight $\gamma$ is assigned based on the score of the predicted emotion and the stream $mask_f$. To enhance the generalization of the model, the soft weight $\gamma\in\mathbb{R}^{L}$ (initialized as an all-one vector) is defined as:

\gamma_{i}=\begin{cases}\gamma_{i}+mask_{f}\cdot\mu\exp^{-\zeta} & \mathrm{if\ pos};\\ \gamma_{i} & \mathrm{if\ neu};\\ \gamma_{i}-mask_{f}\cdot\mu\exp^{-\zeta} & \mathrm{if\ neg};\end{cases}    (6)

where the scalar parameter $\mu$ controls the proportion of $\exp^{-\zeta}$, which is guaranteed to be greater than zero. In the first condition, the weight $\gamma$ increases more when $\zeta$ becomes smaller. To this end, we intuitively set the confidence factor $\zeta$ to depend on the score of the emotion $\textbf{y}^{e}$, as:

\zeta = (1 - y^{e}_{x})    (7)

where $y^{e}_{x}$ is a scalar indicating the score of the emotion $x\in\{pos, neu, neg\}$. Finally, the emotion-aware strategy representation $\textbf{S}^{\prime}$ is modeled as:

\textbf{S}^{\prime} = \gamma \cdot \textbf{S}_{m}    (8)
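A minimal sketch of the feedback-pool update in Eqs. (5)-(8) is given below; the emotion scores, the index stored in the feedback pool, and the value of $\mu$ are illustrative assumptions rather than details from the paper.

```python
import torch

# Sketch of the feedback-pool weight update (Eqs. 5-8); values are illustrative.
L, d, mu = 11, 512, 0.5
gamma = torch.ones(L)                               # soft weight, all-one at the start

emotion_probs = torch.tensor([0.7, 0.2, 0.1])       # assumed softmax scores for {pos, neu, neg}
label = ["pos", "neu", "neg"][emotion_probs.argmax().item()]
zeta = 1.0 - emotion_probs.max()                    # Eq. (7): confidence factor
delta = mu * torch.exp(-zeta)                       # update magnitude, always positive

mask_f = torch.zeros(L)
mask_f[3] = 1.0                                     # strategy currently in the feedback pool (assumed index)
if label == "pos":
    gamma = gamma + mask_f * delta                  # Eq. (6): reward the strategy
elif label == "neg":
    gamma = gamma - mask_f * delta                  # Eq. (6): penalise the strategy
# a neutral emotion leaves gamma unchanged

S_m = torch.randn(L, d)                             # strategies from the strategy pool (assumed)
S_prime = gamma.unsqueeze(-1) * S_m                 # Eq. (8): emotion-aware strategy representation
```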

3.3 Cross-channel Fusion Predictor

In this section, the predictor recognizes the strategy. Three main types of fusion mechanisms are designed for horizontal comparison in Fig. 5. The mechanisms are introduced to fully interact the psychological feedback with the contextual dialogue information, and the predictor outputs the fused distribution, which captures the profound relationships between the two sources.

3.3.1 Multi-layer Perceptron

An MLP can obtain the integrated representation automatically in a simple fashion, as:

\textbf{g} = \mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N};\textbf{s}_{1}^{\prime};\dots;\textbf{s}_{L}^{\prime}])    (9)

The predicted distribution of the strategy $\textbf{y}^{s}$ can be defined as follows:

\textbf{y}^{s} = \mathrm{softmax}(\textbf{W}^{s}\textbf{g}+\textbf{b}_{s})    (10)

where $\textbf{W}^{s}\in\mathbb{R}^{L\times 2d}$ is a transformation matrix, $\textbf{b}_{s}\in\mathbb{R}^{L}$ is the bias vector, and $L$ is the number of labels.
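A minimal sketch of this MLP fusion (Eqs. 9-10) follows; the dimensions and the single hidden layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the MLP fusion predictor (Eqs. 9-10); sizes are illustrative.
N, L, d = 12, 11, 512
C = torch.randn(N, d)                  # c_1, ..., c_N
S_prime = torch.randn(L, d)            # s'_1, ..., s'_L

fusion_mlp = nn.Sequential(nn.Linear((N + L) * d, 2 * d), nn.ReLU())
classifier = nn.Linear(2 * d, L)       # W^s in R^{L x 2d}, b_s in R^L

g = fusion_mlp(torch.cat([C.flatten(), S_prime.flatten()]))  # Eq. (9)
y_s = torch.softmax(classifier(g), dim=-1)                   # Eq. (10)
```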

3.3.2 Double-head Linear Layer

To fuse two probability distributions, a double-head linear layer is designed for prediction. Specifically, we introduce two MLPs to calculate the respective probabilities and then combine them, as:

\textbf{y}^{s}_{1} = \mathrm{softmax}(\mathrm{MLP}([\textbf{c}_{1};\dots;\textbf{c}_{N}]))    (11)
\textbf{y}^{s}_{2} = \mathrm{softmax}(\mathrm{MLP}([\textbf{s}_{1}^{\prime};\dots;\textbf{s}_{L}^{\prime}]))    (12)
\textbf{y}^{s} = \mathrm{softmax}(\textbf{y}^{s}_{1}+\textbf{y}^{s}_{2})    (13)

where $\textbf{y}^{s}_{1}\in\mathbb{R}^{L}$ and $\textbf{y}^{s}_{2}\in\mathbb{R}^{L}$, and $\textbf{y}^{s}$ is the final predicted distribution of the strategy.
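A rough sketch of the double-head linear layer (Eqs. 11-13) is shown below; single linear layers stand in for the two MLPs and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the double-head linear layer (Eqs. 11-13).
N, L, d = 12, 11, 512
C = torch.randn(N, d)
S_prime = torch.randn(L, d)

context_head = nn.Linear(N * d, L)      # head over [c_1; ...; c_N]
strategy_head = nn.Linear(L * d, L)     # head over [s'_1; ...; s'_L]

y1 = torch.softmax(context_head(C.flatten()), dim=-1)         # Eq. (11)
y2 = torch.softmax(strategy_head(S_prime.flatten()), dim=-1)  # Eq. (12)
y_s = torch.softmax(y1 + y2, dim=-1)                          # Eq. (13): fused distribution
```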

3.3.3 Co-interactive Attention Layer

Motivated by the attention mechanism [19, 29, 26], the co-interactive attention layer is proposed to effectively model the mutual relational dependency. In this layer, attention is computed in two directions: from $\textbf{C}=(\textbf{c}_{1},\dots,\textbf{c}_{N})$ to $\textbf{S}^{\prime}=(\textbf{s}_{1}^{\prime},\dots,\textbf{s}_{L}^{\prime})$ as well as from $\textbf{S}^{\prime}$ to $\textbf{C}$.

Specifically, the layer first yields a shared similarity matrix $\textbf{A}\in\mathbb{R}^{N\times L}$ between $\textbf{C}$ and $\textbf{S}^{\prime}$, where $\textbf{A}_{ij}$ indicates the similarity between the $i$-th context-aware utterance and the $j$-th emotion-aware strategy, as:

\textbf{A}_{ij} = \mathcal{F}(\textbf{C}_{:i}, \textbf{S}^{\prime}_{:j})    (14)

where $\mathcal{F}$ is a dot-product function, $\textbf{C}_{:i}$ is the $i$-th row vector of $\textbf{C}$, and $\textbf{S}^{\prime}_{:j}$ is the $j$-th row vector of $\textbf{S}^{\prime}$.

The attention weights and the attended vectors can be obtained in both directions. Firstly, considering the direction from $\textbf{S}^{\prime}$ to $\textbf{C}$, the attention weight is computed by $\textbf{a}_{i}=\mathrm{softmax}(\textbf{A}_{i:})\in\mathbb{R}^{L}$, and subsequently the context-aware utterance vector is $\tilde{\textbf{C}}_{:i}=\sum_{j}\textbf{a}_{ij}\textbf{S}^{\prime}_{:j}$. Similarly, the attention weight is $\textbf{b}_{j}=\mathrm{softmax}(\textbf{A}_{:j})\in\mathbb{R}^{N}$, and the updated emotion-aware strategy vector is $\tilde{\textbf{S}}^{\prime}_{:j}=\sum_{i}\textbf{b}_{ij}\textbf{C}_{i:}$.

Finally, the context-aware utterance representation and the emotion-aware strategy representation are combined to yield $\textbf{g}$ and $\textbf{y}^{s}$, following Eq. (9) and Eq. (10):

\textbf{y}^{s} = \mathrm{softmax}(\textbf{W}^{s}\textbf{g}+\textbf{b}_{s})    (15)
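The co-interactive attention of Eq. (14) and the two attention directions reduce to a dot-product similarity matrix with row- and column-wise softmax; a minimal sketch with illustrative sizes follows.

```python
import torch

# Sketch of the co-interactive attention layer (Eq. 14 and both directions).
N, L, d = 12, 11, 512
C = torch.randn(N, d)                        # context-aware utterance vectors
S_prime = torch.randn(L, d)                  # emotion-aware strategy vectors

A = C @ S_prime.T                            # Eq. (14): dot-product similarity, (N, L)
C_tilde = torch.softmax(A, dim=1) @ S_prime  # S' -> C direction: attended utterances, (N, d)
S_tilde = torch.softmax(A, dim=0).T @ C      # C -> S' direction: attended strategies, (L, d)
# C_tilde and S_tilde are then combined as in Eqs. (9)-(10) to obtain y^s (Eq. 15).
```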

3.4 Training

The objective of strategy and emotion prediction can be formulated as:

\mathcal{L}_{s} = -\sum_{i=1}^{D}\hat{\textbf{y}}_{i}^{s}\log(\textbf{y}_{i}^{s})    (16)
\mathcal{L}_{e} = -\sum_{i=1}^{D}\hat{\textbf{y}}_{i}^{e}\log(\textbf{y}_{i}^{e})    (17)

where $D$ is the number of training examples, and $\hat{\textbf{y}}_{i}^{s}$ and $\hat{\textbf{y}}_{i}^{e}$ are the gold strategy label and sentiment label, respectively. The joint objective function $\mathcal{L}_{\theta}$ is formulated with the hyper-parameters $\beta$ as $\mathcal{L}_{\theta}=\beta_{1}\mathcal{L}_{s}+\beta_{2}\mathcal{L}_{e}$.
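Since both objectives are standard cross-entropy terms, the joint loss can be written as in the sketch below; the batch of logits and gold labels is synthetic and for illustration only.

```python
import torch
import torch.nn.functional as F

# Sketch of the joint objective L_theta = beta1 * L_s + beta2 * L_e (Eqs. 16-17).
beta1, beta2 = 0.5, 0.5
strategy_logits = torch.randn(8, 11, requires_grad=True)  # (batch, L strategy labels)
emotion_logits = torch.randn(8, 3, requires_grad=True)    # (batch, {pos, neu, neg})
strategy_gold = torch.randint(0, 11, (8,))
emotion_gold = torch.randint(0, 3, (8,))

loss_s = F.cross_entropy(strategy_logits, strategy_gold)  # Eq. (16)
loss_e = F.cross_entropy(emotion_logits, emotion_gold)    # Eq. (17)
loss = beta1 * loss_s + beta2 * loss_e
loss.backward()
```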

4 Experiments

4.1 Experimental Setting

Dataset & Evaluation Metric Considering that no other dataset provides emotional scores, we focus on PersuasionForGood [34] (the data are available at https://gitlab.com/ucdavisnlp/persuasionforgood), whose sentiment labels can be obtained from the manually annotated scores. The persuader strategies are identified as ten categories (detailed in Section 2) plus one none category. As for the evaluation metrics, Precision, Recall, and Macro F1 (M-F1) are used for strategy recognition and emotion prediction, as the dataset is highly imbalanced [4].

Implementation details The BERT-style baselines use the same hyper-parameters as given in their papers [5, 18]. The Adam optimizer [15] is used for training, with the initial learning rate chosen from {2e-5, 4e-5, 6e-5, 8e-5} and the mini-batch size from {32, 64}. The number of training epochs is chosen from {3, 5, 7, 9}. The scalar parameter $\mu$ is chosen from {0.2, 0.5}, and $k$ is set to 2 based on the parameter analysis. The historical strategies and emotions are preprocessed into the two pools. To coordinate the joint training of the two objectives, we set $\beta_{1}=\beta_{2}=0.5$. A Tesla V-100 GPU and PyTorch [24] are used to implement our experiments.
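For reference, the search space described above can be summarised as a simple configuration dictionary; the variable names are ours, while the values are taken from the text.

```python
# Hyper-parameter search space summarised from the text (names are ours).
search_space = {
    "learning_rate": [2e-5, 4e-5, 6e-5, 8e-5],
    "batch_size": [32, 64],
    "epochs": [3, 5, 7, 9],
    "mu": [0.2, 0.5],
    "top_k": 2,          # fixed based on the parameter analysis
    "beta1": 0.5,
    "beta2": 0.5,
}
```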

Model | Strategy Recognition: P↑ | R↑ | M-F1↑ | Emotion Prediction: P↑ | R↑ | M-F1↑
Hybrid RCNN + All features [34] | 62.17* | 59.80* | 58.76* | - | - | -
RoBERTa-large LogReg [4] | 64.88* | 68.32* | 63.15* | - | - | -
RoBERTa-large cLSTM [4] | / | / | 64.10 | - | - | -
RoBERTa-large DialogueRNN [4] | / | / | 64.30 | - | - | -
RoBERTa-base [18] | 59.58 | 64.39 | 58.35 | 53.21 | 72.05 | 60.41
CFO-Net-base | 63.29 | 67.74 | 62.41 | 53.08 | 75.22 | 61.94
RoBERTa-large [18] | 62.69 | 69.91 | 61.74 | 55.49 | 71.30 | 62.11
CFO-Net-large | 66.81 | 72.28 | 65.41 | 58.11 | 75.88 | 63.91
Table 1: Experiments on PersuasionForGood for strategy recognition and emotion prediction. "-" indicates that the baseline does not take emotional feedback into account, so the result is unavailable. "*" indicates experiments implemented by ourselves.

4.2 Experimental Results

4.2.1 Baselines

State-of-the-art models are used as baselines to test the performance. Considering the advantages of pre-trained language models (PLMs), we replace the BiLSTM with RoBERTa [18] to strengthen the baseline for a fair comparison, following [4]. Both base and large PLMs are used in the main experiments for a complete comparison; to increase training speed, the base PLMs are used in the other experiments. The other baselines are shown in Table 1: [34] used a hybrid RCNN model to extract textual features, and [4] combined PLMs with several state-of-the-art models to recognize the strategy of the persuader.

4.2.2 Main Results

As depicted in Table 1, compared with the state-of-the-art models and RoBERTa, our CFO-Net (with the double-head linear layer) improves considerably. CFO-Net achieves a 4.12% gain in Precision, a 2.37% gain in Recall and a 3.67% gain in M-F1 over RoBERTa-large, which demonstrates that the psychological feedback of the persuadee is beneficial for strategy recognition. The M-F1 also surpasses that of RoBERTa DialogueRNN, where four tasks are jointly trained, which shows that CFO-Net can achieve better performance with fewer tasks. As for the emotion prediction task, CFO-Net also improves the performance, which shows that jointly training the tasks provides mutual benefits. This phenomenon illustrates that the emotional feedback of the persuadee has the potential to help the strategy recognition task. Our code will be released at https://github.com/pengwei-iie/CFONETWORK.

4.3 Ablation Study

To get a better insight into the components of CFO-Net, an ablation study is performed in Table 2. The experiments demonstrate that each component is beneficial to the final results. Note that by removing the feedback memory module, configuration (1) reduces to the RoBERTa-base model.

4.3.1 w/o Feedback Memory Module

In this setting, the feedback memory module is abandoned for exploring the effectiveness of the psychological feedback. From the result, the performance has declined significantly in all metrics, which confirms our hypothesis that introducing the emotion of the persuadee to the strategy recognition is important.

Model | Strategy Recognition: P↑ | R↑ | M-F1↑ | Δ(M-F1) | Emotion Prediction: P↑ | R↑ | M-F1↑ | Δ(M-F1)
CFO-Net + RoBERTa-base | 63.29 | 67.74 | 62.41 | - | 53.08 | 75.22 | 61.94 | -
(1) w/o feedback memory module | 59.58 | 64.39 | 58.35 | -4.06 | 53.21 | 72.05 | 60.41 | -1.54
(2) w/o multi-task learning | 59.04 | 65.17 | 58.50 | -3.91 | 53.06 | 72.62 | 60.52 | -1.42
(3) w/o cross-channel fusion | 62.44 | 66.38 | 60.53 | -1.88 | 53.12 | 72.97 | 60.84 | -1.10
Table 2: The results of the ablation study on model components.

4.3.2 w/o Multi-task Learning

Multi-task learning considers the mutual connection between tasks by sharing latent representations. Here, the emotion prediction task is removed to observe the performance of strategy recognition. As shown in Table 2, multi-task learning, i.e., jointly training with $\mathcal{L}_{s}$ and $\mathcal{L}_{e}$, provides benefits, which shows that the two training objectives are closely related and boost each other.

4.3.3 w/o Cross-channel Fusion

The cross-channel fusion combines the persuader-aware contextual dependency with the persuadee-aware emotional dependency. In this setting, these representations are concatenated directly to make a prediction. The results indicate that the fusion mechanisms contribute to the overall performance.

4.4 Performances on the Fusion Mechanism

The fusion mechanism is adopted to exploit the two types of interaction, namely the persuader-aware contextual dependency and the persuadee-aware emotional dependency. To further investigate the effectiveness of these mechanisms, experiments are conducted from two perspectives, as shown in Fig. 6 and Fig. 7: one is the comparison between the three fusion methods and the baselines, and the other is the horizontal comparison among the fusion mechanisms.

Figure 6: The performances of the fusion mechanisms. (a), (b), (c) show the results of the baseline versus the MLP, the double-head linear layer and the co-interactive attention layer, respectively.
Figure 7: The performances and comparisons of the three different fusion mechanisms.

As shown in Fig. 6, the results show that the fusion mechanisms incorporating the persuadee-aware emotional dependency into the persuader-aware contextual dependency bring consistent improvements and surpass the baselines on all evaluation metrics. In addition, Fig. 7 presents the performances of the different fusion mechanisms, in which the double-head linear layer performs best, with the M-F1 score reaching 62.41%. Surprisingly, the co-interactive attention layer underperforms the double-head linear layer. This could be attributed to the fact that the strategy representation and the utterance-level dialogue information belong to different levels of abstract semantic information, so the co-attention operation introduces noise.

4.5 Parameter Analysis

In the feedback memory module, $k$ is a key hyper-parameter. As shown in Table 3, the model introduces more noise when $k$ is set too large, and the confidence score becomes lower, leading to worse performance. On the contrary, the enriched semantic representations of the strategies are ignored when $k$ is set to one: although the confidence score is higher, the performance is not the best. The analysis validates that an appropriate $k$ is crucial to the experimental results.

Top-$k$ | Top-1 | Top-2 | Top-3 | Top-4
M-F1 | 60.68 | 62.41 | 60.42 | 58.68
Confidence Score | 0.877 | 0.473 | 0.326 | 0.242
Table 3: Performance with respect to the hyper-parameter $k$. The confidence score indicates the $k^{th}$ average predicted probability.

4.6 Case Study

A case study is conducted with the example in Fig. 8 to demonstrate how CFO-Net works when recognizing a strategy. We list the possible strategies, the states of the strategy pool and the feedback pool, and the updated weights. In this case, two possible strategies are selected into the strategy pool at a time. Then, the predicted emotion and the strategy with the highest score are stored into the feedback pool as a tuple, such as <Personal-related inquiry, A>, where A represents positive, neutral or negative. Finally, the weights $\gamma$ are calculated with Eq. (6). During the conversation, the strategy recognition depends not only on the contextual dialogue information but also on the emotional feedback of the persuadee. The weights are utilized to compute the emotion-aware strategy representation for the final prediction. In the third turn, CFO-Net outputs the correct prediction emotion appeal rather than personal-related inquiry, which has the highest score calculated from the contextual dialogue information alone; this indicates that incorporating the emotion-aware strategy representation into the contextual dialogue information is of great importance.

Figure 8: An example illustrating the process of the novel feedback memory module. The red markers indicate the changes.

5 Conclusion

This paper concentrates on incorporating the psychological feedback (the emotion of the persuadee) into the recognition of strategies in persuasive dialogue. We propose a novel Cross-channel Feedback memOry Network (CFO-Net), with a feedback memory module and three different cross-channel fusion mechanisms, to model and explore the historical emotional feedback of the persuadee. Experimental results and analysis demonstrate that CFO-Net is highly competitive against strong baselines and significantly improves the performance of strategy recognition. In future work, other categories of psychological feedback, such as personal character and educational background, will be considered with a BiLSTM-CRF. These cognitive factors remain worth researching for persuasive dialogue systems. Furthermore, dialogue systems can utilize the predicted historical strategy chains to guide the dialogue generation task.

6 Acknowledgment

We thank all anonymous reviewers for their constructive comments, which we have incorporated into this revision. This work is supported by the National Natural Science Foundation of China (No. U21B2009).

References

  • [1] André, E., Rist, T., Van Mulken, S., Klesen, M., Baldes, S.: The automated design of believable dialogues for animated presentation teams. Embodied conversational agents pp. 220–255 (2000)
  • [2] Baron-Cohen, S., Wheelwright, S.: The empathy quotient: An investigation of adults with asperger syndrome or high functioning autism, and normal sex differences. Journal of Autism and Developmental Disorders 34, 163–175 (2004)
  • [3] Bowden, K.K., Oraby, S., Wu, J., Misra, A., Walker, M.A.: Combining search with structured data to create a more engaging user experience in open domain dialogue. CoRR abs/1709.05411 (2017), http://arxiv.org/abs/1709.05411
  • [4] Chen, H., Ghosal, D., Majumder, N., Hussain, A., Poria, S.: Persuasive dialogue understanding: The baselines and negative results. Neurocomputing 431, 47–56 (2021). https://doi.org/10.1016/j.neucom.2020.11.040, https://doi.org/10.1016/j.neucom.2020.11.040
  • [5] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. pp. 4171–4186 (2019)
  • [6] Dutt, R., Sinha, S., Joshi, R., Chakraborty, S.S., Riggs, M., Yan, X., Bao, H., Rosé, C.P.: RESPER: computationally modelling resisting strategies in persuasive conversations. CoRR abs/2101.10545 (2021), https://arxiv.org/abs/2101.10545
  • [7] Ghosal, D., Majumder, N., Mihalcea, R., Poria, S.: Utterance-level dialogue understanding: An empirical study. CoRR abs/2009.13902 (2020), https://arxiv.org/abs/2009.13902
  • [8] He, H., Balakrishnan, A., Eric, M., Liang, P.: Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In: Barzilay, R., Kan, M. (eds.) ACL. pp. 1766–1776. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1162, https://doi.org/10.18653/v1/P17-1162
  • [9] He, H., Chen, D., Balakrishnan, A., Liang, P.: Decoupling strategy and generation in negotiation dialogues. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (eds.) EMNLP. pp. 2333–2343. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/d18-1256, https://doi.org/10.18653/v1/d18-1256
  • [10] Hidey, C., McKeown, K.R.: Persuasive influence detection: The role of argument sequencing. In: McIlraith, S.A., Weinberger, K.Q. (eds.) AAAI. pp. 5173–5180. AAAI Press (2018), https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17077
  • [11] Hidey, C., Musi, E., Hwang, A., Muresan, S., McKeown, K.: Analyzing the semantic types of claims and premises in an online persuasive forum. In: Habernal, I., Gurevych, I., Ashley, K.D., Cardie, C., Green, N., Litman, D.J., Petasis, G., Reed, C., Slonim, N., Walker, V.R. (eds.) Proceedings of the 4th Workshop on Argument Mining, ArgMining@EMNLP 2017, Copenhagen, Denmark, September 8, 2017. pp. 11–21. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/w17-5102, https://doi.org/10.18653/v1/w17-5102
  • [12] Hilf, B.: Book reviews - the persuasive power of computers (A review of persuasive technology: Using computers to change what we think and do by B.J. fogg). IEEE Distributed Syst. Online 4(6) (2003)
  • [13] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9, 1735–1780 (1997)
  • [14] Iyer, R.R., Sycara, K.P.: An unsupervised domain-independent framework for automated detection of persuasion tactics in text. CoRR abs/1912.06745 (2019), http://arxiv.org/abs/1912.06745
  • [15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [16] Larionov, G., Kaden, Z., Dureddy, H.V., Kalejaiye, G.B.T., Kale, M., Potharaju, S.P., Shah, A.P., Rudnicky, A.I.: Tartan: A retrieval-based socialbot powered by a dynamic finite-state machine architecture. arXiv preprint arXiv:1812.01260 (2018)
  • [17] Li, Y., Cao, J., Cong, X., Zhang, Z., Yu, B., Zhu, H., Liu, T.: Enhancing chinese pre-trained language model via heterogeneous linguistics graph. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. pp. 1986–1996. Association for Computational Linguistics (2022), https://aclanthology.org/2022.acl-long.140
  • [18] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)
  • [19] Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., Marton, Y. (eds.) EMNLP. pp. 1412–1421. The Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/d15-1166, https://doi.org/10.18653/v1/d15-1166
  • [20] Nguyen, D., Okatani, T.: Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR. pp. 6087–6096. IEEE Computer Society (2018). https://doi.org/10.1109/CVPR.2018.00637
  • [21] Oinas-Kukkonen, H., Harjumaa, M.: Towards deeper understanding of persuasion in software and information systems. First International Conference on Advances in Computer-Human Interaction pp. 200–205 (2008)
  • [22] Pamungkas, E.W.: Emotionally-aware chatbots: A survey. CoRR abs/1906.09774 (2019), http://arxiv.org/abs/1906.09774
  • [23] Partala, T., Surakka, V.: The effects of affective interventions in human-computer interaction. Interact. Comput. 16(2), 295–309 (2004). https://doi.org/10.1016/j.intcom.2003.12.001, https://doi.org/10.1016/j.intcom.2003.12.001
  • [24] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., Devito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
  • [25] Peng, W., Hu, Y., Xing, L., Xie, Y., Sun, Y., Li, Y.: Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. CoRR abs/2204.12749 (2022). https://doi.org/10.48550/arXiv.2204.12749, https://doi.org/10.48550/arXiv.2204.12749
  • [26] Peng, W., Hu, Y., Yu, J., Xing, L., Xie, Y.: Aper: Adaptive evidence-driven reasoning network for machine reading comprehension with unanswerable questions. Knowledge-Based Systems 229, 107364 (2021)
  • [27] Prendinger, H., Ishizuka, M.: The empathic companion: A character-based interface that addresses users’ affective states. Appl. Artif. Intell. 19(3-4), 267–285 (2005). https://doi.org/10.1080/08839510590910174, https://doi.org/10.1080/08839510590910174
  • [28] Scott, J.: Understanding contemporary society: Theories of the present - rational choice theory- complexity theory (2000)
  • [29] Seo, M.J., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. In: ICLR (2017), https://openreview.net/forum?id=HJ0UKP9ge
  • [30] Tian, Y., Shi, W., Li, C., Yu, Z.: Understanding user resistance strategies in persuasive conversations. In: Cohn, T., He, Y., Liu, Y. (eds.) EMNLP. pp. 4794–4798. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.431, https://doi.org/10.18653/v1/2020.findings-emnlp.431
  • [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 5998–6008 (2017), https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  • [32] Wang, Q., Cao, Y., Jiang, J., Wang, Y., Tong, L., Guo, L.: Incorporating specific knowledge into end-to-end task-oriented dialogue systems. 2021 International Joint Conference on Neural Networks (IJCNN) pp. 1–8 (2021)
  • [33] Wang, S., Jiang, J.: Machine comprehension using match-lstm and answer pointer. In: ICLR (2017), https://openreview.net/forum?id=B1-q5Pqxl
  • [34] Wang, X., Shi, W., Kim, R., Oh, Y., Yang, S., Zhang, J., Yu, Z.: Persuasion for good: Towards a personalized persuasive dialogue system for social good. In: Korhonen, A., Traum, D.R., Màrquez, L. (eds.) ACL. pp. 5635–5649. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/p19-1566, https://doi.org/10.18653/v1/p19-1566
  • [35] Yang, D., Chen, J., Yang, Z., Jurafsky, D., Hovy, E.H.: Let’s make your request more persuasive: Modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms. In: Burstein, J., Doran, C., Solorio, T. (eds.) NAACL-HLT. pp. 3620–3630. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1364, https://doi.org/10.18653/v1/n19-1364
  • [36] Yuan, T., Moore, D.J., Grierson, A.: A human-computer dialogue system for educational debate: A computational dialectics approach. Int. J. Artif. Intell. Educ. 18(1), 3–26 (2008), http://content.iospress.com/articles/international-journal-of-artificial-intelligence-in-education/jai18-1-02