Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition
Abstract
Despite the recent achievements made in the multi-modal emotion recognition task, two problems remain insufficiently investigated: 1) the relationships between different emotion categories are not utilized, which leads to sub-optimal performance; and 2) current models fail to cope well with low-resource emotions, especially unseen ones. In this paper, we propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues. We use pre-trained word embeddings to represent emotion categories for textual data. Then, two mapping functions are learned to transfer these embeddings into the visual and acoustic spaces. For each modality, the model calculates the representation distance between the input sequence and the target emotions and makes predictions based on these distances. By doing so, our model can directly adapt to unseen emotions in any modality, since we have their pre-trained embeddings and modality mapping functions. Experiments show that our model achieves state-of-the-art performance on most of the emotion categories. In addition, our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions. Code is available at https://github.com/wenliangdai/Modality-Transferable-MER.
1 Introduction




Multi-modal emotion recognition is an increasingly popular but challenging task. One main challenge is that labelled data is difficult to come by, as humans find it time-consuming to discern emotion categories from speech or video alone. Indeed, we humans express emotions through a combination of modalities, including the way we speak, the words we use, our facial expressions, and sometimes gestures. It is also much easier for humans to understand each other's emotions when they can both hear and see the other person. Multi-modal emotion recognition can, therefore, yield more reliable results than restricting machines to a single modality.
In the past few years, much research has been done to better understand intra-modality and inter-modality dynamics, and modality fusion is a widely studied approach. For example, Zadeh et al. (2017) proposed a tensor fusion network that combines the three modalities from vectors into a tensor using the Cartesian product. In addition, the attention mechanism is commonly used for modality fusion (Zadeh et al., 2018a; Wang et al., 2018; Liang et al., 2018; Hazarika et al., 2018; Pham et al., 2018; Tsai et al., 2019a). Although significant improvements have been made on the multi-modal emotion recognition task, the relationship between emotions has not been well modelled, which can lead to sub-optimal performance. Also, the problem of low-resource multi-modal emotion recognition has not been adequately studied. Multi-modal emotion recognition data is hard to collect and annotate, especially for low-resource emotions (e.g., surprise) that are rarely seen in daily life, which motivates us to investigate this problem.
In this paper, we propose a modality-transferable network with cross-modality emotion embeddings to model the relationship between emotions. Given that emotion embeddings contain semantic information and emotion relations in the vector space, we use them to represent emotion categories and measure the similarity between the representations of the input sentence and the target emotions to make predictions. Concretely, for the textual modality, we use the pre-trained GloVe (Pennington et al., 2014) embeddings of emotion words as the emotion embeddings. As there are no pre-trained emotion embeddings for the visual and acoustic modalities, the model learns two mapping functions, $f_{t \to v}$ and $f_{t \to a}$, to transfer the emotion embeddings from the textual space to the visual and acoustic spaces (Figure 1). Therefore, for each modality, there is a dedicated set of emotion embeddings. The distances computed in all modalities are finally fused, and the model makes predictions based on the fused scores.
Benefiting from this prediction mechanism, our model can easily carry out zero-shot learning (ZSL) to identify unseen emotion categories using the embeddings of those unseen emotions. The intuition is that the pre-trained and projected emotion embeddings form a semantic knowledge space, which is shared by both the seen and unseen classes. Furthermore, with the help of the embedding mapping functions, the model can also perform ZSL on a single modality at inference time. When a few samples from unseen emotions are available, our model can adapt to the new emotions without forgetting the previous ones by using joint training and continual learning (Lopez-Paz and Ranzato, 2017).
Our contributions in this work are three-fold:
- We introduce a simple but effective end-to-end model for the multi-modal emotion recognition task. It learns the relationships between different emotion categories using emotion embeddings.
- To the best of our knowledge, this paper is the first to investigate multi-modal emotion recognition in the low-resource scenario. Our model can directly adapt to an unseen emotion, even if only one modality is available.
- Experimental results show that our model achieves state-of-the-art results on most emotion categories. We also provide a thorough analysis of zero-shot and few-shot learning.

2 Related Works
2.1 Multi-modal Emotion Recognition
Since the early 2010s, multi-modal emotion recognition has drawn more and more attention with the rise of deep learning and its advances in computer vision and natural language processing (Baltrušaitis et al., 2018). Schuller et al. (2011) proposed the first Audio-Visual Emotion Challenge and Workshop (AVEC), which focused on multi-modal emotion analysis for health. In recent years, most work in this area has aimed to find a better modality fusion method. Zadeh et al. (2017) introduced a tensor fusion network that combines the data representations from each modality into a tensor by performing the Cartesian product. In addition, the attention mechanism (Bahdanau et al., 2015) has been widely applied for modality fusion and emphasis (Zadeh et al., 2018a; Pham et al., 2018; Tsai et al., 2019a). Furthermore, Liu et al. (2018) proposed a low-rank architecture to decrease the problem complexity, and Tsai et al. (2019b) introduced a modality re-construction method to generate occasionally missing data in a modality.
Although prior works have made progress on this task, the relationships between emotion categories have not been well modelled, except by Xu et al. (2020), who captured emotion correlations using graph networks for emotion recognition. However, their model is based only on the textual modality. Additionally, previous studies have not put much effort into unseen and low-resource emotion categories, even though data scarcity is an inherent problem of multi-modal emotion data.
2.2 Zero/Few-Shot and Continual Learning
Zero-shot and few-shot learning methods, which address the data scarcity scenario, have been applied to many popular machine learning tasks where zero or only a few training samples are available for the target tasks or domains, such as machine translation (Johnson et al., 2017; Gu et al., 2018), dialogue generation (Zhao and Eskenazi, 2018; Madotto et al., 2019), dialogue state tracking (Liu et al., 2019c; Wu et al., 2019), slot filling (Bapna et al., 2017; Liu et al., 2019b, 2020), and accented speech recognition (Winata et al., 2020). They have also been adopted in multiple cross-lingual tasks, such as named entity recognition (Xie et al., 2018; Ni et al., 2017), part-of-speech tagging (Wisniewski et al., 2014; Huck et al., 2019), and question answering (Liu et al., 2019a; Lewis et al., 2019). Recently, several methods have been proposed for continual learning (Rusu et al., 2016; Kirkpatrick et al., 2017; Lopez-Paz and Ranzato, 2017; Fernando et al., 2017; Lee et al., 2017), and these were applied to some NLP tasks, such as opinion mining (Shu et al., 2016), document classification (Shu et al., 2017), and dialogue state tracking (Wu et al., 2019).
3 Methodology
As shown in Figure 2, our model consists of three parts: intra-modal encoder networks, emotion embedding mapping modules, and an inter-modal fusion module. In this section, we first define the problem, and then we introduce the details of our model.
3.1 Problem Definition
We define the input multi-modal data samples as $\{(X^t_i, X^a_i, X^v_i), Y_i\}_{i=1}^{n}$, in which $n$ denotes the total number of samples, and $X^t_i$, $X^a_i$, $X^v_i$ denote the textual, acoustic, and visual modalities, respectively. For each modality $m \in \{t, a, v\}$, there is a set of emotion embeddings $E^m \in \mathbb{R}^{C \times d_m}$ that represents the semantic meanings of the emotion categories to be recognized. In the textual modality, we have $E^t$, which is taken from the pre-trained GloVe embeddings. In the acoustic and visual modalities, we have $E^a$ and $E^v$, which are mapped from $E^t$ by the mapping functions $f_{t \to a}$ and $f_{t \to v}$. $C$ denotes the number of emotion categories, and it can be changed to fit different tasks and zero-shot learning. $Y_i$ denotes the annotation for multi-label emotion recognition, where each $Y_i$ is a vector of length $C$ with binary values.
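To make the notation concrete, the following minimal sketch (in PyTorch) builds randomly initialized tensors with the shapes implied above; the batch size, sequence length, number of categories, and feature dimensions are illustrative assumptions, not the values used in our experiments.

```python
import torch

# Illustrative sizes only (assumptions, not the real dataset dimensions).
n, L, C = 4, 20, 6                 # samples, sequence length, emotion categories
d_t, d_a, d_v = 300, 200, 100      # assumed emotion-embedding dimensions per modality

X_t = torch.randn(n, L, 300)       # textual sequences (e.g., GloVe vectors per token)
X_a = torch.randn(n, L, 74)        # acoustic sequences (raw feature size is an assumption)
X_v = torch.randn(n, L, 35)        # visual sequences (raw feature size is an assumption)
Y = torch.randint(0, 2, (n, C)).float()  # multi-label targets: one binary value per emotion

E_t = torch.randn(C, d_t)          # stand-in for the pre-trained GloVe emotion embeddings E^t
```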
3.2 Intra-modality Encoder Networks
As shown in Figure 2, for each data sample, there are three sequences of length $L$ from the three modalities. For each modality, we use a bi-directional Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) network as the encoder to process the sequence and obtain a vector representation. In other words, for the $i$-th data sample, we have three vectors, $h^t_i \in \mathbb{R}^{d_t}$, $h^a_i \in \mathbb{R}^{d_a}$, and $h^v_i \in \mathbb{R}^{d_v}$, that represent the textual, acoustic, and visual modalities. Here, $d_t$, $d_a$, and $d_v$ are the dimensions of the emotion embeddings of the textual, acoustic, and visual modalities, respectively.
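As a concrete illustration, here is a minimal sketch of such an encoder in PyTorch. We assume the final forward and backward hidden states are concatenated and linearly projected to the emotion-embedding dimension of that modality; the pooling and projection choices here are ours for illustration, not necessarily those of the released code.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Encodes one modality's sequence into a single vector of size emb_dim."""
    def __init__(self, input_dim, hidden_dim, emb_dim, num_layers=2, dropout=0.15):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        # Project the concatenated last forward/backward states into the embedding space.
        self.proj = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)             # h_n: (2 * num_layers, batch, hidden_dim)
        h_fwd, h_bwd = h_n[-2], h_n[-1]        # last layer's forward / backward states
        return self.proj(torch.cat([h_fwd, h_bwd], dim=-1))  # (batch, emb_dim)

# Example: a textual encoder whose output lives in the 300-d GloVe space.
text_encoder = ModalityEncoder(input_dim=300, hidden_dim=150, emb_dim=300)
h_t = text_encoder(torch.randn(4, 20, 300))    # -> shape (4, 300)
```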
3.3 Modality Mapping Module
As mentioned in Section 1, previous works do not consider the connections between different emotion categories, and the only information about emotions is in the annotations. In our model, we use emotion word embeddings to inject the semantic information of emotions into the model. Additionally, emotion embeddings also encode the relationships between emotion categories. For the textual modality, we use the pre-trained GloVe (Pennington et al., 2014) embeddings of emotion words, denoted as $E^t$. For the other two modalities, because there are no off-the-shelf pre-trained emotion embeddings, our model learns two mapping functions that project the vectors from the textual space into the acoustic and visual spaces:
$$E^a = f_{t \to a}(E^t) \qquad (1)$$
$$E^v = f_{t \to v}(E^t) \qquad (2)$$
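The parameterization of $f_{t \to a}$ and $f_{t \to v}$ is not spelled out here, so the sketch below assumes a simple learnable linear projection per target modality; the target dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EmotionEmbeddingMapper(nn.Module):
    """Maps the textual emotion embeddings E^t into another modality's space."""
    def __init__(self, text_dim=300, target_dim=100):
        super().__init__()
        self.map = nn.Linear(text_dim, target_dim)   # plays the role of f_{t->a} or f_{t->v}

    def forward(self, E_t):        # E_t: (C, text_dim), frozen GloVe emotion vectors
        return self.map(E_t)       # (C, target_dim)

E_t = torch.randn(6, 300)                   # 6 emotion words, 300-d GloVe (illustrative)
f_t2a = EmotionEmbeddingMapper(300, 200)    # acoustic space, assumed 200-d
f_t2v = EmotionEmbeddingMapper(300, 100)    # visual space, assumed 100-d
E_a, E_v = f_t2a(E_t), f_t2v(E_t)           # Eq. (1) and Eq. (2)
```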
3.4 Modality Fusion and Prediction
To predict the emotions for an input sentence, we calculate the similarity scores between the sequence representation and the emotion embeddings for each modality. As shown in Eq. (3), for a data sample $i$, every modality $m$ produces a vector of similarity scores $s^m_i$ of length $C$ via dot-product attention. We further add a modality fusion module that computes a weighted sum of these vectors, in which the weights $w_m$ are also optimized end-to-end (Eq. (4)). Finally, as the datasets are multi-labelled, the sigmoid activation function is applied to each score in the fused vector $s_i$, and a threshold is used to decide whether an emotion is present or not.
$$s^m_i = E^m h^m_i, \quad m \in \{t, a, v\} \qquad (3)$$
$$s_i = \sum_{m \in \{t, a, v\}} w_m \, s^m_i \qquad (4)$$
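A sketch of Eqs. (3) and (4) in PyTorch. We assume the modality weights are a small learnable vector normalized with a softmax; the paper only states that the fusion weights are optimized end-to-end, so this normalization is our choice.

```python
import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    """Dot-product scores per modality, then a learned weighted sum over modalities."""
    def __init__(self, num_modalities=3):
        super().__init__()
        self.fusion_weights = nn.Parameter(torch.ones(num_modalities))

    def forward(self, reps, emotion_embs):
        # reps: list of (batch, d_m) vectors; emotion_embs: list of (C, d_m) matrices.
        scores = [h @ E.t() for h, E in zip(reps, emotion_embs)]   # each (batch, C), Eq. (3)
        w = torch.softmax(self.fusion_weights, dim=0)              # assumed normalization
        fused = sum(w_m * s for w_m, s in zip(w, scores))          # weighted sum, Eq. (4)
        return torch.sigmoid(fused)                                # multi-label probabilities

# Usage: probs > 0.5 (or another threshold) decides whether each emotion is present.
predictor = FusionPredictor()
probs = predictor([torch.randn(4, 300), torch.randn(4, 200), torch.randn(4, 100)],
                  [torch.randn(6, 300), torch.randn(6, 200), torch.randn(6, 100)])
```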
4 Unseen Emotion Prediction
Collecting numerous training samples for a new emotion, especially a low-resource one, is expensive and time-consuming. Therefore, in this section, we concentrate on our model's ability to generalize to an unseen target emotion by considering the scenario where we have zero or only a few training samples of that emotion.
4.1 Zero-Shot Emotion Prediction
Ideally, our model is able to directly adapt to a new emotion based on its embedding. Given the textual emotion embedding $e^t_{new}$ of a new emotion, we can generate its visual and acoustic emotion embeddings $e^v_{new} = f_{t \to v}(e^t_{new})$ and $e^a_{new} = f_{t \to a}(e^t_{new})$ using the already learned mapping functions. After that, the similarity scores between the input sentence and the new emotion can be computed for each modality.
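Conceptually, adding an unseen emotion therefore only requires appending its word vector to $E^t$ and re-projecting it with the trained mappings; no parameters are retrained. A minimal sketch, with stand-ins for the trained components:

```python
import torch
import torch.nn as nn

def extend_emotion_embeddings(E_t, new_word_vec, f_t2a, f_t2v):
    """Append an unseen emotion's GloVe vector, then re-project to the other modalities."""
    E_t_new = torch.cat([E_t, new_word_vec.unsqueeze(0)], dim=0)   # (C + 1, d_t)
    return E_t_new, f_t2a(E_t_new), f_t2v(E_t_new)

# Stand-ins for the trained mapping functions and embeddings (dimensions assumed).
f_t2a, f_t2v = nn.Linear(300, 200), nn.Linear(300, 100)
E_t = torch.randn(5, 300)            # five seen emotions
glove_surprised = torch.randn(300)   # placeholder for the real GloVe vector of 'surprised'
E_t_new, E_a_new, E_v_new = extend_emotion_embeddings(E_t, glove_surprised, f_t2a, f_t2v)
```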
4.2 Few-Shot Emotion Prediction
In this section, we assume that 1% of the positive training samples of a new emotion are available, and to balance the training data, we take the same number of negative training samples for the new emotion. However, the model could lose its ability to predict the original emotions if we simply fine-tune it on the training samples of the new emotion. To cope with this issue, we propose two fine-tuning settings. First, after we obtain the model trained on the source emotions, we jointly train it on the training samples of the source emotions and the new target emotion. Second, we utilize a continual learning method, gradient episodic memory (GEM) (Lopez-Paz and Ranzato, 2017), to prevent catastrophic forgetting of previously learned knowledge. The purpose of using continual learning is that we do not need to retrain with all the data from previously learned emotions, since that data might not be available. We describe the training process of GEM as follows.
We define $\theta^{src}$ as the model's parameters trained on the source emotions, and $\theta$ denotes the parameters currently being optimized on the target emotion data. GEM keeps a small episodic memory $\mathcal{M}$ of $N$ samples from the source emotions, and a constraint is applied on the gradient to prevent the loss on the stored samples from increasing when the model learns the new target emotion. The training process can be formulated as
$$\min_{\theta} \; \ell(f_{\theta}; \mathcal{D}_{target}) \quad \text{s.t.} \quad \ell(f_{\theta}; \mathcal{M}) \le \ell(f_{\theta^{src}}; \mathcal{M}),$$
where $\ell(f_{\theta}; \mathcal{M})$ is the loss value on the $N$ stored samples.
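For illustration, the sketch below implements a simplified, single-memory variant of the GEM update: the gradient on the new emotion's batch is projected so that it does not increase the loss on the stored source-emotion samples. The function names, batching, and the assumption that every trainable parameter receives a gradient are ours; see Lopez-Paz and Ranzato (2017) for the full algorithm.

```python
import torch

def gem_step(model, loss_fn, new_batch, memory_batch, optimizer):
    """One GEM-style update, simplified to a single memory constraint."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss on the episodic memory of source-emotion samples.
    optimizer.zero_grad()
    loss_fn(model, memory_batch).backward()
    g_mem = torch.cat([p.grad.flatten() for p in params])

    # Gradient of the loss on the new target-emotion batch.
    optimizer.zero_grad()
    loss_fn(model, new_batch).backward()
    g_new = torch.cat([p.grad.flatten() for p in params])

    # If the proposed update would increase the memory loss (negative dot product),
    # project it onto the closest gradient that does not violate the constraint.
    dot = torch.dot(g_new, g_mem)
    if dot < 0:
        g_new = g_new - (dot / torch.dot(g_mem, g_mem)) * g_mem

    # Write the (possibly projected) gradient back and take the optimizer step.
    offset = 0
    for p in params:
        p.grad = g_new[offset:offset + p.numel()].view_as(p)
        offset += p.numel()
    optimizer.step()
```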
5 Experiments
In this section, we first introduce the two public datasets we use and describe the data feature extraction. Then, we discuss our evaluation metrics, including their advantages and shortcomings. Finally, we introduce the baselines and our experimental settings.
5.1 Datasets
CMU-MOSEI
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) (Zadeh et al., 2018b) is currently the largest public dataset for multi-modal sentiment analysis and emotion recognition. It comprises 23,453 annotated data samples extracted from 3228 videos. For emotion recognition, it consists of six basic categories: anger, disgust, fear, happy, sad, and surprise. For zero-shot and few-shot learning evaluation, we use four relatively low-resource categories among them (anger, disgust, fear, surprise). The model is trained on the other five categories when evaluating one zero-shot category. A detailed statistical table about these categories is included in Appendix A.
IEMOCAP
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) (Busso et al., 2008) dataset was created for multi-modal human emotion analysis, and was collected from dialogues performed by ten actors. It is also a multi-labelled emotion recognition dataset which contains nine emotion categories. For comparison with prior works (Wang et al., 2018; Liang et al., 2018; Pham et al., 2018; Tsai et al., 2019a) where four (out of the nine) emotion categories are selected for training and evaluating the models, we also follow the same four categories, namely, happy, sad, angry, and neutral, to train our model. For zero-shot learning evaluation, we consider three low-resource categories from the remaining five, namely, excited, surprised, and frustrated, as unseen emotions.
5.2 Data Feature Extraction
We use the CMU-Multimodal SDK (Zadeh et al., 2018c) to download and pre-process the datasets. It performs data alignment and early-stage feature extraction for each modality. The textual data is tokenized at the word level and represented using GloVe (Pennington et al., 2014) embeddings. Facial action units, which indicate muscle movements and expressions (Ekman et al., 1980), are extracted with Facet (iMotions, 2017); these are a commonly used type of feature for facial expression recognition (Fan et al., 2020). For the acoustic data, COVAREP (Degottex et al., 2014) is used to extract fundamental features, such as mel-frequency cepstral coefficients (MFCCs), pitch tracking, glottal source parameters, etc.
Table 1: Results on the CMU-MOSEI dataset (W-Acc: weighted accuracy, AUC: area under the ROC curve). Results marked with † are obtained from our re-runs (Section 5.4).

| Emotion | Anger | | Disgust | | Fear | | Happy | | Sad | | Surprise | | Average | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC |
| EF-LSTM | 58.5 | 62.2 | 59.9 | 63.9 | 50.1 | 69.8 | 65.1 | 68.9 | 55.1 | 58.6 | 50.6 | 54.3 | 56.6 | 63.0 |
| LF-LSTM | 57.7 | 66.5 | 61.0 | 71.9 | 50.7 | 61.1 | 63.9 | 68.6 | 54.3 | 59.6 | 51.4 | 61.5 | 56.5 | 64.9 |
| Graph-MFN | 62.6 | - | 69.1 | - | 62.0 | - | 66.3 | - | 60.4 | - | 53.7 | - | 62.3 | - |
| MTL | 66.8 | 68.0† | 72.7 | 76.7† | 62.2 | 42.9† | 53.6 | 71.4† | 61.4 | 57.6† | 60.6 | 65.1† | 62.8 | 63.6† |
| Ours | 67.0 | 71.7 | 72.5 | 78.3 | 65.4 | 71.6 | 67.9 | 73.9 | 62.6 | 66.7 | 62.1 | 66.4 | 66.2 | 71.4 |
Table 2: Results on the IEMOCAP dataset. Results marked with † are obtained from our re-run (Section 5.4).

| Emotion | Happy | | Sad | | Angry | | Neutral | |
|---|---|---|---|---|---|---|---|---|
| Metrics | Acc | AUC | Acc | AUC | Acc | AUC | Acc | AUC |
| EF-LSTM | 85.8 | 70.7 | 83.7 | 85.8 | 75.8 | 90.3 | 67.1 | 74.1 |
| LF-LSTM | 85.2 | 71.7 | 83.4 | 84.4 | 79.5 | 86.8 | 66.5 | 72.2 |
| RMFN (Liang et al., 2018) | 87.5 | - | 83.8 | - | 85.1 | - | 69.5 | - |
| RAVEN (Wang et al., 2018) | 87.3 | - | 83.4 | - | 87.3 | - | 69.7 | - |
| MCTN (Pham et al., 2018) | 84.9 | - | 80.5 | - | 79.7 | - | 62.3 | - |
| MulT (Tsai et al., 2019a) | 83.5† | 71.2† | 85.0† | 89.3† | 85.5† | 92.4† | 71.0† | 77.2† |
| Ours | 85.0 | 74.2 | 86.6 | 88.4 | 88.1 | 93.2 | 71.1 | 76.7 |
5.3 Evaluation Metrics
Weighted Accuracy
Due to the imbalanced nature of the emotion recognition datasets (for each emotion category, there are many more negative samples than positive samples), we use binary weighted accuracy (Tong et al., 2017) on each category to better measure the model's performance:
$$\text{W-Acc} = \frac{TP \times N/P + TN}{2N},$$
where $P$ is the total number of positive samples, $TP$ the number of true positives, $N$ the total number of negative samples, and $TN$ the number of true negatives.
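A small sketch of this metric (a hypothetical helper, not taken from the released evaluation code):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Weighted (class-balanced) binary accuracy as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    p, n = np.sum(y_true == 1), np.sum(y_true == 0)
    return (tp * (n / p) + tn) / (2 * n)

# e.g. weighted_accuracy([1, 0, 0, 0], [1, 0, 0, 1]) == (1 * 3 + 2) / 6 ~ 0.833
```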
Weighted F1
In prior works (Zadeh et al., 2018b; Akhtar et al., 2019; Tsai et al., 2019a), the binary weighted F1 score is used on the CMU-MOSEI dataset; its formula is shown in Eq. (5).
$$\text{W-F1} = \frac{P \cdot \text{F1}_{pos} + N \cdot \text{F1}_{neg}}{P + N} \qquad (5)$$
Here, $\text{F1}_{pos}$ is the F1 score that treats positive samples as the positive class, while $\text{F1}_{neg}$ treats negative samples as the positive class, and the two are weighted by their proportions of the data. However, there is one defect of using the binary weighted F1 in this task. As there are many more negative samples than positive ones, we find that as the threshold increases, the weighted F1 score also increases, because the number of true negatives grows. Therefore, in this paper, we do not report this metric. A detailed analysis is given in Appendix B.
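As a hypothetical illustration of this defect, suppose that 93% of the samples of a category are negative and that a model predicts every sample as negative. Then
$$\text{F1}_{pos} = 0, \qquad \text{F1}_{neg} = \frac{2 \times 0.93 \times 1.0}{0.93 + 1.0} \approx 0.96, \qquad \text{W-F1} \approx 0.07 \times 0 + 0.93 \times 0.96 \approx 0.90,$$
so such a degenerate predictor still scores roughly 90% W-F1 without retrieving a single positive sample.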
AUC Score
To eliminate the effect of the threshold and mitigate the defect of the weighted F1 score, we also report Area Under the ROC Curve (AUC) scores. The AUC score considers classification performance on both positive and negative samples, and it is scale- and threshold-invariant.
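For reference, these metrics can be computed per category with standard scikit-learn calls; the snippet below is a usage sketch with toy arrays, not our exact evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, balanced_accuracy_score

y_true = np.array([1, 0, 0, 1, 0, 0])                 # binary labels for one emotion category
y_score = np.array([0.8, 0.3, 0.6, 0.7, 0.2, 0.4])    # sigmoid outputs of the model
y_pred = (y_score > 0.5).astype(int)                  # thresholded predictions

auc = roc_auc_score(y_true, y_score)                  # threshold-invariant
w_f1 = f1_score(y_true, y_pred, average='weighted')   # depends on the chosen threshold
w_acc = balanced_accuracy_score(y_true, y_pred)       # coincides with the W-Acc formula above
```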
5.4 Baselines
For both the CMU-MOSEI and IEMOCAP datasets, we use Early Fusion LSTM (EF-LSTM) and Late Fusion LSTM (LF-LSTM) as two baseline models. Additionally, for CMU-MOSEI, the Graph Memory Fusion Network (Graph-MFN) (Zadeh et al., 2018b) and a multi-task learning (MTL) model (Akhtar et al., 2019) are included for comparison with previous state-of-the-art models. For IEMOCAP, the Recurrent Multistage Fusion Network (RMFN) (Liang et al., 2018), the Recurrent Attended Variation Embedding Network (RAVEN) (Wang et al., 2018), and the Multimodal Transformer (MulT) (Tsai et al., 2019a) are included. To compare AUC scores and zero-shot performance with the baselines, we re-run the MTL and MulT models with their reported best hyper-parameters, and we also carry out a hyper-parameter search for a fair comparison.
Table 3: The best hyper-parameters used for training on the two datasets.

| Hyper-parameter | CMU-MOSEI | IEMOCAP |
|---|---|---|
| Best epoch | 15 | 16 |
| Batch size | 512 | 32 |
| Learning rate | 1e-4 | 1e-3 |
| # LSTM layers | 2 | 2 |
| Hidden size | 300/200/100 | 300/200/100 |
| Dropout | 0.15 | 0.15 |
| Gradient clip | 10.0 | 1.0 |
| Random seed | 0 | 0 |
5.5 Training Details
The model is trained end-to-end with the Adam optimizer (Kingma and Ba, 2015) and a scheduler that reduces the learning rate by a factor of 0.1 when the optimization stays on a plateau for more than 5 epochs. The best hyper-parameters for both datasets are shown in Table 3. We use the largest GloVe word embeddings (glove.840B.300d, available at https://nlp.stanford.edu/projects/glove/) for both the input text data and the emotion embeddings in the textual modality. The weights of the textual embeddings are frozen during training to preserve the pre-trained relations, which is also essential for zero-shot learning.
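A sketch of this optimization setup with standard PyTorch components; the loss function and the placeholder model and vocabulary sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(300, 6))        # stand-in for the full model
glove = torch.randn(1000, 300)                  # placeholder; the real glove.840B.300d is far larger
text_embedding = nn.Embedding.from_pretrained(glove, freeze=True)   # GloVe weights kept frozen

criterion = nn.BCEWithLogitsLoss()              # a common multi-label objective (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5)   # reduce LR by 0.1 after a 5-epoch plateau

for epoch in range(20):
    # ... iterate over training batches, compute the loss, and call optimizer.step() ...
    val_loss = 0.0                              # placeholder for the validation loss
    scheduler.step(val_loss)
```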
6 Analysis
Table 4: Zero-shot and 1% few-shot results on the four low-resource CMU-MOSEI emotions. The upper block reports performance on the unseen emotion itself; the lower block reports the average performance on the remaining (seen) source emotions. FT: fine-tuning, CL: continual learning with GEM, JT: joint training.

| Setting | Model | Anger (unseen) | | Disgust (unseen) | | Fear (unseen) | | Surprise (unseen) | |
|---|---|---|---|---|---|---|---|---|---|
| | Metrics | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC | W-Acc | AUC |
| Zero-Shot | EF-LSTM | 50.6 | 50.9 | 50.3 | 48.2 | 45.8 | 42.3 | 50.2 | 46.9 |
| | LF-LSTM | 48.4 | 49.2 | 49.7 | 44.2 | 47.4 | 47.3 | 48.6 | 48.3 |
| | Ours | 55.9 | 61.6 | 67.5 | 72.7 | 41.8 | 40.6 | 53.4 | 55.5 |
| 1% Few-Shot | FT (Ours) | 58.9 | 61.9 | 67.9 | 71.5 | 43.1 | 43.1 | 51.8 | 53.9 |
| | CL (Ours) | 58.9 | 61.5 | 68.7 | 72.8 | 42.6 | 42.7 | 50.6 | 52.5 |
| | JT (Ours) | 59.0 | 61.1 | 69.2 | 74.2 | 41.9 | 41.7 | 55.2 | 58.1 |
| | | Except Anger | | Except Disgust | | Except Fear | | Except Surprise | |
| Zero-Shot | Ours | 65.6 | 70.6 | 64.4 | 69.3 | 65.9 | 70.9 | 67.2 | 71.4 |
| 1% Few-Shot | FT (Ours) | 64.4 | 69.8 | 63.7 | 68.5 | 65.4 | 70.7 | 65.1 | 71.4 |
| | CL (Ours) | 64.6 | 69.8 | 63.8 | 68.9 | 65.6 | 70.9 | 65.5 | 71.5 |
| | JT (Ours) | 64.3 | 69.3 | 63.5 | 68.8 | 65.9 | 70.8 | 66.1 | 71.5 |

6.1 Results
Table 1 shows our model's performance on the CMU-MOSEI dataset. Compared to the existing baselines, our model surpasses them by a large margin. Weighted accuracy (W-Acc) and the AUC score are used for evaluation, with the threshold set to 0.5 when calculating W-Acc. As discussed in Section 5.3, we do not follow previous papers in using the weighted F1-score (W-F1), because it does not provide an effective evaluation when the dataset is very imbalanced. For example, the weakest baseline, EF-LSTM, can achieve 90% W-F1 by predicting almost all samples as negative. More plots and analysis of this defect of W-F1 are included in Appendix B.
We further test our model on a second dataset, IEMOCAP, and the results are shown in Table 2. Similarly, our model achieves better results on most emotion categories, except happy. For a fair comparison on IEMOCAP, we use accuracy instead of W-Acc, following the previous works compared in the table.
6.2 Effects of Emotion Embeddings
Quantitatively, our model makes a large improvement on the multi-modal emotion recognition task. We believe it benefits greatly from the emotion embeddings, which can model the relationships (or distances) between emotion categories. This is especially important for emotion recognition, which is a multi-label task by nature, as people can have multiple emotions at the same time. For example, if a person is surprised, it is more likely that this person is also happy and excited, and less likely that they are disgusted or sad. This kind of information is expected to be modelled and captured by the emotion embeddings. Intuitively, in the textual space, related emotions (e.g., angry and disgusted) tend to have closer word vectors than unrelated emotions (e.g., angry and happy). To ensure the effectiveness of the word embeddings, we investigated multiple word forms for each emotion; for example, for surprised, we also tried Surprised, (S/s)urprising, and (S/s)urprise. Generally, they all show a similar trend, and in most cases the word form used to describe a person gives the best results. In our final setting, we iterate over these forms and pick the best-performing one for each emotion category.
Moreover, our model can transfer the relationships between emotion categories from the textual space to the acoustic and visual spaces using the end-to-end optimized mapping functions. In Figure 3, we show the Euclidean distances between the emotion embeddings of different categories. The relative positions are preserved very well after being transferred from the textual space to the visual and acoustic spaces, which indicates that the learned mapping functions ($f_{t \to a}$ and $f_{t \to v}$) are effective. Although it is not the main focus of this paper, we believe that improving the pre-trained textual emotion embeddings is an important direction for future work, as it can benefit all modalities and further enhance the overall performance. For example, semantic emotion information (Xu et al., 2018) could be incorporated into the original word embeddings.
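The analysis behind Figure 3 can be reproduced, in spirit, by computing pairwise Euclidean distances among the emotion vectors in each space; the sketch below uses random stand-ins for the learned embeddings.

```python
import torch

emotions = ['anger', 'disgust', 'fear', 'happy', 'sad', 'surprise']
spaces = {
    'textual':  torch.randn(6, 300),   # GloVe emotion vectors (stand-in)
    'acoustic': torch.randn(6, 200),   # f_{t->a}(E^t) (stand-in)
    'visual':   torch.randn(6, 100),   # f_{t->v}(E^t) (stand-in)
}

for name, E in spaces.items():
    dists = torch.cdist(E, E)          # (6, 6) matrix of pairwise Euclidean distances
    print(name)
    print(dists)
```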
6.3 Zero/Few-Shot Results
Benefiting from the pre-trained textual emotion embeddings and the learned mapping functions, our model can recognize unseen emotion categories to a certain extent. We evaluate our model's zero-shot learning ability on the low-resource categories in CMU-MOSEI (Table 4) and IEMOCAP (Table 5). For a fair comparison, we use the same training setting as in Table 1. This ensures that no downgrade happens on the seen emotions and that the model is not selected to overfit a single unseen category. As we can see, the zero-shot results of the baselines are similar to random guesses, because the weights related to the unseen emotion are randomly initialized and have never been optimized. In contrast, our model's zero-shot performance is much better than that of the baselines on almost all emotions. This is because our model learns to classify emotion categories based on the similarity between the sentence representation and the emotion embeddings, which gives it better generalization ability to unseen emotions, since the emotion embeddings carry semantic information in the vector space.
Furthermore, we perform few-shot learning using only 1% of the data of these low-resource categories. As we can see from Table 4, with very few training samples, our model can adapt to unseen emotions without losing performance on the source emotions. In addition, we observe that simply fine-tuning (FT) our trained model sometimes yields inferior performance. This is because the model gradually loses its ability to classify the source emotions, so we have to stop the fine-tuning process early, which hurts performance on the new emotion. In contrast, CL and JT prevent our model from catastrophic forgetting and improve the few-shot performance on the unseen emotion. Moreover, JT achieves slightly better performance than CL. This can be attributed to the fact that CL might still result in performance drops on the source emotions, since the model only observes partial samples from them, whereas JT directly optimizes the model on the data samples of those emotions.
6.4 Ablation Study
Table 5: Zero-shot results on the unseen IEMOCAP emotions (excited, surprised, and frustrated) with different modality combinations available at inference time (T: textual, A: acoustic, V: visual).

| Unseen emotion | Excited | | Surprised | | Frustrated | |
|---|---|---|---|---|---|---|
| Metrics | Acc | F1 | Acc | F1 | Acc | F1 |
| EF-LSTM | 13.1 | 23.1 | 11.3 | 5.1 | 22.9 | 37.2 |
| LF-LSTM | 14.0 | 23.3 | 2.6 | 5.1 | 23.7 | 37.4 |
| MulT | 45.1 | 27.3 | 41.4 | 7.5 | 48.7 | 40.9 |
| Ours (TAV) | 82.0 | 56.1 | 78.8 | 13.1 | 73.6 | 57.9 |
| Ours (TA) | 79.9 | 52.7 | 79.3 | 14.4 | 75.1 | 60.1 |
| Ours (TV) | 75.9 | 42.7 | 58.6 | 9.1 | 54.1 | 13.6 |
| Ours (AV) | 89.1 | 69.9 | 65.7 | 13.2 | 83.9 | 73.6 |
| Ours (T) | 72.9 | 37.1 | 67.7 | 3.1 | 55.3 | 9.0 |
| Ours (A) | 76.9 | 52.8 | 82.1 | 16.8 | 86.1 | 74.8 |
| Ours (V) | 82.1 | 35.0 | 81.1 | 6.2 | 68.6 | 44.6 |
Table 6: Ablation study on CMU-MOSEI with different modality combinations (metric: W-Acc). T: textual, A: acoustic, V: visual.

| Modalities | Anger | Disgust | Fear | Happy | Sad | Surprise |
|---|---|---|---|---|---|---|
| T+A+V | 67.0 | 72.5 | 65.4 | 67.9 | 62.6 | 62.1 |
| T+A | 65.0 | 71.9 | 64.8 | 66.0 | 63.0 | 59.9 |
| T+V | 64.9 | 71.2 | 66.7 | 67.6 | 61.0 | 60.4 |
| A+V | 63.8 | 71.1 | 65.5 | 64.5 | 61.3 | 55.2 |
| Only T | 61.5 | 69.0 | 64.3 | 64.2 | 59.7 | 61.2 |
| Only A | 61.9 | 71.5 | 66.9 | 62.7 | 61.0 | 54.8 |
| Only V | 63.4 | 69.7 | 63.2 | 63.2 | 58.5 | 53.3 |
To further investigate how each individual modality influences the model, we perform comprehensive ablation studies on supervised multi-modal emotion recognition and also zero-shot prediction.
In Table 6, we enumerate different subsets of the textual, acoustic, and visual modalities to evaluate the effect of each individual modality. Generally, the performance increases when more modalities are available. Compared to single-modal data, multi-modal data can provide supplementary information, which leads to more accurate emotion recognition. Among the single modalities, we find that the textual and acoustic modalities are more effective than the visual one.
Similarly, in Table 5, we show the zero-shot performance with different combinations of modalities at inference time (all modalities are present in the training phase). As there are many more negative samples than positive ones in the ZSL setting, we also evaluate the models with the unweighted F1 score, because a model with high accuracy but a low F1 is heavily biased towards the negative samples and cannot classify effectively. The empirical results indicate that zero-shot prediction on only one modality is possible. Moreover, if the data of an emotion category has strong characteristics in one modality and is ambiguous in the others, a single modality can even surpass the full multi-modal input on zero-shot prediction. For example, single-modality zero-shot prediction using the acoustic modality on the surprised category performs better than using all modalities.
7 Conclusion
In this paper, we introduce a modality-transferable model that leverages cross-modality emotion embeddings for multi-modal emotion recognition. It makes predictions by measuring the distances between the input data and the target emotion categories, which is especially effective for a multi-label problem. The model also learns two mapping functions to transfer the pre-trained textual emotion embeddings into the acoustic and visual spaces. The empirical results demonstrate that it achieves state-of-the-art performance on most of the categories. Enabled by the emotion embeddings, our model can carry out zero-shot learning for unseen emotion categories and can quickly adapt in the few-shot setting without degrading performance on the trained categories.
Acknowledgement
This work is funded by MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government.
References
- Akhtar et al. (2019) Md Shad Akhtar, Dushyant Chauhan, Deepanway Ghosal, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multi-task learning for multi-modal emotion recognition and sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 370–379, Minneapolis, Minnesota. Association for Computational Linguistics.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Baltrušaitis et al. (2018) Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443.
- Bapna et al. (2017) Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Towards zero-shot frame semantic parsing for domain scaling. Proc. Interspeech 2017, pages 2476–2480.
- Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. Iemocap: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359.
- Degottex et al. (2014) Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. Covarep — a collaborative voice analysis repository for speech technologies. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 960–964.
- Ekman et al. (1980) Paul Ekman, Wallace V Freisen, and Sonia Ancoli. 1980. Facial signs of emotional experience. Journal of personality and social psychology, 39(6):1125.
- Fan et al. (2020) Yingruo Fan, Jacqueline Lam, and Victor Li. 2020. Facial action unit intensity estimation via semantic correspondence learning with dynamic graph convolution. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Fernando et al. (2017) Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. 2017. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734.
- Gu et al. (2018) Jiatao Gu, Yong Wang, Yun Chen, Victor OK Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631.
- Hazarika et al. (2018) Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2122–2132, New Orleans, Louisiana. Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.
- Huck et al. (2019) Matthias Huck, Diana Dutka, and Alexander Fraser. 2019. Cross-lingual annotation projection is effective for neural part-of-speech tagging. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 223–233.
- iMotions (2017) iMotions. 2017. Facial expression analysis. https://imotions.com/biosensor/fea-facial-expression-analysis/. Accessed: 2010-09-30.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.
- Lee et al. (2017) Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Overcoming catastrophic forgetting by incremental moment matching. In Advances in neural information processing systems, pages 4652–4662.
- Lewis et al. (2019) Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.
- Liang et al. (2018) Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Multimodal language analysis with recurrent multistage fusion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 150–161, Brussels, Belgium. Association for Computational Linguistics.
- Liu et al. (2019a) Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019a. Xqa: A cross-lingual open-domain question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368.
- Liu et al. (2018) Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, Melbourne, Australia. Association for Computational Linguistics.
- Liu et al. (2019b) Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale Fung. 2019b. Zero-shot cross-lingual dialogue systems with transferable latent variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1297–1303.
- Liu et al. (2019c) Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2019c. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. arXiv preprint arXiv:1911.09273.
- Liu et al. (2020) Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung. 2020. Coach: A coarse-to-fine approach for cross-domain slot filling. arXiv preprint arXiv:2004.11727.
- Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. In Advances in neural information processing systems, pages 6467–6476.
- Madotto et al. (2019) Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459.
- Ni et al. (2017) Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470–1480.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Pham et al. (2018) Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2018. Found in translation: Learning robust joint representations by cyclic translations between modalities. ArXiv, abs/1812.07809.
- Rusu et al. (2016) Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
- Schuller et al. (2011) Björn Schuller, Michel Valstar, Florian Eyben, Gary McKeown, Roddy Cowie, and Maja Pantic. 2011. Avec 2011–the first international audio/visual emotion challenge. In Affective Computing and Intelligent Interaction, pages 415–424, Berlin, Heidelberg. Springer Berlin Heidelberg.
- Shu et al. (2016) Lei Shu, Bing Liu, Hu Xu, and Annice Kim. 2016. Lifelong-rl: Lifelong relaxation labeling for separating entities and aspects in opinion targets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2016, page 225. NIH Public Access.
- Shu et al. (2017) Lei Shu, Hu Xu, and Bing Liu. 2017. Doc: Deep open classification of text documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2911–2916.
- Tong et al. (2017) Edmund Tong, Amir Zadeh, Cara Jones, and Louis-Philippe Morency. 2017. Combating human trafficking with multimodal deep models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1547–1556, Vancouver, Canada. Association for Computational Linguistics.
- Tsai et al. (2019a) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019a. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. Association for Computational Linguistics.
- Tsai et al. (2019b) Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019b. Learning factorized multimodal representations. ArXiv, abs/1806.06176.
- Wang et al. (2018) Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. Proceedings of the … AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, 33 1:7216–7223.
- Winata et al. (2020) Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Peng Xu, and Pascale Fung. 2020. Learning fast adaptation on cross-accented speech recognition. arXiv preprint arXiv:2003.01901.
- Wisniewski et al. (2014) Guillaume Wisniewski, Nicolas Pécheux, Souhir Gahbiche-Braham, and François Yvon. 2014. Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1779–1785.
- Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819.
- Xie et al. (2018) Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime G Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369–379.
- Xu et al. (2020) Peng Xu, Zihan Liu, Genta Indra Winata, Zhaojiang Lin, and Pascale Fung. 2020. Emograph: Capturing emotion correlations using graph networks. arXiv preprint arXiv:2008.09378.
- Xu et al. (2018) Peng Xu, Andrea Madotto, Chien-Sheng Wu, Ji Ho Park, and Pascale Fung. 2018. Emo2vec: Learning generalized emotion representation by multi-task training. arXiv preprint arXiv:1809.04505.
- Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, Copenhagen, Denmark. Association for Computational Linguistics.
- Zadeh et al. (2018a) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In AAAI.
- Zadeh et al. (2018b) Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In ACL.
- Zadeh et al. (2018c) Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018c. Multi-attention recurrent network for human communication comprehension. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Zhao and Eskenazi (2018) Tiancheng Zhao and Maxine Eskenazi. 2018. Zero-shot dialog generation with cross-domain latent actions. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 1–10.
Appendix A Statistics of Datasets
Statistics of the emotion categories in CMU-MOSEI (number of samples per split).

| Emotion | Train | Valid | Test |
|---|---|---|---|
| Anger | 3443 | 427 | 971 |
| Disgust | 2720 | 352 | 922 |
| Fear | 1319 | 186 | 332 |
| Happy | 8147 | 1313 | 2522 |
| Sad | 3906 | 576 | 1334 |
| Surprise | 1562 | 201 | 479 |
Statistics of the emotion categories in IEMOCAP (number of samples per split); the three unseen emotions appear only in the test set.

| Emotion | Train | Valid | Test |
|---|---|---|---|
| Happy | 338 | 116 | 135 |
| Sad | 690 | 188 | 193 |
| Angry | 735 | 136 | 227 |
| Neutral | 954 | 358 | 383 |
| Excited | - | - | 141 |
| Surprised | - | - | 25 |
| Frustrated | - | - | 278 |
Appendix B Weighted F1 Analysis
[Figure 4: Weighted F1 (W-F1) on the CMU-MOSEI validation set during training, under different threshold values.]
[Figure 5: Weighted accuracy (W-Acc) on the CMU-MOSEI validation set during training, under different threshold values.]
Figures 4 and 5 show the trends of the weighted F1 (W-F1) and weighted accuracy (W-Acc) on the validation set of CMU-MOSEI during the training phase. The lines represent different threshold values, as shown in the legend of each figure. The W-F1 is almost proportional to the threshold value, and it remains very high even when the threshold is 0.9 (i.e., most data samples are predicted to be negative). Moreover, when the threshold is large, the W-F1 stays at a high value starting from epoch 1. By contrast, the W-Acc score is more reliable, as it ensures that the model can also retrieve positive samples. We observe a similar phenomenon in all models. As a result, we consider W-F1 unsuitable as an evaluation metric on this dataset.