
Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Zijian Yi 0009-0004-6672-0745 Wuhan University of Technology School of Computer Science and Artificial Intelligence Wuhan Hubei China yzjyjs@whut.edu.cn Ziming Zhao 0009-0000-2606-6567 University of Michigan School of Information Ann Arbor United States zhziming@umich.edu Zhishu Shen 0000-0002-3123-4390 Wuhan University of Technology School of Computer Science and Artificial Intelligence Wuhan Hubei China z_shen@ieee.org and Tiehua Zhang 0000-0002-7195-4472 Tongji University College of Electronics and Information Engineering Shanghai China tiehuaz@tongji.edu.cn
Abstract.

Multimodal emotion recognition in conversation (MERC) seeks to identify the speakers' emotions expressed in each utterance, offering significant potential across diverse fields. The challenge of MERC lies in balancing speaker modeling and context modeling, encompassing both long-distance and short-distance contexts, as well as addressing the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships effectively. Nevertheless, the majority of these methods utilize a fixed fully connected structure to link all utterances, relying on convolution to interpret complex context. This approach inherently heightens the redundancy of contextual messages and causes excessive smoothing of the graph network, particularly for long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections via a variational hypergraph autoencoder (VHGAE) and employs contrastive learning to mitigate uncertainty factors during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against the state-of-the-art methods on the IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work at https://github.com/yzjred/-HAUCL.

Multimodal Emotion Recognition in Conversation, Variational Hypergraph Autoencoder, Contrastive Learning, Multimodal Fusion
Corresponding Authors: Zhishu Shen and Tiehua Zhang.
CCS Concepts: Information systems → Sentiment analysis; Computing methodologies → Discourse, dialogue and pragmatics

1. Introduction

Emotion is one of the crucial characteristics of human behavior (Khare et al., 2024). Experienced psychiatrists can assess emotions by observing an individual’s behavior, which serves as a key indicator for understanding their inclinations and responses. As human-computer interaction (HCI) advances, the capability to discern emotions from dialogues using multimodal information is becoming increasingly significant (Zhao et al., 2021; Shoumy et al., 2020). This process is commonly referred to as multimodal emotion recognition in conversation (MERC). The multimodality herein includes different modal information such as the speaker’s language, tone, facial expression, body movement and so on (Yang et al., 2022; Gandhi et al., 2023). From a modeling perspective, a conversation consists of a sequence of utterances. Each utterance contains one or more modalities of information and is linked to speaker information. The target of MERC is to identify the emotion category of each utterance by analyzing the available information and contextual cues.

Compared with emotion recognition in non-dialogue scenarios (Hazarika et al., 2022), MERC necessitates a specific emphasis on modelling the speakers involved in the dialogue. Also, unlike the analysis of single-modal information (Ghosal et al., 2019), the processing of multimodal information demands distinct processing techniques to extract meaningful information from various modalities. Different modalities of information need to be synthesized to facilitate the comprehensive analysis of a conversation. For example, when a speaker utters the word "ok" with a tone of helplessness, relying solely on textual cues may not fully convey the speaker's emotional state; taking factors such as intonation and tone into account helps infer the underlying feeling of sadness expressed by the speaker. Efficient integration and utilization of multimodal information play a crucial role in enhancing the precision of emotion recognition during conversations (Wang et al., 2023).

Current research methodologies regarding MERC can be classified into two main categories: non-graph-based method (Poria et al., 2017; Majumder et al., 2019; Hu et al., 2021b) and graph-based method (Ghosal et al., 2019; Hu et al., 2021a, 2022; Li et al., 2023; Chen et al., 2023). Non-graph-based method typically utilizes recurrent neural networks (RNN) or long short-term memory (LSTM) to capture contextual information, while the output utterance representations are used for label classification. However, these methods encounter challenges in modeling long-range dependencies because of issues in information propagation and gradient vanishing problems (Ghosal et al., 2019). Graph-based method typically uses a graph to depict a conversation, with each utterance represented as a node and the relationships between utterances shown through edge weights or connections between nodes. Contextual information is captured through graph convolutions, and the resulting node embeddings are fed into subsequent classification steps (Khare et al., 2024).

Graph-based methods can be further divided into standard graph-based and hypergraph-based methods. Standard graph-based methods (Ghosal et al., 2019; Hu et al., 2021a, 2022; Li et al., 2023) represent the textual information in utterances as nodes and capture contextual relationships by connecting nodes with various types of edges within a specific window size. However, their pairwise connection approach fails to accurately depict the actual physical structure of MERC. Additionally, as the number of graph convolution layers rises, the training time and storage requirements increase exponentially. Deep stacking can also result in oversmoothing of the graph and redundancy of nodes, potentially leading to inaccurate assessments (Rong et al., 2019).

Hypergraph-based methods replace the point-to-point connection with a hyperedge connection structure that more closely fits the model (Chen et al., 2023). A hypergraph is a special graph structure capable of capturing high-order correlations, enabling the exploration of more intricate relationships (Antelmi et al., 2023). By linking the multiple modalities within a single utterance and connecting all nodes of the same modality using hyperedges, hypergraph-based methods achieve outstanding performance improvement. Nevertheless, the fixed fully connected hypergraph structure still results in information redundancy, graph smoothing, and slow convergence, especially when processing long-distance conversations (Zhang et al., 2019).

To address the aforementioned issues in existing hypergraph-based methods, we propose a multimodal fusion framework via hypergraph autoencoder and contrastive learning, named HAUCL, for MERC, which is applicable to multimodal data and capable of adaptively adjusting hypergraph connections. The framework consists of five modules: (1) unimodal encoding, (2) hypergraph construction, (3) hypergraph convolution, (4) hypergraph contrastive learning, and (5) classifier. The unimodal encoding module is designed to generate modality-independent representations. The hypergraph construction module first forms an initial fully connected hypergraph structure. Then, a variational hypergraph autoencoder (VHGAE)-based approach (Wei et al., 2022) is introduced to realize adaptive adjustment of the hypergraph. In this paper, we develop VHGAE to map the hypergraph into a latent space, obtain node and hyperedge embeddings by sampling from that space, and then learn new connections via Gumbel-Softmax (Jang et al., 2016). These procedures exhibit a degree of randomness. To minimize the influence of random factors, two parameter-sharing paths are established through contrastive learning: two VHGAEs reconstruct the hypergraph, and the reconstructed hypergraphs are fed to the subsequent hypergraph convolution module to learn embeddings along with contextual information. Then, a point-to-point hypergraph contrastive learning module is applied to the two obtained hypergraphs, where nodes corresponding to each other in different hypergraphs are treated as positive sample pairs to ensure model stability, while the remaining nodes are treated as negative sample pairs to encourage more distinctive embeddings. Finally, the learned embeddings are fed into the classifier module for emotion category prediction.

The main contributions of this paper are summarized as follows:

  • We propose a joint learning framework based on hypergraphs, which achieves synergistic optimization of hypergraph reconstruction, contrastive learning, and emotion recognition, leading to globally optimal performance. Specifically, VHGAE is integrated into MERC to adaptively adjust the hypergraph, while Gumbel-Softmax is devised to mitigate data overflow.

  • We utilize contrastive learning to mitigate the impact of uncertainty in the sampling process and the Gumbel-softmax learning process of VHGAE, enhancing the robustness and stability of the model.

  • Extensive experiments conducted on two mainstream MERC datasets, IEMOCAP and MELD, validate the effectiveness of our work. The results show that our proposal outperforms the state-of-the-art methods in terms of accuracy and weighted F1 score.

2. Related Work

2.1. Multimodal Emotion Recognition

Regarding the non-graph learning methods, BC-LSTM captures contextual information from surrounding utterances in three different modalities by using three independent bidirectional LSTM networks, and the output utterance representations are used for label classification (Poria et al., 2017). However, this method lacks the usage of speaker information and thus is not applicable to multi-person conversation scenarios. DialogueRNN utilizes three gate recurrent units (GRUs) to track the global context, speaker state, and emotion state throughout the entire dialogue, which effectively integrates speaker modeling, contextual modeling, and emotion modeling (Majumder et al., 2019). To mimic the human reasoning process, DialogueCRN introduces reasoning modules to integrate the factors that make emotions happen (Hu et al., 2021b).

The standard graph-based methods, such as DialogueGCN (Ghosal et al., 2019), represent textual information in utterances as nodes and capture contextual relationships through different types of edges connecting nodes within a given window size. The multimodal graph convolutional network (MMGCN) extends DialogueGCN by further incorporating the audio and video modalities into the model (Hu et al., 2021a). To address the challenge of cross-modal interaction in information fusion within ERC, DIMMN introduces a dynamic interactive multi-view memory network that leverages complementary information from all modalities and dynamically balances the relationships between them during the fusion process (Wen et al., 2023). MM-DFN (Hu et al., 2022) uses a dynamic fusion mechanism to fully understand the context relationship between multiple modalities and reduce the redundancy between modalities. COGMEN (Joshi et al., 2022) utilizes a graph neural network (GNN) to leverage both local and global information in a conversation. GraphMFT (Li et al., 2023) not only designs a graph-based multimodal fusion method but also utilizes multiple graph attention networks (GATs) to capture intra-modal contextual details and inter-modal complementary information. M3NET (Chen et al., 2023) introduces the hypergraph into the field of MERC; through simple fully connected structures, randomly initialized edge weights, and multiple hypergraph convolutions, it achieves significant improvement in prediction accuracy and time efficiency.

Figure 1. An illustration showcasing the differences between hypergraphs (left) and standard graphs (right).

Figure 2. An overview of our proposed framework HAUCL.

2.2. Hypergraph Learning

A hypergraph acts as an extended version of standard graph learning, specifically designed to extract high-order correlations within the data (Antelmi et al., 2023; Zhang et al., 2023). Examples of hypergraphs are shown on the left side of Figure 1, with the corresponding standard graphs shown on the right side. The circular dots represent five nodes, i.e., $V_{1}$ to $V_{5}$. Curves with the same color form a hyperedge, and there are three hyperedges $e_{1},e_{2},e_{3}$ in total. In a hypergraph, connections are not limited to pairwise relationships as in a standard graph. Hyperedges can link multiple nodes together, and a single node can be linked by multiple hyperedges simultaneously. Meanwhile, a hypergraph can include multiple types of hyperedges, representing multiple meanings. In this paper, we create a hypergraph in which all utterances linked to the same speaker are grouped together on a hyperedge, while also connecting nodes of the same modality into another hyperedge. This structure closely resembles the physical structure of certain models, capturing higher-order correlations and minimizing information loss during the modeling process. The effectiveness of hypergraph learning in solving the association problem of multimodal data has already been verified in various applications, such as recommendation systems (Xia et al., 2022), video-based person re-identification (Yan et al., 2020), sleep stage classification (Liu et al., 2024; Zhang et al., 2024), and drug-target interaction prediction (Ruan et al., 2021).

Regarding hypergraph convolutions, the learning process involves aggregating node information onto connected hyperedges with varying weights, followed by sending messages from the hyperedges back to the connected nodes. This process is not constrained by distance, thereby mitigating the limitations of message transmission during the process (Gao et al., 2022). These benefits are particularly pronounced in long-distance transmissions (Gao et al., 2020). Therefore, the hypergraph learning process is anticipated to be effective in the MERC task, as speakers frequently discuss topics that are distant from the current conversation, utilizing long-distance cues.

3. Methodology

In this paper, we model the MERC task as follows: a conversation contains a sequence of utterances $u_{i}$ ($i=1,\dots,N$), where $N$ is the number of utterances. Each utterance $u_{i}$ consists of textual, acoustic, and visual modalities, represented as $u_{i}=\{u^{t}_{i},u^{a}_{i},u^{v}_{i}\}$. Meanwhile, each $u_{i}$ is spoken by a corresponding person $s_{i}$. By integrating the speaker information, an utterance can be denoted as $v_{i}=(u_{i},s_{i})$. The goal of a MERC task is to predict the emotion label for each utterance $v_{i}$ based on the given multimodal information. The overall framework of our proposed HAUCL is illustrated in Figure 2. It includes unimodal encoding, hypergraph construction, convolution, contrastive learning, and a classifier.

3.1. Preprocess and Unimodal Encoding

This module involves extracting essential information from the raw visual, textual, and acoustic modality data. Following the approach outlined in M3NET (Chen et al., 2023), features from the visual modality are extracted using DenseNet (Huang et al., 2016) or 3D-CNN (Yang et al., 2019), depending on the adopted dataset. Features from the acoustic and textual modalities are extracted using the OpenSmile toolkit (Eyben et al., 2010) and the RoBERTa large model (Liu et al., 2019), respectively.

As mentioned above, incorporating contextual information is crucial for emotion category prediction in conversations. To enhance discourse feature representation, we employ encoding methods tailored to the characteristics of different modalities. Specifically, we utilize a GRU network (Dey and Salem, 2017) to encode context information for the textual modality, while acoustic and visual information is encoded using two fully connected multilayer perceptrons (MLPs). To facilitate information fusion across modalities, we normalize the encoded representations to a unified dimension $d$ as below:

(1a) $U_{i}^{a}=W_{a}u_{i}^{a}+b^{a}_{i}$
(1b) $U_{i}^{v}=W_{v}u_{i}^{v}+b^{v}_{i}$
(1c) $U_{i}^{t}=W_{t}\bigl(\overleftrightarrow{\mathrm{GRU}}(U_{i-1}^{t},u_{i}^{t},u_{i+1}^{t})\bigr)+b^{t}_{i}$

where $u_{i}$ is the input of unimodal encoding and $U_{i}$ is the output of the module with dimension $d$. The superscripts $a$, $v$, and $t$ denote the acoustic, visual, and textual modalities, respectively. $W$ and $b$ are trainable parameters.

Speaker information is a critical factor that affects the performance of the MERC task. We first encode the speaker information into one-hot vectors $s_{i}$ and project them as:

(2) $S_{i}=W_{s}s_{i}+b^{s}_{i}$

Next, we integrate them into the modality information by:

(3) $V_{i}^{x}=S_{i}+U_{i}^{x},\quad x\in\{t,a,v\}$

The output of this module is $V_{i}$, the feature embedding with modality-independent context awareness and speaker information.
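A minimal PyTorch sketch of this encoding step (Equations 1-3) is given below; the module layout, dimension choices, and the assumption that speaker IDs arrive as one-hot vectors are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Sketch of Eqs. (1)-(3): project each modality to a shared dimension d
    and add a speaker embedding. Assumes d is even and batch-first inputs."""
    def __init__(self, d_text, d_audio, d_visual, n_speakers, d):
        super().__init__()
        # Bidirectional GRU for textual context (Eq. 1c); its output size is d.
        self.gru = nn.GRU(d_text, d // 2, batch_first=True, bidirectional=True)
        self.proj_t = nn.Linear(d, d)
        self.proj_a = nn.Linear(d_audio, d)     # Eq. (1a)
        self.proj_v = nn.Linear(d_visual, d)    # Eq. (1b)
        self.proj_s = nn.Linear(n_speakers, d)  # Eq. (2), one-hot speaker input

    def forward(self, u_t, u_a, u_v, s_onehot):
        # u_t: (B, N, d_text); u_a: (B, N, d_audio); u_v: (B, N, d_visual)
        h_t, _ = self.gru(u_t)                  # contextualised textual features
        U_t, U_a, U_v = self.proj_t(h_t), self.proj_a(u_a), self.proj_v(u_v)
        S = self.proj_s(s_onehot)               # speaker embedding
        return U_t + S, U_a + S, U_v + S        # Eq. (3)
```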

3.2. Hypergraph Construction

This module is composed of three stages: structure initialization, VHGAE, and hypergraph reconstruction.

3.2.1. Structure Initialization

We represent a conversation with continuous utterances through a hypergraph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where each node $v\in\mathcal{V}$ represents a unimodal utterance and each hyperedge $e\in\mathcal{E}$ captures multimodal dependencies. Each modality of an utterance is represented by a node in the hypergraph, i.e., $V^{t}_{i}$ for the textual modality, $V^{a}_{i}$ for the acoustic modality, and $V^{v}_{i}$ for the visual modality.

We design two distinct types of hyperedges in this paper: the first type connects every node of a modality to form a hyperedge, i.e., $\{V_{1}^{v},V_{2}^{v},\dots,V_{N}^{v}\}$, $\{V_{1}^{t},V_{2}^{t},\dots,V_{N}^{t}\}$, and $\{V_{1}^{a},V_{2}^{a},\dots,V_{N}^{a}\}$. The second type creates a hyperedge $\{V_{i}^{a},V_{i}^{v},V_{i}^{t}\}$ by joining the three modalities of an utterance.

Similar to standard graphs, the incidence matrix for hypergraphs can be defined as $\mathcal{H}\in\mathbb{R}^{3N\times M}$, where $N$ and $M$ are the number of nodes per modality and the number of hyperedges, respectively. In the initialization phase, the relationship between $M$ and $N$ is therefore $M=N+3$. We define $H_{i,j}$ to indicate the presence of node $i$ in hyperedge $j$ as:

(4) $H_{i,j}=\begin{cases}1,&\text{node $i$ is included in hyperedge $j$}\\ 0,&\text{otherwise}\end{cases}$
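The following sketch illustrates one way to build this initial incidence matrix; the node ordering (text block, then acoustic, then visual) is an assumption made for the example, not a convention stated in the paper.

```python
import torch

def init_incidence_matrix(n_utterances: int) -> torch.Tensor:
    """Sketch of the fully connected initial hypergraph (Eq. 4).
    Assumed node layout: rows [0, N) text, [N, 2N) acoustic, [2N, 3N) visual.
    Columns: 3 modality hyperedges followed by N utterance hyperedges (M = N + 3)."""
    N = n_utterances
    H = torch.zeros(3 * N, N + 3)
    for m in range(3):                           # one hyperedge per modality
        H[m * N:(m + 1) * N, m] = 1.0
    for i in range(N):                           # one hyperedge per utterance
        H[[i, N + i, 2 * N + i], 3 + i] = 1.0
    return H

# Example: a conversation with 4 utterances yields a 12 x 7 incidence matrix.
print(init_incidence_matrix(4).shape)            # torch.Size([12, 7])
```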

3.2.2. VHGAE

The fully connected hypergraph generated in the structure initialization stage may lead to redundancy in the subsequent update process, impeding the classification of subtle differences. To mitigate this challenge, we introduce VHGAE to reconstruct the hypergraph, aiming to identify the most appropriate hypergraph structure. VHGAE comprises three processes: encoder, sampler, and decoder. The structure of VHGAE is illustrated in Figure 3.

Figure 3. The structure of VHGAE.

Encoder: It aims to project the hypergraph into a representation consisting of sets of nodes and hyperedges. This projection can facilitate the subsequent decoding process of the non-Euclidean structure, as highlighted in the original paper that proposed variational graph auto-encoders (VGAE) (Kipf and Welling, 2016).

In our proposed method, we follow the VHGAE framework and utilize a hypergraph neural network (HyperGNN) to perform hypergraph convolution on the original hypergraph. This convolution operation produces embeddings for both the nodes $v$ and the hyperedges $\epsilon$ as:

(5) $v,\epsilon=\text{HyperGNN}(\mathcal{G})$

We utilize the obtained embeddings to encode the mean $\mu$ and variance $\sigma$ vectors for each type $k\in\{v,\epsilon\}$. This encoding process involves applying linear transformations and activation functions, as described by the following equations:

(6a) $\mu_{k}=W^{k}_{1}\bigl(\sigma(W_{\mu}^{k}(k)+b_{\mu}^{k})\bigr)+b^{k}_{1}$
(6b) $\sigma_{k}=\sigma_{1}\bigl(W^{k}_{2}(\sigma(W_{\sigma}^{k}(k)+b_{\sigma}^{k}))+b^{k}_{2}\bigr)$

where $W^{k}$ and $b^{k}$ are learnable parameters specific to the type $k$. The activation functions $\sigma$ and $\sigma_{1}$ correspond to the ReLU and Softplus functions, respectively.

Through encoding the mean and variance vectors with node and hyperedge embeddings, we can effectively capture and represent the crucial information regarding the hypergraph’s structure within a latent space. These encoded vectors will play a pivotal role in the subsequent stages, enabling the generation of meaningful and relevant outputs.
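A hedged sketch of this encoder head is shown below: the HyperGNN convolution itself is abstracted away, and the two-layer mean/variance heads follow Equations 6a-6b, with the layer widths chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalHead(nn.Module):
    """Sketch of Eqs. (6a)-(6b): map node or hyperedge embeddings produced by the
    HyperGNN to the mean and (positive) standard-deviation vectors of the latent
    Gaussian. Layer sizes are illustrative."""
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.mu_1, self.mu_2 = nn.Linear(d_in, d_latent), nn.Linear(d_latent, d_latent)
        self.sig_1, self.sig_2 = nn.Linear(d_in, d_latent), nn.Linear(d_latent, d_latent)

    def forward(self, k):
        mu = self.mu_2(F.relu(self.mu_1(k)))                    # Eq. (6a): ReLU inside, linear outside
        sigma = F.softplus(self.sig_2(F.relu(self.sig_1(k))))   # Eq. (6b): Softplus keeps sigma positive
        return mu, sigma
```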

Sample: To incorporate the reparametrization trick, we utilize sampling in the latent space to obtain new nodes and hyperedge embeddings. The sampling process introduces stochasticity while ensuring differentiable computations during the training phase.

To generate the new embeddings, we use the mean $\mu_{k}$ and variance $\sigma_{k}$ vectors obtained from the encoder process by Equations 6a and 6b. The reparametrization trick involves sampling from a standard normal distribution $\delta\sim N(0,1)$ and scaling it by the standard deviation $\sigma_{k}$. The obtained sample is then added element-wise to the mean vector $\mu_{k}$ to obtain the new embedding $m_{k}$ by:

(7) $m_{k}=\mu_{k}+\sigma_{k}\odot\delta$

where $\odot$ represents the element-wise product between $\sigma_{k}$ and $\delta$. By incorporating the sampled noise $\delta$ into the latent space representation, we introduce randomness into the model while maintaining differentiability for efficient optimization.

The obtained embeddings $m_{k}$ serve as the updated representations for the output nodes or hyperedges, capturing the variability and uncertainty within the hypergraph structure.
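In PyTorch, the reparametrization step of Equation 7 reduces to a few lines; the sketch below assumes the mean and standard-deviation tensors produced by the encoder heads above.

```python
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (7): m_k = mu + sigma * delta with delta ~ N(0, 1).
    Sampling happens on delta only, so gradients flow through mu and sigma."""
    delta = torch.randn_like(sigma)   # standard normal noise
    return mu + sigma * delta         # element-wise product and shift
```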

Decoder: This process aims to reconstruct the hypergraph from the latent space representation. By leveraging the updated embeddings obtained from the encoder process, we can recover the connection structure of the new hypergraph through a series of operations.

First, we calculate the matrix $h_{i}$ by taking the dot product between the transpose of $m_{\epsilon}$ and $m_{\sigma}$ by:

(8) $h_{i}=m_{\epsilon}^{T}m_{\sigma}$

where $m_{\epsilon}^{T}$ denotes the transpose of $m_{\epsilon}$.

Next, we apply the Gumbel-Softmax function to the matrix $h_{i}$ with a temperature coefficient $\tau$ to introduce stochasticity:

(9) $h=\text{softmax}\bigl(\text{Gumbel\_Softmax}(h_{i},\tau)+p\bigr)$

In the above equation, Gumbel_Softmax is a function that applies the Gumbel-Softmax relaxation. To prevent data overflow, we add a constant $p$ to the Gumbel-Softmax output and subsequently apply the softmax function.

After the softmax operation, the obtained matrix $h$ has two columns, representing a distribution over the hypergraph connections. We extract the first column of matrix $h$, which corresponds to the incidence matrix of the new hypergraph $\mathcal{G}_{0}=(\mathcal{V},\mathcal{E}_{0})$.

By obtaining the connection structure of the new hypergraph through the decoder, we can reconstruct the relationships and connections between nodes and hyperedges. This reconstructed hypergraph can then be further utilized for various downstream tasks.
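One plausible reading of this decoder, sketched below, scores every node-hyperedge pair and then samples a near-binary connect/drop decision with PyTorch's gumbel_softmax; the two-channel construction used to obtain the "two columns" is our assumption, and the constant offset of Equation 9 is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def decode_incidence(m_nodes, m_edges, tau=0.1):
    """Hedged sketch of Eqs. (8)-(9): m_nodes (3N, d) and m_edges (M, d) are the
    sampled latent embeddings; the returned (3N, M) matrix plays the role of the
    reconstructed incidence matrix of G_0."""
    scores = m_nodes @ m_edges.T                     # Eq. (8): pairwise node-hyperedge scores
    logits = torch.stack([scores, -scores], dim=-1)  # channel 0 = connect, channel 1 = drop (assumed)
    h = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)  # Eq. (9), relaxed sampling
    return h[..., 0]                                 # keep the "connect" channel
```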

Loss function: The loss function in VHGAE consists of two primary components designed to produce a reconstructed hypergraph that closely resembles the original one. The first component measures the Kullback-Leibler (KL) divergence between the distributions of the latent variables (nodes and hyperedges) and their corresponding prior distributions. The second component quantifies the difference in connection structure between the newly generated hypergraph and the original hypergraph. The VHGAE loss $\mathcal{L}_{g}$ is defined as follows:

(10) $\mathcal{L}_{g}=\text{KL}(m_{\sigma},\sigma)+\text{KL}(m_{\epsilon},\epsilon)+\text{CE}(h_{0},h)$

where $\text{KL}(m_{\sigma},\sigma)$ measures the KL divergence between the distribution of the sampled latent variables $m_{\sigma}$ and the prior distribution $\sigma$. Similarly, $\text{KL}(m_{\epsilon},\epsilon)$ represents the KL divergence between the distribution of the sampled hyperedge embeddings $m_{\epsilon}$ and the prior distribution $\epsilon$. The third term $\text{CE}(h_{0},h)$ denotes the cross-entropy loss, quantifying the difference in connection structure between the original hypergraph $h_{0}$ and the generated hypergraph $h$. This component ensures that the generated hypergraph closely matches the original one in terms of the distribution of connections.

By minimizing this loss function, VHGAE aims to learn an effective latent space representation that captures the essential characteristics of the hypergraph while preserving its connection structure.
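A sketch of this objective is given below. The exact form of the KL terms is not spelled out in the text, so the sketch uses the standard-normal prior from VGAE and a binary cross-entropy between the initial and reconstructed incidence matrices; both are assumptions.

```python
import torch
import torch.nn.functional as F

def vhgae_loss(mu_v, sig_v, mu_e, sig_e, H_orig, H_recon):
    """Hedged sketch of Eq. (10): KL terms for node and hyperedge latents plus a
    reconstruction term on the incidence matrix. Assumes a N(0, 1) prior (as in
    VGAE) and H_recon values in (0, 1) from the Gumbel-Softmax decoder."""
    def kl(mu, sigma):
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), averaged over entries.
        return -0.5 * torch.mean(1 + 2 * torch.log(sigma + 1e-8) - mu ** 2 - sigma ** 2)

    recon = F.binary_cross_entropy(H_recon.clamp(1e-6, 1 - 1e-6), H_orig)
    return kl(mu_v, sig_v) + kl(mu_e, sig_e) + recon
```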

3.3. Hypergraph Convolution

With the new hypergraph $\mathcal{G}_{0}=(\mathcal{V},\mathcal{E}_{0})$, we first perform node convolution by aggregating node features to update the hyperedge embeddings. The aggregation stage facilitates the integration of information from neighboring nodes into the hyperedge representation. Following the update of the hyperedge embeddings, we proceed to the hyperedge convolution stage, where hyperedge messages are disseminated to the nodes. This operation enables information propagation from hyperedges to their incident nodes. For each hyperedge $\epsilon\in\mathcal{E}_{0}$, we aggregate the embeddings of its incident nodes $v$ according to a predefined aggregation function by:

(11) $n_{\epsilon}=\text{Agg}\bigl(\{n_{v}\}_{v\in\epsilon}\bigr)$

where $n_{\epsilon}$ represents the updated embedding of the hyperedge $\epsilon$ and $n_{v}$ denotes the embedding of the node $v$. The aggregation function Agg combines the embeddings of the incident nodes to generate the new hyperedge embedding. For each node $v\in\mathcal{V}$, we then aggregate the messages from its incident hyperedges $\epsilon$ using a predefined aggregation function similar to Equation 11.

By performing the node and hyperedge convolutions, we can effectively propagate information and update the embeddings in the hypergraph $\mathcal{G}_{1}=(\mathcal{V}_{0},\mathcal{E}_{0})$. This formulation enables the capture of the relationships and interactions between nodes and hyperedges, facilitating a more comprehensive understanding of the hypergraph structure.
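The two-stage message passing can be sketched as below, with the incidence matrix driving both aggregation steps; mean aggregation is used here for simplicity, whereas the actual aggregation weights in HAUCL may differ (e.g., HGNN-style degree normalization).

```python
import torch

def hypergraph_conv(X, H):
    """Sketch of Section 3.3: X (3N, d) holds node embeddings, H (3N, M) is the
    incidence matrix of G_0. Nodes are first aggregated into hyperedges (Eq. 11),
    then hyperedge messages are sent back to their incident nodes."""
    deg_e = H.sum(dim=0, keepdim=True).clamp(min=1.0)   # hyperedge degrees, (1, M)
    E = (H.T @ X) / deg_e.T                             # nodes -> hyperedges, (M, d)
    deg_v = H.sum(dim=1, keepdim=True).clamp(min=1.0)   # node degrees, (3N, 1)
    return (H @ E) / deg_v                              # hyperedges -> nodes, (3N, d)
```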

3.4. Hypergraph Contrastive Learning

In order to mitigate the instability inherent in the sampling and decoding processes, we devise a dual-path scheme within our model. The primary objective is to minimize the dissimilarity between corresponding points in two hypergraphs $\mathcal{G}_{1}^{(1)}=(\mathcal{V}_{0}^{(1)},\mathcal{E}_{0})$ and $\mathcal{G}_{1}^{(2)}=(\mathcal{V}_{0}^{(2)},\mathcal{E}_{0})$, which are obtained through the progression of VHGAE and convolution. Concurrently, we aim to maximize the distance between each point and other points within the embedding space.

Within the context of the two hypergraph views, pairs of vertices that correspond to one another are regarded as positive pairs, whereas the remaining vertex pairs are considered negative pairs. The embedding of the $i$-th vertex in the two views is denoted as $v_{i}^{(1)}\in\mathcal{V}_{0}^{(1)}$ and $v_{i}^{(2)}\in\mathcal{V}_{0}^{(2)}$. The contrastive loss $\mathcal{L}_{cl}$ between $\mathcal{V}_{0}^{(1)}$ and $\mathcal{V}_{0}^{(2)}$ is:

(12) $\mathcal{L}_{cl}=\frac{1}{2|\mathcal{V}_{0}|}\sum_{i=1}^{|\mathcal{V}_{0}|}\bigl(f(v_{i}^{(1)},v_{i}^{(2)})+f(v_{i}^{(2)},v_{i}^{(1)})\bigr)$

Here, $\mathcal{V}_{0}$ denotes the set of vertices and $|\mathcal{V}_{0}|$ signifies its cardinality. The term $f(v_{i}^{(1)},v_{i}^{(2)})$ is calculated as:

(13) $f(v_{i}^{(1)},v_{i}^{(2)})=-\log\Bigl(\frac{q(v_{i}^{(1)},v_{i}^{(2)})}{q(v_{i}^{(1)},v_{i}^{(2)})+\sum_{i\neq j}q(v_{i}^{(1)},v_{j}^{(2)})+\sum_{i\neq j}q(v_{i}^{(1)},v_{j}^{(1)})}\Bigr)$

where

(14) $q(x,y)=e^{g(x,y)/\tau}$

Here, $\tau$ is a temperature parameter and $g(\cdot,\cdot)$ denotes the cosine similarity function. Considering that the function $f(\cdot,\cdot)$ is not symmetric between the two views, we average both directions. Specifically, $\sum_{i\neq j}q(v_{i}^{(1)},v_{j}^{(2)})$ and $\sum_{i\neq j}q(v_{i}^{(1)},v_{j}^{(1)})$ denote the negative pairs in the other view and in the same view, respectively, while $q(v_{i}^{(1)},v_{i}^{(2)})$ represents the positive pair across the two views.

By minimizing the combined loss function $\mathcal{L}_{cl}$, the similarity between corresponding points is expected to increase while the distance between each point and the other points in the embedding space is enlarged. This approach promotes alignment and discrimination of the embeddings, thereby yielding more stable and meaningful representations of the hypergraph structure.
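A compact sketch of Equations 12-14 is shown below; cosine similarity is realised by normalising the embeddings before a dot product, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def hypergraph_contrastive_loss(V1, V2, tau=0.5):
    """Sketch of Eqs. (12)-(14): corresponding nodes in the two views (rows of V1
    and V2) are positives; cross-view and intra-view pairs with i != j are
    negatives. The value of tau is illustrative."""
    z1, z2 = F.normalize(V1, dim=-1), F.normalize(V2, dim=-1)

    def one_side(a, b):
        cross = torch.exp(a @ b.T / tau)   # q(a_i, b_j) for all i, j
        intra = torch.exp(a @ a.T / tau)   # q(a_i, a_j) for all i, j
        pos = cross.diag()                 # positive pairs q(a_i, b_i)
        denom = cross.sum(dim=1) + intra.sum(dim=1) - intra.diag()  # excludes q(a_i, a_i)
        return -torch.log(pos / denom)     # Eq. (13) for every vertex i

    # Eq. (12): average over both directions and all vertices.
    return (one_side(z1, z2) + one_side(z2, z1)).mean() / 2
```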

3.5. Emotion Classifier

After acquiring contextual knowledge, we perform a fusion process on the node embeddings of the two hypergraphs $\mathcal{G}_{1}^{(1)}$ and $\mathcal{G}_{1}^{(2)}$ to obtain $\mathcal{G}_{2}=(\mathcal{V}_{2},\mathcal{E}_{2})$. This process aims to integrate the information from the two hypergraphs into a unified representation.

Following that, we concatenate the node embeddings of the three modalities that belong to the same utterance, resulting in a comprehensive representation. Specifically, let $v^{t}_{i},v^{a}_{i},v^{v}_{i}\in\mathcal{V}_{2}$ denote the node embeddings corresponding to the textual, acoustic, and visual modalities, respectively. We concatenate these embeddings to obtain a fused representation by:

(15) $v_{i}=\text{Concatenate}(v^{t}_{i},v^{a}_{i},v^{v}_{i})$

The Concatenate function combines the embeddings of the three modalities into a unified vector, allowing for the integration of multiple sources of information. The fused representation $v_{i}$ encompasses a broader range of information, enhancing subsequent analysis and prediction tasks by providing a more comprehensive input.

Given the fused representation $v_{i}$ for an utterance, the formulas for predicting the emotion label are as follows:

(16a) $\widehat{v}_{i}=\text{ReLU}(W_{2}v_{i}+b_{2})$
(16b) $P_{i}=\text{softmax}(W_{3}\widehat{v}_{i}+b_{3})$
(16c) $\widehat{y}_{i}=\text{argmax}(P_{i}[\tau])$

In these formulas, $W_{2}$ and $W_{3}$ are weight matrices, $b_{2}$ and $b_{3}$ are bias vectors, $\widehat{v}_{i}$ is the output of $v_{i}$ processed by the ReLU activation function, $P_{i}$ is the probability distribution over the emotion labels, and $\widehat{y}_{i}$ is the predicted emotion label. $\tau$ represents the dimension corresponding to the emotion labels. By applying these formulas, we can predict the emotion label for each utterance based on the fused representation $v_{i}$ and the learned parameters $W_{2}$, $W_{3}$, $b_{2}$, and $b_{3}$.
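The classifier of Equations 15-16 maps directly onto a small PyTorch module, sketched below with illustrative layer widths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """Sketch of Eqs. (15)-(16): concatenate the three modal node embeddings of an
    utterance, apply a ReLU layer, and predict the emotion label."""
    def __init__(self, d, n_classes):
        super().__init__()
        self.fc = nn.Linear(3 * d, d)        # W2, b2
        self.out = nn.Linear(d, n_classes)   # W3, b3

    def forward(self, v_t, v_a, v_v):
        v = torch.cat([v_t, v_a, v_v], dim=-1)   # Eq. (15)
        h = F.relu(self.fc(v))                   # Eq. (16a)
        logits = self.out(h)
        probs = F.softmax(logits, dim=-1)        # Eq. (16b)
        return probs.argmax(dim=-1), logits      # Eq. (16c); logits reused by the loss
```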

3.6. Training Objectives

We use the categorical cross-entropy loss with an $L_{2}$ regularization term to define the error between the predicted emotion category and the true label during training, as below:

(17) $\mathcal{L}_{ce}=-\frac{1}{\sum_{s=1}^{N}c(s)}\sum_{i=1}^{N}\sum_{j=1}^{c(i)}\log P_{i,j}[y_{i,j}]+\lambda\left\|\theta\right\|_{2}$

where $N$ represents the number of dialogues in a dataset and $c(s)$ represents the number of utterances in dialogue $s$; note that each dialogue can have a different number of utterances. $P_{i,j}$ denotes the predicted probability distribution of emotion labels for utterance $j$ in dialogue $i$, while $y_{i,j}$ represents the expected class label. The regularization weight $\lambda$ controls the importance of the regularization term relative to the cross-entropy loss.

By combining Equations 17, 10 and 12, we define the final loss function as:

(18) $\mathcal{L}=\mathcal{L}_{ce}+\lambda_{g}\mathcal{L}_{g}+\lambda_{cl}\mathcal{L}_{cl}$

where the hyperparameter weights $\lambda_{g}$ and $\lambda_{cl}$ control the importance of the hypergraph reconstruction loss and the contrastive loss, respectively.
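Combining the three terms is a one-line operation in practice; the sketch below assumes the standard PyTorch cross-entropy for Equation 17 (with the $L_{2}$ term delegated to the optimizer's weight_decay) and uses the $\lambda$ values reported for MELD in Table 1 as defaults.

```python
import torch.nn.functional as F

def total_loss(logits, labels, loss_g, loss_cl, lambda_g=0.5, lambda_cl=1.0):
    """Sketch of Eqs. (17)-(18): averaged cross-entropy over all utterances in the
    batch plus the weighted VHGAE and contrastive terms."""
    ce = F.cross_entropy(logits, labels)   # Eq. (17) without the explicit L2 term
    return ce + lambda_g * loss_g + lambda_cl * loss_cl
```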

4. Experiment

4.1. Datasets

Table 1. Main hyperparameters for HAUCL.
Dataset Batch size Learning rate $\lambda_{g}$ $\lambda_{cl}$ Epoch Dropout
MELD 12 0.0001 0.5 1 15 0.4
IEMOCAP 12 0.0001 0.8 0.1 45 0.3

In this paper, we conduct experiments on two popular multimodal datasets in the field of MERC: the interactive emotional dyadic motion capture database (IEMOCAP) (Busso et al., 2008) and the multimodal emotionlines dataset (MELD) (Poria et al., 2019).

  • IEMOCAP: It contains videos of two-way conversations with 10 actors (5 male and 5 female). IEMOCAP records the tone and power of speech, facial expressions, torso posture, head position, gestures, transcripts, and gaze in a duo session. In this paper, we use facial expressions, the tone and power of speech, and transcripts. The emotions in this dataset are manually classified into six categories: happy, sad, neutral, angry, excited, and frustrated. We use 120 dialogues containing 5,810 utterances for training and validation, and the remaining 31 dialogues with 1,623 utterances for testing.

  • MELD: It is a multimodal dataset for emotion recognition in multi-party conversations, containing textual, acoustic, and visual modalities and selected from the Friends TV series. This dataset includes seven emotions: neutral, surprise, fear, sadness, happiness, disgust, and anger. We use 1,153 dialogues with 11,098 utterances for training and validation, and the remaining 280 dialogues with 2,610 utterances for testing.

It is worth noting that IEMOCAP dataset features a fixed set of two speakers engaging in multiple rounds of conversation, whereas MELD dataset may involve multiple speakers but with fewer utterances per conversation. Meanwhile, the emotion distribution within MELD dataset is imbalanced, with a significantly higher proportion of ”neutral” emotions compared to other emotional categories, comprising nearly half of the dataset. These characteristics pose significant challenges to the model’s stability.

4.2. Experimental Settings and Baselines

Table 2. Performance of various methods (Bold font indicates the best performance).
Method IEMOCAP MELD
Emotion Categories (F1) Overall Overall
Happy Sad Neutral Angry Excited Frustrated Acc. WF1 Acc. WF1
bc-LSTM (Poria et al., 2017) 32.62 70.34 51.14 63.44 67.91 61.06 59.58 59.10 59.62 56.80
DialogueRNN (Majumder et al., 2019) 33.18 78.80 59.21 65.28 71.86 58.91 63.40 62.75 60.31 57.66
DialogueCRN (Hu et al., 2021b) 51.59 74.54 62.38 67.25 73.96 59.97 65.31 65.34 59.66 56.76
DialogueGCN (Ghosal et al., 2019) 47.10 80.88 58.71 66.08 70.97 61.21 65.54 65.04 58.62 56.36
MMGCN (Hu et al., 2021a) 45.45 77.53 61.99 66.67 72.04 64.12 65.56 68.71 59.31 57.82
DIMMN (Wen et al., 2023) 30.2 74.2 59.0 62.7 72.5 66.6 64.7 64.1 60.6 58.6
MM-DFN (Hu et al., 2022) 42.22 78.98 66.42 69.77 75.56 66.33 68.21 68.18 62.49 59.46
COGMEN (Joshi et al., 2022) 51.91 81.72 68.61 66.02 75.31 58.23 68.26 67.63 62.53 61.77
GraphMFT (Li et al., 2023) 45.99 83.12 63.08 70.30 76.92 63.84 67.90 68.07 61.30 58.37
M3NET (Chen et al., 2023) 57.96 81.56 68.30 65.59 74.91 63.19 69.01 69.12 67.62 66.15
HAUCL (ours) 53.57 82.04 68.61 66.44 75.60 68.23 70.30 70.27 68.05 66.72
Figure 4. Sensitivity analysis of HAUCL on the MELD dataset: (a) effect of $\lambda_{g}$; (b) effect of $\lambda_{cl}$; (c) effect of hypergraph layers; (d) effect of batch size. Each experiment varies one parameter while fixing all other parameters at their best-performing values.

We perform all experiments on an NVIDIA GTX 1050Ti with the Windows 11 operating system. The versions of PyTorch and CUDA are 2.1.2 and 11.8, respectively. The Adam optimizer is used for training. We set the batch size to 12 and the learning rate to 0.0001 on both datasets. The hyperparameter $\tau$ of Gumbel-Softmax in Equation 9 is set to 0.1, and the number of hypergraph convolutions is 1. More details regarding the main parameters can be found in Table 1.

In order to validate the performance of the proposed method HAUCL in the MERC task, we introduce ten state-of-the-art methods for comparison: (1) non-graph learning: bc-LSTM (Poria et al., 2017), DialogueRNN (Majumder et al., 2019), and DialogueCRN (Hu et al., 2021b); (2) standard graph learning: DialogueGCN (Ghosal et al., 2019), MMGCN (Hu et al., 2021a), DIMMN (Wen et al., 2023), MM-DFN (Hu et al., 2022), COGMEN (Joshi et al., 2022), and GraphMFT (Li et al., 2023); and (3) hypergraph learning: M3NET (Chen et al., 2023). More details regarding the baseline methods can be found in Section 2.

For validation, we adopt the most mainstream evaluation metrics in this field: accuracy (Acc.) and weighted F1 score (WF1).

4.3. Performance Comparison

Table 2 summarizes the performance of different methods tested on the IEMOCAP and MELD datasets. The results show that our proposed HAUCL achieves superior performance in terms of overall accuracy and weighted F1 score. In detail, compared with M3NET, which achieves the second-best performance, HAUCL enhances the accuracy and WF1 by 0.43% and 0.57% respectively on the MELD dataset, and by 1.29% and 1.15% on the IEMOCAP dataset.

Our model also has advantages in training time and model size. The average training time of each epoch of HAUCL is 38 seconds and the model size is 173,945 KB. In comparison, M3NET, which obtains the second-best performance, has a training time of 56 seconds with a size of 608,867 KB. This advantage can be attributed to the fact that HAUCL utilizes only one hypergraph convolution layer, whereas M3NET incorporates three convolutions and three high-frequency information convolutions.

The ability of HAUCL to dynamically modify the connection structure of the hypergraph helps in reducing information redundancy, particularly on the IEMOCAP dataset, which has a high average number of utterances per conversation. Additionally, the use of hypergraphs helps prevent the excessive smoothing that tends to occur in standard graph-based methods. Compared with non-graph learning methods, our proposed HAUCL demonstrates significant enhancement in long-distance information transmission and multimodal information fusion, resulting in satisfactory accuracy and weighted F1 score performance.

4.4. Sensitivity Analysis

We select the following four main parameters in HAUCL for sensitivity analysis on the MELD dataset. Figures 4(a) and 4(b) show the effect of the weights of the hypergraph reconstruction loss and the contrastive learning loss in the total loss, respectively. Figure 4(c) examines the number of convolution layers used by the reconstructed hypergraph to learn contextual information. Figure 4(d) shows the effect of batch size. Similar trends are also observed on the IEMOCAP dataset.

The weight of the hypergraph reconstruction loss $\lambda_{g}$: It reflects the contribution of the deviation from the original graph to the total loss (see Equation 18). Higher values of $\lambda_{g}$ force the reconstructed hypergraph to more closely resemble the original graph. As shown in Figure 4(a), when $\lambda_{g}$ is set to 0.5, our method demonstrates optimal accuracy, while deviating from this value in either direction results in a decline in overall performance.

The weight of the contrastive learning loss $\lambda_{cl}$: Similar to $\lambda_{g}$, as the value of $\lambda_{cl}$ increases, the method focuses increasingly on the differences between the two hypergraphs derived from the two paths. Conversely, as the dissimilarity between the two hypergraphs diminishes, our proposal's capability to withstand interference strengthens. However, an excessively large value impacts the emotion recognition loss: when $\lambda_{cl}$ exceeds 1.1, accuracy degrades, as plotted in Figure 4(b).

Hypergraph layer $L$: Figure 4(c) demonstrates that increasing the number of convolutional layers in a hypergraph does not necessarily lead to enhanced accuracy. A large value of $L$ not only amplifies the model's complexity and runtime, but also risks oversmoothing, potentially complicating the differentiation of emotions with similar characteristics. When $L$ is 5, there is a sharp decrease in accuracy, indicating an over-smoothing phenomenon.

Batch size: The selection of batch size is a crucial factor that impacts recognition performance (Kandel and Castelli, 2020; He et al., 2019). Given the non-uniform distribution of the MELD dataset, employing a batch size that is too small can render the model susceptible to interference from small samples, leading to significant gradient fluctuations and convergence challenges. Conversely, an excessively large batch size may prompt the model to overly generalize, potentially compromising accuracy. As depicted in Figure 4(d), the best performance is attained with a batch size of 12.

4.5. Ablation Study

Table 3. Accuracy (%) of HAUCL for the ablation study.
Method IEMOCAP MELD
w/o SE 69.19 67.24
w/o GCL 69.52 67.62
w/o CL 69.32 67.47
HAUCL (ours) 70.30 68.05

For a more comprehensive analysis of the effectiveness of our proposed method HAUCL, we conduct ablation experiments from three different aspects: the impact of (1) speaker embedding, (2) VHGAE and contrastive learning, and (3) contrastive learning in terms of accuracy performance. The results are summarized in Table 3.

Impact of Speaker Embedding: Speaker embedding can distinguish the input features from different speakers. Existing research has shown that incorporating speaker information can enhance the accuracy of emotion recognition tasks (Hu et al., 2021a). The exclusion of speaker embedding (”w/o SE” in Table 3) results in a decrease in accuracy, with a degradation of 1.11% and 0.81% observed on IEMOCAP and MELD datasets, respectively. These findings indicate that incorporating person modeling can enhance the model’s performance in the MERC domain.

Impact of VHGAE and Contrastive Learning: We utilize VHGAE for dynamic hyperedge selection to minimize redundancy and employ contrastive learning to mitigate random errors. In Table 3, ”w/o GCL” indicates the direct fusion of two hypergraph convolutions without the inclusion of VHGAE and contrastive learning module. The results demonstrate the effectiveness of our proposed HAUCL: It can improve the accuracy performance by 0.78% and 0.43% on IEMOCAP and MELD datasets respectively.

Impact of Contrastive Learning: ”w/o CL” in Table 3 refers to the model that incorporates hypergraph and VHGAE without the integration of contrastive learning. The experimental results indicate a 0.98% enhancement in accuracy on IEMOCAP dataset and a 0.58% improvement on MELD dataset. These results verify that the contrastive learning module effectively controls the random fluctuations of VHGAE, enhancing model accuracy while reducing information redundancy.

4.6. Visualization

Figure 5. Visualization of our proposed HAUCL (a) and M3NET (b) on the MELD dataset.

In order to demonstrate the discriminability of nodes, we present the node representations acquired through our proposed method HAUCL and M3NET (the second-best method in Table 2) on the MELD dataset. To visualize these representations in a more comprehensive manner, we employ the t-SNE (Van der Maaten and Hinton, 2008) method for dimensionality reduction, transforming the obtained nodes into three dimensions. Furthermore, we assign distinct colors to indicate the true labels of the nodes. By comparing the two panels in Figure 5, it is evident that the data points depicted in Figure 5(a) (our method) exhibit greater separation, resulting in a more discriminative segmentation. As aforementioned, the representations derived from the proposed HAUCL exhibit reduced redundancy and enhanced discriminability, thereby enabling the attainment of superior outcomes.

4.7. Time Complexity

This subsection summarizes the time complexity of hypergraph reconstruction. The hypergraph construction in HAUCL includes a variational encoding part and a decoding part. Starting with the encoder, a hypergraph convolution layer is involved with time complexity $O(Nd^{2}+NdM)$, where $N$ is the number of utterances, $M$ is the number of hyperedges in the initial fully connected input hypergraph, and $d$ is the feature dimension. Generating the latent mean and variance embeddings in Equations 6a and 6b costs $O(Nd^{2})$. The sampling process costs $O(N+M)$. In the decoding process, the complexity of computing the incidence matrix is $O(NMd^{2})$. In all, since $M=N+3$, the total computational cost of our framework is $O(NMd^{2})=O(N^{2}d^{2})$.

5. Conclusion

In this paper, we propose a joint learning framework based on hypergraph learning to improve the performance of MERC. This framework aims to address the issue of excessive redundancy stemming from the fully connected structure of graphs or hypergraphs. The proposed method HAUCL effectively integrates hypergraph adaptive reconstruction and contrastive learning, which reduces information redundancy and enhances accuracy. Experimental results verify the superiority of our proposed method against state-of-the-art ones. In the future, we expect to integrate external knowledge, such as large language models (LLM), into our framework. By focusing on linear labels such as valence-arousal-dominance (VAD) in dimensional emotion space, we aim to substitute classification labels with the goal of enhancing machines’ comprehension of human behavior.

References

  • Antelmi et al. (2023) Alessia Antelmi, Gennaro Cordasco, Mirko Polato, Vittorio Scarano, Carmine Spagnuolo, and Dingqi Yang. 2023. A Survey on Hypergraph Representation Learning. Comput. Surveys 56 (2023), 1–38.
  • Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Ebrahim (Abe) Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (2008), 335–359.
  • Chen et al. (2023) Feiyu Chen, Jiejing Shao, Shuyuan Zhu, and Heng Tao Shen. 2023. Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10761–10770.
  • Dey and Salem (2017) Rahul Dey and Fathi M Salem. 2017. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS).
  • Eyben et al. (2010) Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM International Conference on Multimedia (MM). 1459–1462.
  • Gandhi et al. (2023) Ankita Gandhi, Kinjal Adhvaryu, Soujanya Poria, Erik Cambria, and Amir Hussain. 2023. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion 91 (2023), 424–444.
  • Gao et al. (2022) Yue Gao, Yifan Feng, Shuyi Ji, and Rongrong Ji. 2022. HGNN+: General Hypergraph Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2022), 3181–3199.
  • Gao et al. (2020) Yue Gao, Zizhao Zhang, Haojie Lin, Xibin Zhao, S. Du, and Changqing Zou. 2020. Hypergraph Learning: Methods and Practices. IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2020), 2548–2566.
  • Ghosal et al. (2019) Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Conference on Empirical Methods in Natural Language Processing (EMNLP). 154–164.
  • Hazarika et al. (2022) Devamanyu Hazarika, Yingting Li, Bo Cheng, Shuai Zhao, Roger Zimmermann, and Soujanya Poria. 2022. Analyzing Modality Robustness in Multimodal Sentiment Analysis. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL). 685–696.
  • He et al. (2019) Fengxiang He, Tongliang Liu, and Dacheng Tao. 2019. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Vol. 32. 1143–1152.
  • Hu et al. (2022) Dou Hu, Xiaolong Hou, Lingwei Wei, Lian-Xin Jiang, and Yang Mo. 2022. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), 7037–7041.
  • Hu et al. (2021b) Dou Hu, Lingwei Wei, and Xiaoyong Huai. 2021b. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 7042–7052.
  • Hu et al. (2021a) Jingwen Hu, Yuchen Liu, Jinming Zhao, and Qin Jin. 2021a. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 5666–5675.
  • Huang et al. (2016) Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 2261–2269.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
  • Joshi et al. (2022) Abhinav Joshi, Ashwani Bhat, Ayush Jain, Atin Singh, and Ashutosh Modi. 2022. COGMEN: COntextualized GNN based Multimodal Emotion recognitioN. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 4148–4164.
  • Kandel and Castelli (2020) Ibrahem Kandel and Mauro Castelli. 2020. The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset. ICT express 6 (2020), 312–315.
  • Khare et al. (2024) Smith K. Khare, Victoria Blanes-Vidal, Esmaeil S. Nadimi, and U. Rajendra Acharya. 2024. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Information Fusion 102 (2024), 102019.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Li et al. (2023) Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2023. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing 550 (2023), 126427.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al. (2024) Yuze Liu, Ziming Zhao, Tiehua Zhang, Kang Wang, Xin Chen, Xiaowei Huang, Jun Yin, and Zhishu Shen. 2024. Exploiting Spatial-Temporal Data for Sleep Stage Classification via Hypergraph Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5430–5434.
  • Majumder et al. (2019) Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: an attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence. 6818–6825.
  • Poria et al. (2017) Soujanya Poria, E. Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 873–883.
  • Poria et al. (2019) Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 527–536.
  • Rong et al. (2019) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2019. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Ruan et al. (2021) Ding Ruan, Shuyi Ji, Chenggang Clarence Yan, Junjie Zhu, Xibin Zhao, Yuedong Yang, Yue Gao, Changqing Zou, and Qionghai Dai. 2021. Exploring complex and heterogeneous correlations on hypergraph for the prediction of drug-target interactions. Patterns 2, 12 (2021), 100390.
  • Shoumy et al. (2020) Nusrat J. Shoumy, Li-Minn Ang, Kah Phooi Seng, D.M.Motiur Rahaman, and Tanveer Zia. 2020. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. Journal of Network and Computer Applications 149 (2020), 102447.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Wang et al. (2023) Yunxiao Wang, Meng Liu, Zhe Li, Yupeng Hu, Xin Luo, and Liqiang Nie. 2023. Unlocking the Power of Multimodal Learning for Emotion Recognition in Conversation. In Proceedings of the ACM International Conference on Multimedia (MM). 5947–5955.
  • Wei et al. (2022) Tianxin Wei, Yuning You, Tianlong Chen, Yang Shen, Jingrui He, and Zhangyang Wang. 2022. Augmentations in Hypergraph Contrastive Learning: Fabricated and Generative. In Proceedings of the Conference on Neural Information Processing Systems (NIPS). 1909–1922.
  • Wen et al. (2023) Jintao Wen, Dazhi Jiang, Geng Tu, Cheng Liu, and Erik Cambria. 2023. Dynamic interactive multiview memory network for emotion recognition in conversation. Information Fusion 91 (2023), 123–133.
  • Xia et al. (2022) Lianghao Xia, Chao Huang, and Chuxu Zhang. 2022. Self-Supervised Hypergraph Transformer for Recommender Systems. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2100–2109.
  • Yan et al. (2020) Yichao Yan, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Ying Tai, and Ling Shao. 2020. Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2899–2908.
  • Yang et al. (2022) Dingkang Yang, Shuai Huang, Haopeng Kuang, Yangtao Du, and Lihua Zhang. 2022. Disentangled Representation Learning for Multimodal Emotion Recognition. In Proceedings of the ACM International Conference on Multimedia (MM). 1642–1651.
  • Yang et al. (2019) Hao Yang, Chunfeng Yuan, Bing Li, Yang Du, Junliang Xing, Weiming Hu, and Stephen J. Maybank. 2019. Asymmetric 3D Convolutional Neural Networks for action recognition. Pattern Recognition 85 (2019), 1–12.
  • Zhang et al. (2019) Hainan Zhang, Yanyan Lan, Liang Pang, J. Guo, and Xueqi Cheng. 2019. ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 3721–3730.
  • Zhang et al. (2023) Tiehua Zhang, Yuze Liu, Zhishu Shen, Xingjun Ma, Xin Chen, Xiaowei Huang, Jun Yin, and Jiong Jin. 2023. Learning from heterogeneity: A dynamic learning framework for hypergraphs. arXiv preprint arXiv:2307.03411 (2023).
  • Zhang et al. (2024) Tiehua Zhang, Yuze Liu, Zhishu Shen, Rui Xu, Xin Chen, Xiaowei Huang, and Xi Zheng. 2024. An adaptive federated relevance framework for spatial temporal graph learning. IEEE Transactions on Artificial Intelligence 5, 5 (2024), 2227–2240.
  • Zhao et al. (2021) Sicheng Zhao, Guoli Jia, Jufeng Yang, Guiguang Ding, and Kurt Keutzer. 2021. Emotion Recognition From Multiple Modalities: Fundamentals and methodologies. IEEE Signal Processing Magazine 38 (2021), 59–73.