
1 Queen Mary University of London, UK
2 National Technical University of Athens, Greece
Email: d.kollias@qmul.ac.uk

MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation

Dimitrios Kollias 1    Andreas Psaroudakis 2    Anastasios Arsenos 2    Paraskevi Theofilou 2    Chunchang Shao 1    Guanyu Hu 1    Ioannis Patras 1
Abstract

This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.

Keywords:
MMA-MRNNet · Masked RNN Routing · Facial Expression Intensity Estimation · ABAW · MuSe · Hume-Reaction dataset · Valence-Arousal Estimation · Basic Expression Recognition · Action Unit Detection

1 Introduction

Human emotions are complex, conscious experiences that profoundly influence behavior and can be expressed in various forms. These emotions are pivotal in psychological processes and significantly impact human actions. The advent of Artificial Intelligence (AI) and Deep Learning (DL) has driven the development of intelligent systems capable of recognizing and interpreting human emotions. Psychologists have proposed multiple descriptors to quantify and categorize emotional states: sparse descriptors like facial action units (AUs), which capture specific facial muscle activations [10]; continuous descriptors such as valence and arousal, where valence indicates the positivity or negativity of the emotion, and arousal reflects the level of activation or passivity [50]; and discrete class descriptors like the six basic expressions (anger, disgust, fear, happiness, sadness, surprise) and the neutral state [9]. This paper focuses on dynamic multi-output Facial Expression Intensity Estimation (FEIE), specifically targeting the intensity estimation of expressions such as Adoration, Amusement, Anxiety, Disgust, Empathic-Pain, Fear, and Surprise.

In this paper, we introduce our approach MMA-MRNNet, a novel deep learning architecture designed to tackle the complexities of FEIE in scenarios where video-level annotations (i.e., there exists one annotation for the whole video) are provided rather than frame-level annotations. The key challenges addressed by MMA-MRNNet include handling the variability in video lengths and accurately aggregating temporal information across frames to make a robust final prediction.

Traditional approaches for processing 3-D signals, such as video data, typically employ 3-D CNNs that produce a single prediction per signal. However, these architectures are inherently complex, with a high number of parameters, and often require pre-training on large 3-D datasets to achieve satisfactory performance. Another common approach involves assigning the video-level label uniformly to each frame and then using CNN-RNN networks to train on these annotated frames. This approach assumes that the facial expression intensity is consistent across all frames, which may not be the case, as only a subset of frames might actually display the labeled intensity [24, 1, 23, 26, 31, 34, 39, 12].

Moreover, our approach addresses the challenge of variable-length input videos. Traditional methods often rely on ad-hoc strategies to manage varying numbers of frames, such as setting a fixed input length and either discarding excess frames (which risks losing critical information) or duplicating frames in shorter videos (which can bias the model towards repeated data). These strategies are not only suboptimal but also require empirical tuning for each specific dataset, limiting their generalizability and effectiveness.

MMA-MRNNet comprises two primary components: the Multiple Models of Affect (MMA) extractor and the Masked RNN and Routing Network (MRNN). The MMA component is a Multi-Task Learning (MTL) CNN that extracts affective representations from each frame by concurrently estimating valence-arousal (VA), recognizing the 7 basic expressions, and detecting multiple action units (AUs). To ensure the reliability and consistency of these representations, we introduce a novel loss function that incorporates prior knowledge of the relationships between different affective descriptors, mitigating issues like noisy gradients and poor convergence typically encountered in MTL settings.

The extracted representations are then passed to the MRNN component, which consists of an RNN designed to capture temporal dependencies across the sequence of frames. To handle the varying lengths of input videos, a Mask layer is employed within the MRNN. This layer dynamically selects relevant RNN outputs based on the actual number of frames in the video, allowing the model to adapt to variable input lengths without compromising the integrity of the temporal information. The selected features are then passed through fully connected layers to produce the final intensity estimation for the entire video.

To the best of our knowledge, MMA-MRNNet is the first architecture to leverage valence-arousal, AUs, and basic expressions as intermediate representations for the task of Facial Expression Intensity Estimation. This approach not only enhances the model’s ability to capture the nuanced dynamics of emotional expressions but also provides a robust framework for handling real-world data with varying input conditions.

2 Related Work

[16] presented Supervised Scoring Ensemble (SSE) for emotion recognition. A new fusion structure is presented in which class-wise scoring activations at diverse complementary feature layers are concatenated and used as inputs for second-level supervision, acting as a deep feature ensemble within a single CNN architecture. [60] proposed a deep Visual-Audio Attention Network (VAANet) for video emotion recognition; VAANet integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. A polarity-consistent cross-entropy loss is proposed for guiding the attention generation, which is based on the polarity-emotion hierarchy constraint. [13] constructed an A/V hybrid network to recognize human emotions. A VGG-Face (for extracting per-frame features) and LSTM (for correlating these features according to their temporal dependencies) architecture was used for the visual data.

[44] was the winning method of the Emotional Reaction Intensity (ERI) Estimation Challenge of the 5th ABAW Challenge [40, 18, 41, 42, 33, 27, 29, 30, 19, 20, 35, 25, 21, 36, 31, 22, 37, 34, 26, 1, 23]. This method consists of an audio feature encoding module (based on DenseNet121 and DeepSpectrum), a visual feature encoding module (based on PosterV2-Vit), and an audio-visual modality interaction module. [53] proposed ViPER, a modality agnostic late fusion network that leverages a transformer-based model that combines video frames, audio recordings, and textual annotations for FEIE. [56] proposed a dual-branch FEIE model; the one branch (composed of Temporal CNN and Transformer encoder) handles the visual modality and the other handles the audio one; modality dropout is added for A/V feature fusion. [51] achieved the 3rd place in the ERI challenge of the 5th ABAW; it proposed a methodology that involved extracting features from visual, audio, and text modalities using Vision Transformers, HuBERT, and DeBERTa. Temporal augmentation and SE blocks were applied to enhance temporal generalization [3, 4, 48, 2, 17] and contextual understanding. Features from each modality were then processed through contextual layers and fused using a late fusion strategy. [54] presented a methodology that involved extracting visual features from video frames using models like FAb-Net, EfficientNet, and DAN, which capture facial expressions and attributes. Audio features are obtained using Wav2Vec2 and VGGish models. The extracted features were then processed through a temporal convolutional network to capture local temporal information, followed by a Transformer Encoder to model long-range dependencies with dynamic attention. [59] presented a methodology that involved extracting audio and visual features using state-of-the-art models and aligning these features to a common dimension using an Affine Module. The aligned features were then fused using a Multimodal Multi-Head Attention model.

3 Methodology

Formulation. The input to our method is a video consisting of multiple instances (i.e., video frames), $\textbf{X}=\{\textbf{x}_{1},...,\textbf{x}_{K}\}$, with $\textbf{x}_{k}\in\Re^{H\times W\times 3}$. $K$ is the number of instances (frames), which varies for different videos; $H$ and $W$ denote the height and width of the RGB images (frames). There is a video-level label $Y$. We further assume the instances also have corresponding instance-level labels $\{\textbf{y}_{1},...,\textbf{y}_{K}\}$, which are unknown during training; the instance-level labels (of all instances of the same video) do not necessarily match the video-level label. There are $N$ such video-label pairs constituting the database $DB=\{\textbf{X}_{n},Y_{n}\}_{n=1}^{N}$. Our objective is to learn an optimal function for predicting the video-level label with the video's instances as input. To this end, our method should be able to:
1) effectively handle the fact that input videos have variable lengths (i.e., the total number of frames varies across videos);
2) aggregate the information of the instances $\{\textbf{x}_{k}\}_{k=1}^{K}$ to make the final decision. A well-adopted aggregation method is the embedding-based approach, which maps $\textbf{X}$ to a video-level representation $\textbf{z}\in\Re^{F}$ and uses $\textbf{z}$ to predict $Y$.

Initially, all videos $\{\textbf{X}_{n}\}_{n=1}^{N}$ are padded to a uniform length $t$, resulting in video sequences $\textbf{X}=\{\textbf{x}_{1},...,\textbf{x}_{t}\}$. Each video $\textbf{X}$ is then processed by the Multiple Models of Affect (MMA) extractor component, which conducts local analysis on each 2-D frame, mapping $\textbf{X}$ to a multiple affect-level representation matrix $\textbf{Z}=\{\textbf{z}_{1},...,\textbf{z}_{t}\}\in\Re^{d\times t}$. This matrix is subsequently passed to an RNN, positioned on top of the MMA component, to capture temporal dependencies across all $\{\textbf{z}_{k}\}_{k=1}^{t}$. The RNN transforms $\textbf{Z}$ into an embedding matrix $\textbf{Z}^{\prime}=\{\textbf{z}^{\prime}_{1},...,\textbf{z}^{\prime}_{t}\}\in\Re^{d^{\prime}\times t}$, performing global analysis over the entire video. The subsequent module aggregates the set of embeddings $\{\textbf{z}^{\prime}_{k}\}_{k=1}^{t}$ into a single video-level vector embedding $\textbf{z}^{\prime}\in\Re^{d^{\prime}\cdot t}$, which is then fed to a Mask layer. The Mask layer dynamically selects embeddings based on the 'true' frame count of the video, accounting for the original number of frames prior to padding. This step is crucial because the video-level annotations imply that all frames collectively, rather than individually, carry important information for an accurate prediction. The output of the Mask layer, $\textbf{z}^{\prime\prime}\in\Re^{d^{\prime}\cdot t}$, is then mapped to another embedding $\textbf{z}^{\prime\prime\prime}\in\Re^{d^{\prime\prime}}$ using a feed-forward layer. Finally, $\textbf{z}^{\prime\prime\prime}$ is transformed into the logits $\textbf{u}$ via a feed-forward layer parameterized by $\textbf{W}$, leading to the video-level classification: $\textbf{u}=\textbf{W}^{T}\textbf{z}^{\prime\prime\prime}$.
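The overall pipeline can be summarized with a minimal TensorFlow sketch (illustrative only: the per-frame extractor below is a simple placeholder for the MMA CNN, and the dimensions follow the implementation details of Section 3.3):

```python
import tensorflow as tf

T, D_AFFECT, D_RNN = 480, 26, 128   # padded length t, per-frame affect dim, GRU units

# Placeholder per-frame extractor standing in for the MMA CNN (illustrative only).
mma = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(D_AFFECT),
])
gru = tf.keras.layers.GRU(D_RNN, return_sequences=True)
ff1 = tf.keras.layers.Dense(32, activation="relu")
ff2 = tf.keras.layers.Dense(7)   # one intensity (logit) per facial expression

def mma_mrnnet(frames, true_len):
    """frames: (B, T, 112, 112, 3) padded videos; true_len: (B,) pre-padding frame counts."""
    b = tf.shape(frames)[0]
    z = mma(tf.reshape(frames, (-1, 112, 112, 3)))               # per-frame affect vectors
    z = tf.reshape(z, (b, T, D_AFFECT))                          # matrix Z
    z_p = gru(z)                                                 # matrix Z' (temporal analysis)
    flat = tf.reshape(z_p, (b, T * D_RNN))                       # concatenated embedding z'
    # Mask layer: zero out the positions produced by padded frames.
    mask = tf.sequence_mask(true_len, maxlen=T, dtype=flat.dtype)       # (B, T)
    masked = flat * tf.repeat(mask, D_RNN, axis=-1)              # z''
    return ff2(ff1(masked))                                      # logits u

logits = mma_mrnnet(tf.zeros((1, T, 112, 112, 3)), tf.constant([100]))
```

In this sketch, `tf.sequence_mask` plays the role of the Mask layer: positions beyond the true frame count are zeroed before the feed-forward layers.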

In the following, we further explain in more detail each component of our proposed method. Fig. 1 gives an overview of our proposed framework, MMA-MRNNet.

Figure 1: Overview of the proposed MMA-MRNNet for dynamic multi-output Facial Expression Intensity Estimation. MMA-MRNNet comprises two main components: the Multiple Models of Affect (MMA) extractor, which generates affective representations (valence-arousal, basic expressions, and action units) for each video frame, and the Masked RNN and Routing (MRNN), which captures temporal dependencies and dynamically selects key features (and updates weights) according to the variable lengths of input videos.

3.1 MMA: Multiple Models of Affect extractor Component

The Multiple Models of Affect (MMA) extractor component processes an input video $\textbf{X}$ by extracting affective representations from each frame using three distinct models of affect. Specifically, the MMA is a Multi-Task Learning (MTL) CNN model that concurrently performs: (i) continuous affect estimation in terms of valence and arousal (VA); (ii) recognition of the 7 basic facial expressions; and (iii) detection of 17 action units (AUs). The architecture of the MMA, illustrated in Fig. 2, is structured around residual units, with 'bn' indicating batch normalization layers. The model integrates the valence-arousal estimation, 7 basic expression recognition, and 17 AU detection tasks within the same embedding space derived from a shared feed-forward layer. Consequently, the output of the MMA when processing $\textbf{X}$ is a multiple affect-level representation matrix $\textbf{Z}=\{\textbf{z}_{1},...,\textbf{z}_{t}\}\in\Re^{26\times t}$.

Figure 2: The Multiple Models of Affect extractor component (MMA), which outputs for each frame the following emotional descriptors: valence and arousal, 17 action units, and 7 basic expressions.
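As a sketch, the three task heads can share one embedding and be concatenated into the 26-d per-frame affect vector; the residual backbone of Fig. 2 is omitted here, and the shared embedding size of 512 as well as the output activations are our assumptions:

```python
import tensorflow as tf

def build_mma_heads(shared_dim=512):
    """Illustrative multi-task heads on top of a shared feed-forward embedding."""
    x = tf.keras.Input(shape=(shared_dim,), name="shared_embedding")
    va   = tf.keras.layers.Dense(2, activation="tanh", name="valence_arousal")(x)   # assumed in [-1, 1]
    expr = tf.keras.layers.Dense(7, activation="softmax", name="expressions")(x)    # 7 basic expressions
    aus  = tf.keras.layers.Dense(17, activation="sigmoid", name="action_units")(x)  # 17 AUs
    z = tf.keras.layers.Concatenate(name="affect_vector")([va, expr, aus])          # 26-d per-frame output z_k
    return tf.keras.Model(x, [va, expr, aus, z])
```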

For training the MMA, we utilize multiple in-the-wild datasets, including Aff-Wild2 [38, 40, 20, 36, 22, 37, 34, 28, 57, 18, 33, 41, 32, 42], AffectNet [49], and EmotioNet [5], which are annotated for valence-arousal, the 7 basic expressions, and action units (the 17 AUs are the aggregate of the AU sets annotated across these datasets).

Our recent studies [15, 14] have highlighted challenges in the current evaluation of affect analysis methods, noting inconsistencies in database partitioning and evaluation practices that lead to biased and unfair comparisons. To address these issues, a unified protocol for database partitioning was proposed, ensuring fairness and comparability, while also accounting for subjects’ demographic information. It was demonstrated that methods previously considered state-of-the-art on original partitions may not retain their performance under this new protocol. Consequently, in this current paper, we adopt this updated partitioning protocol.

A key challenge in utilizing these datasets is the non-overlapping nature of their task-specific annotations. For instance, EmotioNet only includes AU annotations, lacking valence-arousal and 7 basic expression labels. Training the MMA directly with these datasets using a combined loss function for all tasks would result in noisy gradients and poor convergence, as not all loss terms would be consistently contributing to the overall objective function. This can lead to issues typical of Multi-Task Learning (MTL), such as task imbalance, where one task may dominate training, or negative transfer, where the MTL model underperforms compared to single-task models [34, 33, 32].

To address this issue, we generate AU pseudo-representations ($r_{AU}^{\prime}$) from the 7 basic expression representations ($r_{expr}$) produced by the MMA for each frame. This is achieved by leveraging the relationship between expressions and AUs, as defined in Table 1 of [8]. The study by [8] conducted a cognitive and psychological analysis of the associations between facial expressions and AU activations, summarizing the findings in a table that details the relatedness between expressions and their corresponding AUs. This table is presented in Table 1 for reference. Prototypical AUs are those consistently identified as activated by all annotators, while observational AUs are those marked as activated by only a subset of annotators.

Table 1: Relatedness of expressions & AUs inferred from [8]
Expression | Prototypical AUs | Observational AUs
happiness | 12, 25 | 6
sadness | 4, 15 | 1, 6, 11, 17
fear | 1, 4, 20, 25 | 2, 5, 26
anger | 4, 7, 24 | 10, 17, 23
surprise | 1, 2, 25, 26 | 5
disgust | 9, 10, 17 | 4, 24

The AU pseudo-representations are modeled as a mixture over the basic expression categories:

$r_{AU}^{\prime}=\sum\nolimits_{expr}r_{expr}\cdot r_{AU|expr}$   (1)

where $r_{AU|expr}$ is defined deterministically from Table 1, and is 1 for prototypical or observational AUs, and 0 otherwise.
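Equation (1) amounts to a fixed matrix product between the per-frame expression probabilities and a binary relatedness matrix built from Table 1. A sketch follows; the AU index ordering and the treatment of the neutral class (which Table 1 does not cover and which we assume activates no AUs) are our assumptions:

```python
import numpy as np

# Assumed ordering of the 17 AUs aggregated across the training datasets.
AU_IDS = [1, 2, 4, 5, 6, 7, 9, 10, 11, 12, 15, 17, 20, 23, 24, 25, 26]
EXPRESSIONS = ["happiness", "sadness", "fear", "anger", "surprise", "disgust", "neutral"]

# Prototypical + observational AUs per expression (Table 1); neutral assumed to activate none.
RELATED_AUS = {
    "happiness": [12, 25, 6],
    "sadness":   [4, 15, 1, 6, 11, 17],
    "fear":      [1, 4, 20, 25, 2, 5, 26],
    "anger":     [4, 7, 24, 10, 17, 23],
    "surprise":  [1, 2, 25, 26, 5],
    "disgust":   [9, 10, 17, 4, 24],
    "neutral":   [],
}

# r_{AU|expr}: binary matrix of shape (7 expressions, 17 AUs).
R_AU_GIVEN_EXPR = np.zeros((len(EXPRESSIONS), len(AU_IDS)), dtype=np.float32)
for e, expr in enumerate(EXPRESSIONS):
    for au in RELATED_AUS[expr]:
        R_AU_GIVEN_EXPR[e, AU_IDS.index(au)] = 1.0

def au_pseudo_targets(r_expr):
    """Eq. (1): mixture over the expression categories; r_expr is (batch, 7) probabilities."""
    return r_expr @ R_AU_GIVEN_EXPR   # (batch, 17) soft AU targets r'_AU
```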

Then we match the MMA's AU representations ($r_{AU}$) with the AU pseudo-representations by minimizing a binary cross-entropy loss with soft targets:

$\mathcal{L}_{DM}=\mathbb{E}\Big[\sum_{i=1}^{17}\big[-{r^{\prime}}_{AU}^{\,i}\cdot\log r_{AU}^{i}\big]\Big]$   (2)

With this loss we aim to infuse prior knowledge of the tasks' relationships (according to Table 1) into the network, so as to guide the generation of better and more consistent expression and AU representations. For instance, if the network predicts happiness with probability 1 but also predicts that AUs 4, 15, and 1 are activated, this is a mistake, as these AUs are associated with the expression sadness; in this case, the AU and expression representations are in conflict. Therefore, the overall objective function ($\mathcal{L}_{MMA}$) minimized during MMA's training is:

$\mathcal{L}_{MMA}=\mathcal{L}_{CCC}+\mathcal{L}_{CCE}+\mathcal{L}_{BCE}+\mathcal{L}_{DM}$   (3)

where:
$\mathcal{L}_{CCC}$ is the loss term associated with the valence-arousal estimation task, $\mathcal{L}_{CCC}=1-0.5\cdot(\rho_{a}+\rho_{v})$, with $\rho_{a/v}$ being the Concordance Correlation Coefficient (CCC) of arousal/valence;
$\mathcal{L}_{CCE}$ is the categorical cross-entropy loss associated with the 7 basic expression recognition task; and
$\mathcal{L}_{BCE}$ is the binary cross-entropy loss associated with the 17 AU detection task.
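Putting the four terms of Eq. (3) together, a sketch of the MMA training objective (assuming the network outputs are already probabilities for the expressions and AUs; the clipping and the small epsilon are ours for numerical stability):

```python
import tensorflow as tf

def ccc(x, y):
    """Concordance Correlation Coefficient between labels x and predictions y (1-D tensors)."""
    mx, my = tf.reduce_mean(x), tf.reduce_mean(y)
    vx, vy = tf.math.reduce_variance(x), tf.math.reduce_variance(y)
    cov = tf.reduce_mean((x - mx) * (y - my))
    return 2.0 * cov / (vx + vy + (mx - my) ** 2 + 1e-8)

def loss_mma(v_lab, v_pred, a_lab, a_pred, expr_lab, expr_prob, au_lab, au_prob, au_pseudo):
    l_ccc = 1.0 - 0.5 * (ccc(a_lab, a_pred) + ccc(v_lab, v_pred))                          # L_CCC
    l_cce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(expr_lab, expr_prob))  # L_CCE
    l_bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(au_lab, au_prob))           # L_BCE
    # Eq. (2): soft-target cross entropy matching AU predictions to the AU pseudo-representations.
    l_dm = tf.reduce_mean(tf.reduce_sum(
        -au_pseudo * tf.math.log(tf.clip_by_value(au_prob, 1e-7, 1.0)), axis=-1))          # L_DM
    return l_ccc + l_cce + l_bce + l_dm                                                    # Eq. (3)
```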

3.2 MRNN: Masked RNN and Routing Component

As described in the previous section, the MMA component processes an input video $\textbf{X}$ by extracting affective representations from each frame ($\textbf{x}_{i}$) using three distinct affective models. This results in an affect-level representation matrix $\textbf{Z}=\{\textbf{z}_{1},...,\textbf{z}_{t}\}\in\Re^{26\times t}$. This matrix is then fed into an RNN positioned atop the MMA component, which captures temporal dependencies and sequential information across consecutive frames of the video. The RNN sequentially processes the extracted vector representations from the first frame to frame $t$, mapping these representations $\{\textbf{z}_{k}\}_{k=1}^{t}$ to embeddings $\{\textbf{z}^{\prime}_{k}\}_{k=1}^{t}$, where each $\textbf{z}^{\prime}_{k}\in\Re^{d^{\prime}}$.

These embeddings (corresponding to all video frames) are concatenated into a single vector embedding $\textbf{z}^{\prime}\in\Re^{d^{\prime}\cdot t}$, aligning with the goal of estimating the intensity of the various facial expressions across the entire video sequence, consistent with the provided annotations. This embedding $\textbf{z}^{\prime}$ is then passed through a Mask layer, producing a new embedding $\textbf{z}^{\prime\prime}\in\Re^{d^{\prime}\cdot t}$. The original (pre-padding) length $l$ of the input video is propagated to the Mask layer to guide the routing mechanism. During training, the routing mechanism dynamically selects elements from various positions within $\textbf{z}^{\prime}$ based on the video's length $l$, preserving the values of these selected elements and setting the remaining elements to zero. This process effectively routes only the relevant elements into the subsequent layer, thereby enhancing the model's focus on key temporal features.

The embedding $\textbf{z}^{\prime\prime}\in\Re^{d^{\prime}\cdot t}$ is then transformed into another embedding $\textbf{z}^{\prime\prime\prime}\in\Re^{d^{\prime\prime}}$ through a feed-forward layer, which is trained to extract high-level information from the 'masked' embedding $\textbf{z}^{\prime\prime}$. During training, only the weights connecting the feed-forward layer neurons to the elements within $\textbf{z}^{\prime}$ routed by the Mask layer are updated. The remaining weights are updated when their corresponding feed-forward layer neurons are connected to elements within $\textbf{z}^{\prime}$ that are selected by the Mask layer for a different video input. The loss minimization is conducted as in networks with dynamic routing: the weights not involved in the routing process remain constant, and links corresponding to non-routed elements within $\textbf{z}^{\prime}$ are ignored. Finally, the embedding $\textbf{z}^{\prime\prime\prime}$ is mapped to the logits $\textbf{u}$ via a feed-forward layer parameterized by $\textbf{W}$, resulting in the video-level classification, expressed as: $\textbf{u}=\textbf{W}^{T}\textbf{z}^{\prime\prime\prime}$.
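That only the weights connected to routed elements receive updates follows from the chain rule: for a linear layer, the gradient of the loss with respect to a weight is proportional to the input element it connects to, so the zeroed (non-routed) elements of $\textbf{z}^{\prime\prime}$ contribute zero gradient. A minimal check (illustrative, not the authors' code):

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(4, use_bias=False)
x = tf.constant([[1.0, 2.0, 0.0, 0.0, 0.0, 0.0]])   # z'': last four elements zeroed by the Mask layer
_ = dense(x)                                         # builds a kernel of shape (6, 4)

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(dense(x) ** 2)
grad = tape.gradient(loss, dense.kernel)
# Rows of the gradient corresponding to zeroed inputs are exactly zero, so only the
# weights connected to the routed elements are updated in this step.
print(tf.reduce_sum(tf.abs(grad), axis=1))           # non-zero for rows 0-1, zero for rows 2-5
```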

The loss function that we utilized for training MMA-MRNNet was not the typical Mean Squared Error (MSE), but a loss based on the Pearson correlation, since this correlation metric is the evaluation criterion of the utilized database:

$\mathcal{L}_{total}=1-\sum_{i=1}^{7}\frac{\rho_{i}}{7}=1-\frac{1}{7}\sum_{i=1}^{7}\frac{s_{i,xy}}{\sqrt{s_{i,x}\cdot s_{i,y}}}$   (4)

where $i$ denotes the facial expression; $\rho_{i}$ is the Pearson correlation coefficient; $s_{i,x}$ and $s_{i,y}$ are the variances of the expression labels and the predicted values, respectively; and $s_{i,xy}$ is their covariance.
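A sketch of the loss of Eq. (4), computing the per-expression Pearson correlation over a batch of video-level labels and predictions (the small epsilon is ours for numerical stability):

```python
import tensorflow as tf

def pearson_loss(y_true, y_pred, eps=1e-8):
    """Eq. (4): 1 minus the mean Pearson correlation over the 7 expression intensities.
    y_true, y_pred: (batch, 7) video-level labels and predictions."""
    yt_c = y_true - tf.reduce_mean(y_true, axis=0)
    yp_c = y_pred - tf.reduce_mean(y_pred, axis=0)
    cov = tf.reduce_mean(yt_c * yp_c, axis=0)                 # s_{i,xy}
    var_t = tf.math.reduce_variance(y_true, axis=0)           # s_{i,x}
    var_p = tf.math.reduce_variance(y_pred, axis=0)           # s_{i,y}
    rho = cov / (tf.sqrt(var_t * var_p) + eps)                # per-expression rho_i
    return 1.0 - tf.reduce_mean(rho)
```

Here the correlation is computed across the videos of a batch, which is one common way to optimize a correlation-based criterion.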

3.3 Datasets, Pre-Processing and Implementation Details

The Hume-Reaction dataset was used as part of both the Emotional Reactions Sub-Challenge of MuSe 2022 [6] and the Emotional Reaction Intensity Estimation Challenge of the 5th ABAW Competition in 2023 [40, 18, 41, 42, 33, 27, 29, 30, 19, 20, 35, 25, 21, 36, 31, 22, 37, 34, 26, 1, 23]. The participants of this sub-challenge explore a multi-output regression task, utilizing seven self-annotated, nuanced classes of emotion: ‘Adoration,’ ‘Amusement,’ ‘Anxiety,’ ‘Disgust,’ ‘Empathic-Pain,’ ‘Fear,’ and ‘Surprise.’ The dataset consists of 25,067 videos taken from 2,222 subjects, of which 15,806 constitute the training set, 4,657 the validation set, and 4,604 the test set.

The Aff-Wild2 database [38, 40, 20, 36, 22, 37, 34, 28, 57, 18, 33, 41, 32, 42, 43] is the largest in-the-wild database and the only one annotated on a per-frame basis for the seven basic expressions (i.e., happiness, surprise, anger, disgust, fear, sadness and the neutral state), twelve action units (AUs 1, 2, 4, 6, 7, 10, 12, 15, 23, 24, 25, 26) and valence and arousal. In total, it consists of 564 videos of around 2.8M frames with 554 subjects. Aff-Wild2 displays great diversity in terms of subjects’ ages, ethnicities and nationalities, as well as large variation in recording environments.

The AffectNet dataset [49] contains around 1M facial images, 300K of which were manually annotated in terms of 7 discrete expressions (plus contempt) and valence-arousal. The original training set of this database consists of around 290K images and the original validation set of around 4K. We evaluate our method on the updated partitioning protocol of this database according to our previous work [15, 14] (as mentioned in Section 3.1). This new partitioning consists of a training set of around 160K images, a validation set of around 45K images and a test set of around 90K images.

The EmotioNet database [5] contains around 1M images and was released for the EmotioNet Challenge in 2017. 950K images were automatically annotated and the remaining 50K images were manually annotated with 11 AUs (1, 2, 4, 5, 6, 9, 12, 17, 20, 25, 26); around half of the latter constituted the validation set and the other half the test set of the Challenge. We evaluate our method on the updated partitioning protocol of this database according to our previous work [15, 14] (as mentioned in Section 3.1). This new partitioning consists of a training set of around 25K images, a validation set of around 7K images and a test set of around 14K images.

We used the RetinaFace detector [7] to extract, from all images, face bounding boxes and 5 facial landmarks; the latter were used for face alignment. All cropped and aligned images were resized to $112\times 112\times 3$ pixel resolution and their intensity values were normalized to $[-1,1]$.
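A sketch of the resize-and-normalization step (face detection and landmark-based alignment with RetinaFace are assumed to have already produced the aligned crop; this is not the detector's API):

```python
import tensorflow as tf

def preprocess_face(aligned_crop):
    """aligned_crop: (H, W, 3) uint8 face crop produced by detection + 5-landmark alignment."""
    x = tf.image.resize(tf.cast(aligned_crop, tf.float32), (112, 112))  # 112 x 112 x 3
    return x / 127.5 - 1.0                                              # intensities in [-1, 1]
```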

We chose a batch size of 4, a length $t$ of 480, and the Adam optimizer with learning rate $10^{-4}$ when training from scratch and $10^{-5}$ when training in an end-to-end manner, after having initialised each subnetwork. For the RNN we utilize a 1-layer GRU with 128 units; the feed-forward layer consists of 32 units. Training was performed on a Tesla V100 32GB GPU and took 3 days. The TensorFlow platform was used.
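For reference, a minimal training-step sketch with these hyper-parameters; `net` and `loss_fn` stand for the MMA-MRNNet forward pass and the loss of Eq. (4), and this is a simplified stand-in rather than the exact training loop:

```python
import tensorflow as tf

BATCH_SIZE, T = 4, 480
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # 1e-5 when fine-tuning end-to-end

@tf.function
def train_step(frames, true_len, labels, net, loss_fn, variables):
    """frames: (4, 480, 112, 112, 3); true_len: (4,); labels: (4, 7) expression intensities."""
    with tf.GradientTape() as tape:
        preds = net(frames, true_len)      # e.g. the MMA-MRNNet sketch of Section 3
        loss = loss_fn(labels, preds)      # Pearson-correlation-based loss of Eq. (4)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```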

4 Experimental Results

4.1 Comparison with the state-of-the-art

We first compare the performance of MMA-MRNNet to that of various baseline [6] and state-of-the-art methods: ViPER and Netease Fuxi Virtual Human (multi-modal methods exploiting audio, visual, and text information); the best-performing HFUT-CVers method (presented in the related work section; an ensemble multi-modal method exploiting both audio and visual information); the USTC-IAT-United method (presented in the related work section; a multi-modal method exploiting both audio and visual information); and the USTC-AC and NISL-2023 methods (both presented in the related work section; ensemble multi-modal methods exploiting both audio and visual information). Table 2 shows that our uni-modal, non-ensemble MMA-MRNNet (which exploits only the visual information and does not employ any ensemble learning) outperforms all other methods by large margins, even though several of them are multi-modal or ensemble approaches. Let us also note that all baseline and state-of-the-art methods utilized the ad-hoc strategy of selecting a fixed input length by removing or duplicating images within each video sequence.

Table 2: Comparison between MMA-MRNNet, baselines and the state-of-the-art on the test set of Hume-Reaction dataset; Pearson’s Correlation Coefficient results are denoted in %
Methods Pearson’s Correlation Coefficient (ρ)
HFUT-CVers [44] 47.3
USTC-IAT-United [56] 43.8
Netease Fuxi Virtual Human [51] 40.5
USTC-AC [54] 37.3
NISL-2023 [59] 36.7
ViPER [53] 29.7
FAU-Baseline [6] 28.0
VGGface2-Baseline [6] 18.3
Fusion-Baseline [6] 20.3
MMA-MRNNet 53.1

4.2 Ablation Study

We conducted a series of ablation experiments to evaluate the impact of different elements and components on our model’s performance.

Initially, we used only single-task affective representations (extracted from MMA) as input to the RNN. We then tested combinations of two tasks (e.g., VA & AUs), and finally, we utilized the affective representations from all three tasks concurrently. The results are summarized in Table 3, where we present only the best performance for each experiment to avoid cluttering the results. Notably, even when using only valence and arousal representations, our network outperformed all other methods except HFUT-CVers. The model's performance improved substantially when incorporating additional per-frame features, such as the 7 basic expressions or 17 AUs. In the two-task experiments, we observed a further increase in Pearson's correlation coefficient of 1% to 1.5%. Ultimately, when all three tasks were used together, our method achieved the highest performance.

Table 3: Ablation Results on MMA-MRNNet on the validation set of Hume-Reaction dataset; Pearson’s Correlation Coefficient results are denoted in %
Affective Representation from the MMA component  Pearson’s Correlation Coefficient (ρ)
VA 51.1
7 Basic Expressions 52.6
17 AUs 53.0
VA & 7 Basic Expressions 53.5
VA & 17 AUs 54.1
17 AUs & 7 Basic Expressions 54.1
VA & 7 Basic Expressions & 17 AUs 54.4

To identify the optimal architecture for our network, we conducted experiments with various configurations, including different CNNs (e.g., ResNet50 instead of MMA) and RNNs (e.g., LSTM instead of GRU), as well as varying the number of layers and units, as detailed in Table 4. After evaluating a wide range of combinations, we determined that the most effective configuration consists of a single GRU layer with 128 units, followed by a feed-forward layer with 32 units. Additionally, we evaluated the impact of incorporating the Mask layer, the dynamic routing, and our proposed loss function (as an alternative to the conventional MSE). The results presented in Table 4 demonstrate that these components significantly enhance the performance of MMA-MRNNet.

Table 4: Further Ablation Results on MMA-MRNNet on the validation set of the Hume-Reaction dataset; Pearson’s Correlation Coefficient results are denoted in %
Model Pearson’s Correlation Coefficient (ρ)
MMA + GRU + FC (64) 53.0
MMA + GRU + FC (16) 53.5
MMA + GRU + FC (8) 52.8
MMA + 2 ×\times GRU + FC (32) 54.2
ResNet50 + GRU + FC (32) 51.5
MMA + LSTM + FC (32) 53.2
MMA + GRU (256) + FC (32) 53.4
MMA + GRU (64) + FC (32) 53.8
MMA-MRNNet w/o Mask & Routing 51.9
MMA-MRNNet with MSE 53.1
MMA-MRNNet 54.4

4.3 MMA Evaluation Results

Here we provide an extensive experimental study in which we utilise the top-performing methods from the ABAW Competitions (FUXI [58], SITU [45], CTC [61]) and the state-of-the-art methods (DACL [11], DAN [55], POSTER++ [47]; ME-GraphAU [46] & AUNets [52]) for 7 basic expression recognition, AU detection and valence-arousal estimation, and compared their performance to our proposed MMA component.

As can be seen in Table 5, our proposed MMA component outperformed all these methods on all tasks (7 basic expression recognition, AU detection and valence-arousal estimation) and on all utilized databases (Aff-Wild2, AffectNet and EmotioNet) by large margins.

Table 5: Performance comparison (in %) between the MMA component and various state-of-the-art methods. ’CCC-VA’ represents the average Concordance Correlation Coefficient (CCC) for valence and arousal. ’F1 - Expr’ refers to the average (i.e., macro) F1 score across the 7 basic expressions, while ’F1 - AUs’ denotes the average F1 score across all AUs present in each of the databases used.
Methods | Aff-Wild2: CCC-VA | Aff-Wild2: F1 - Exprs | Aff-Wild2: F1 - AUs | AffectNet: CCC-VA | AffectNet: F1 - Exprs | EmotioNet: F1 - AUs
SITU | 64.14 | 38.24 | 54.22 | 71.1 | 59.0 | 77.2
CTC | 56.66 | 33.11 | 48.87 | 71.0 | 57.7 | 74.6
FUXI | 63.72 | 39.21 | 55.49 | 74.0 | 63.1 | 77.9
ME-GraphAU | - | - | - | - | - | 72.9
AUNets | - | - | - | - | - | 82.8
DAN | - | - | - | - | 60.0 | -
DACL | - | - | - | - | 60.3 | -
POSTER++ | - | - | - | - | 63.2 | -
MMA | 67.38 | 43.21 | 58.87 | 78.2 | 65.4 | 85.4

As detailed in Table 5, the MMA component was evaluated against several state-of-the-art approaches across the Aff-Wild2, AffectNet, and EmotioNet datasets. Specifically, on the Aff-Wild2 dataset, MMA surpassed the closest competitor, SITU, for valence-arousal estimation by 3.24%; it also outperformed the closest competitor, FUXI, for 7 basic expression recognition by 4% and for AU detection by 3.38%. On the AffectNet dataset, MMA again demonstrated superior performance, outperforming the state-of-the-art methods by at least 4.2% for valence-arousal estimation and by at least 2.2% for 7 basic expression recognition. On the EmotioNet dataset, MMA outperformed the state-of-the-art methods by at least 2.6% for AU detection. These results underscore the robustness and superiority of MMA in delivering precise and reliable affect representations.

5 Conclusion

In this paper, we introduced MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Our method addresses the limitations of traditional approaches by leveraging a Multi-Task Learning (MTL) framework to extract rich affective representations, including valence-arousal, basic facial expressions, and action units (AUs). These representations are further refined through the Masked RNN and Routing component (MRNN), which dynamically adjusts to the variable lengths of input videos, ensuring robust and accurate predictions.

We demonstrated the effectiveness of MMA-MRNNet on the Hume-Reaction dataset, where it consistently outperformed by large margins all state-of-the-art methods. We also demonstrated the effectiveness of the MMA component across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics. Our approach not only handles the complexities of video-level annotation but also mitigates the challenges associated with processing variable-length sequences, offering a flexible and powerful solution for real-world applications in affective computing.

References

  • [1] Arsenos, A., Davidhi, A., Kollias, D., Prassopoulos, P., Kollias, S.: Data-driven covid-19 detection through medical imaging. In: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). pp. 1–5. IEEE (2023)
  • [2] Arsenos, A., Karampinis, V., Petrongonas, E., Skliros, C., Kollias, D., Kollias, S., Voulodimos, A.: Common corruptions for evaluating and enhancing robustness in air-to-air visual object detection. IEEE Robotics and Automation Letters (2024)
  • [3] Arsenos, A., Kollias, D., Petrongonas, E., Skliros, C., Kollias, S.: Uncertainty-guided contrastive learning for single source domain generalisation. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 6935–6939. IEEE (2024)
  • [4] Arsenos, A., Petrongonas, E., Filippopoulos, O., Skliros, C., Kollias, D., Kollias, S.: Nefeli: A deep-learning detection and tracking pipeline for enhancing autonomy in advanced air mobility. Available at SSRN 4674579
  • [5] Benitez-Quiroz, C., Srinivasan, R., Martinez, A.: Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (CVPR’16). Las Vegas, NV, USA (June 2016)
  • [6] Christ, L., Amiriparian, S., Baird, A., Tzirakis, P., Kathan, A., Müller, N., Stappen, L., Meßner, E.M., König, A., Cowen, A., Cambria, E., Schuller, B.W.: The muse 2022 multimodal sentiment analysis challenge: Humor, emotional reactions, and stress. In: Proceedings of the 3rd Multimodal Sentiment Analysis Challenge. Association for Computing Machinery, Lisbon, Portugal (2022), workshop held at ACM Multimedia 2022, to appear
  • [7] Deng, J., Guo, J., Ververas, E., Kotsia, I., Zafeiriou, S.: Retinaface: Single-shot multi-level face localisation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5203–5212 (2020)
  • [8] Du, S., Tao, Y., Martinez, A.M.: Compound facial expressions of emotion. Proceedings of the National Academy of Sciences 111(15), E1454–E1462 (2014)
  • [9] Ekman, P.: Facial action coding system (facs). A human face (2002)
  • [10] Ekman, P., Friesen, W.V.: Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978)
  • [11] Farzaneh, A.H., Qi, X.: Facial expression recognition in the wild via deep attentive center loss. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2402–2411 (2021)
  • [12] Gerogiannis, D., Arsenos, A., Kollias, D., Nikitopoulos, D., Kollias, S.: Covid-19 computer-aided diagnosis through ai-assisted ct imaging analysis: Deploying a medical ai system. arXiv preprint arXiv:2403.06242 (2024)
  • [13] Guo, X., Polanía, L.F., Barner, K.E.: Audio-video emotion recognition in the wild using deep hybrid networks (2020). https://doi.org/10.48550/ARXIV.2002.09023, https://arxiv.org/abs/2002.09023
  • [14] Hu, G., Kollias, D., Papadopoulou, E., Tzouveli, P., Wei, J., Yang, X.: Rethinking affect analysis: A protocol for ensuring fairness and consistency. arXiv preprint arXiv:2408.02164 (2024)
  • [15] Hu, G., Papadopoulou, E., Kollias, D., Tzouveli, P., Wei, J., Yang, X.: Bridging the gap: Protocol towards fair and consistent affect analysis. arXiv preprint arXiv:2405.06841 (2024)
  • [16] Hu, P., Cai, D., Wang, S., Yao, A., Chen, Y.: Learning supervised scoring ensemble for emotion recognition in the wild. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction. p. 553–560. ICMI ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3136755.3143009, https://doi.org/10.1145/3136755.3143009
  • [17] Karampinis, V., Arsenos, A., Filippopoulos, O., Petrongonas, E., Skliros, C., Kollias, D., Kollias, S., Voulodimos, A.: Ensuring uav safety: A vision-only and real-time framework for collision avoidance through object detection, tracking, and distance estimation. arXiv preprint arXiv:2405.06749 (2024)
  • [18] Kollias, D., Schulc, A., Hajiyev, E., Zafeiriou, S.: Analysing affective behavior in the first abaw 2020 competition. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG). pp. 794–800
  • [19] Kollias, D.: Abaw: learning from synthetic data & multi-task learning challenges. In: European Conference on Computer Vision. pp. 157–172. Springer (2022)
  • [20] Kollias, D.: Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. arXiv preprint arXiv:2202.10659 (2022)
  • [21] Kollias, D.: Abaw: Learning from synthetic data & multi-task learning challenges. In: European Conference on Computer Vision. pp. 157–172. Springer (2023)
  • [22] Kollias, D.: Multi-label compound expression recognition: C-expr database & network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5589–5598 (2023)
  • [23] Kollias, D., Arsenos, A., Kollias, S.: Ai-enabled analysis of 3-d ct scans for diagnosis of covid-19 & its severity. In: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). pp. 1–5. IEEE (2023)
  • [24] Kollias, D., Arsenos, A., Kollias, S.: A deep neural architecture for harmonizing 3-d input data analysis and decision making in medical imaging. Neurocomputing 542, 126244 (2023)
  • [25] Kollias, D., Arsenos, A., Kollias, S.: A deep neural architecture for harmonizing 3-d input data analysis and decision making in medical imaging. Neurocomputing 542, 126244 (2023)
  • [26] Kollias, D., Arsenos, A., Kollias, S.: Domain adaptation, explainability & fairness in ai for medical image analysis: Diagnosis of covid-19 based on 3-d chest ct-scans. arXiv preprint arXiv:2403.02192 (2024)
  • [27] Kollias, D., Cheng, S., Ververas, E., Kotsia, I., Zafeiriou, S.: Deep neural network augmentation: Generating faces for affect analysis. International Journal of Computer Vision 128(5), 1455–1484 (2020)
  • [28] Kollias, D., Cheng, S., Ververas, E., Kotsia, I., Zafeiriou, S.: Deep neural network augmentation: Generating faces for affect analysis. International Journal of Computer Vision pp. 1–30 (2020)
  • [29] Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. pp. 1972–1979. IEEE (2017)
  • [30] Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. pp. 1972–1979. IEEE (2017)
  • [31] Kollias, D., Psaroudakis, A., Arsenos, A., Theofilou, P.: Facernet: a facial expression intensity estimation network. arXiv preprint arXiv:2303.00180 (2023)
  • [32] Kollias, D., Sharmanska, V., Zafeiriou, S.: Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111 (2019)
  • [33] Kollias, D., Sharmanska, V., Zafeiriou, S.: Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790 (2021)
  • [34] Kollias, D., Sharmanska, V., Zafeiriou, S.: Distribution matching for multi-task learning of classification tasks: a large-scale study on faces & beyond. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 2813–2821 (2024)
  • [35] Kollias, D., Tzirakis, P., Baird, A., Cowen, A., Zafeiriou, S.: Abaw: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5888–5897 (2023)
  • [36] Kollias, D., Tzirakis, P., Baird, A., Cowen, A., Zafeiriou, S.: Abaw: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5888–5897 (2023)
  • [37] Kollias, D., Tzirakis, P., Cowen, A., Zafeiriou, S., Shao, C., Hu, G.: The 6th affective behavior analysis in-the-wild (abaw) competition. arXiv preprint arXiv:2402.19344 (2024)
  • [38] Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I., Zafeiriou, S.: Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision pp. 1–23 (2019)
  • [39] Kollias, D., Vendal, K., Gadhavi, P., Russom, S.: Btdnet: A multi-modal approach for brain tumor radiogenomic classification. Applied Sciences 13(21), 11984 (2023)
  • [40] Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855 (2019)
  • [41] Kollias, D., Zafeiriou, S.: Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792 (2021)
  • [42] Kollias, D., Zafeiriou, S.: Analysing affective behavior in the second abaw2 competition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3652–3660 (2021)
  • [43] Kollias, D., Zafeiriou, S., Kotsia, I., Dhall, A., Ghosh, S., Shao, C., Hu, G.: 7th abaw competition: Multi-task learning and compound expression recognition. arXiv preprint arXiv:2407.03835 (2024)
  • [44] Li, J., Chen, Y., Zhang, X., Nie, J., Yu, Y., Li, Z., Wang, M., Hong, R.: Multimodal feature extraction and fusion for emotional reaction intensity estimation and expression classification in videos with transformers. arXiv preprint arXiv:2303.09164 (2023)
  • [45] Liu, C., Zhang, X., Liu, X., Zhang, T., Meng, L., Liu, Y., Deng, Y., Jiang, W.: Facial expression recognition based on multi-modal features for videos in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5871–5878 (2023)
  • [46] Luo, C., Song, S., Xie, W., Shen, L., Gunes, H.: Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. pp. 1239–1246 (2022)
  • [47] Mao, J., Xu, R., Yin, X., Chang, Y., Nie, B., Huang, A.: Poster v2: A simpler and stronger facial expression recognition network. arXiv preprint arXiv:2301.12149 (2023)
  • [48] Miah, H., Kollias, D., Pedone, G.L., Provan, D., Chen, F.: Can machine learning assist in diagnosis of primary immune thrombocytopenia? a feasibility study. arXiv preprint arXiv:2405.20562 (2024)
  • [49] Mollahosseini, A., Hasani, B., Mahoor, M.H.: Affectnet: A database for facial expression, valence, and arousal computing in the wild. arXiv preprint arXiv:1708.03985 (2017)
  • [50] Plutchik, R.: A psychoevolutionary theory of emotions (1982)
  • [51] Qiu, F., Ma, B., Zhang, W., Ding, Y.: Multi-modal emotion reaction intensity estimation with temporal augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5777–5784 (2023)
  • [52] Romero, A., León, J., Arbeláez, P.: Multi-view dynamic facial action unit detection. Image and Vision Computing (2018)
  • [53] Vaiani, L., La Quatra, M., Cagliero, L., Garza, P.: Viper: Video-based perceiver for emotion recognition. In: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge. pp. 67–73 (2022)
  • [54] Wang, S., Wu, J., Zheng, F., Li, X., Li, X., Wang, S., Wu, Y., Chang, Y., Miao, X.: Emotional reaction intensity estimation based on multimodal data. arXiv preprint arXiv:2303.09167 (2023)
  • [55] Wen, Z., Lin, W., Wang, T., Xu, G.: Distract your attention: Multi-head cross attention network for facial expression recognition. Biomimetics 8(2),  199 (2023)
  • [56] Yu, J., Cai, Z., Li, R., Zhao, G., Xie, G., Zhu, J., Zhu, W.: Exploring large-scale unlabeled faces to enhance facial expression recognition. arXiv preprint arXiv:2303.08617 (2023)
  • [57] Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal ‘in-the-wild’challenge. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. pp. 1980–1987. IEEE (2017)
  • [58] Zhang, W., Ma, B., Qiu, F., Ding, Y.: Multi-modal facial affective analysis based on masked autoencoder. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5792–5801 (2023)
  • [59] Zhang, Z., An, L., Cui, Z., Dong, T., et al.: Facial affect recognition based on transformer encoder and audiovisual fusion for the abaw5 challenge. arXiv preprint arXiv:2303.09158 (2023)
  • [60] Zhao, S., Ma, Y., Gu, Y., Yang, J., Xing, T., Xu, P., Hu, R., Chai, H., Keutzer, K.: An end-to-end visual-audio attention network for emotion recognition in user-generated videos (2020). https://doi.org/10.48550/ARXIV.2003.00832, https://arxiv.org/abs/2003.00832
  • [61] Zhou, W., Lu, J., Xiong, Z., Wang, W.: Leveraging tcn and transformer for effective visual-audio fusion in continuous emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5756–5763 (2023)