ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection & Multi-Task Learning Challenges
Abstract
This paper describes the third Affective Behavior Analysis in-the-wild (ABAW) Competition, held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. The 3rd ABAW Competition is a continuation of the Competitions held at ICCV 2021, IEEE FG 2020 and IEEE CVPR 2017, and aims at automatically analyzing affect. This year the Competition encompasses four Challenges: i) uni-task Valence-Arousal Estimation, ii) uni-task Expression Classification, iii) uni-task Action Unit Detection, and iv) Multi-Task Learning. All the Challenges are based on a common benchmark database, Aff-Wild2, which is a large-scale in-the-wild database and the first one to be annotated in terms of valence-arousal, expressions and action units. In this paper, we present the four Challenges and the utilized Competition corpora, outline the evaluation metrics, and present the baseline systems along with their obtained results.
1 Introduction
Affect recognition based on a subject’s facial expressions has been a topic of major research in the attempt to generate machines that can understand the way subjects feel, act and react. The problem of affect analysis and recognition constitutes a key issue in behavioural modelling, human computer/machine interaction and affective computing. There are a number of related applications spread across a variety of fields, such as medicine, health, driver fatigue monitoring, e-learning, marketing, entertainment, lie detection and law [1, 22, 38, 18, 53].
In the past, due to the unavailability of large amounts of data captured in real-life situations, research mainly focused on controlled environments. Recently, however, social media and platforms have been widely used and large amounts of data have become available. Moreover, deep learning has emerged as a means to solve visual analysis and recognition problems. Thus, major research effort has been devoted during the last few years to the development and use of deep learning techniques and deep neural networks [34, 17] in various applications, including affect recognition in-the-wild, i.e., in unconstrained environments. Apart from affect analysis and recognition, the generation of facial affect is also of great significance in many real-life applications, such as the synthesis of affect on avatars that interact with humans, in computer games, in augmented and virtual environments, and in educational and learning contexts [45, 46, 41, 42]. The ABAW Workshop exploits these advances and makes significant contributions to affect analysis, recognition and synthesis in-the-wild.
Ekman [13] was the first to systematically study human facial expressions. His study categorizes the prototypical facial expressions, apart from neutral expression, into six classes representing anger, disgust, fear, happiness, sadness and surprise. Furthermore, facial expressions are related to specific movements of facial muscles, called Action Units (AUs). The Facial Action Coding System (FACS) was developed, in which facial changes are described in terms of AUs [6].
Apart from the above categorical definition of facial expressions and related emotions, in the last few years there has been great interest in dimensional emotion representations, which are widely used in human computer interaction and human behaviour analysis. Dimensional emotion representations tag emotional states in a continuous manner, usually in terms of the arousal and valence dimensions, i.e., in terms of how active or passive and how positive or negative the human behaviour under analysis is [14, 49, 43].
The third ABAW Competition, to be held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, is a continuation of the first [24] and second [32] ABAW Competitions, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (IEEE FG) 2020 and with the International Conference on Computer Vision (ICCV) 2021, respectively, which targeted dimensional (in terms of valence and arousal) [8, 35, 56, 11, 4, 3, 9, 55, 47, 48, 54, 50, 21, 2, 39], categorical (in terms of the basic expressions) [33, 15, 12, 51, 36, 16, 37] and facial action unit analysis and recognition [40, 20, 26, 19, 25, 7, 44, 47]. The third ABAW Competition contains four Challenges, all based on the same in-the-wild database: (i) the uni-task Valence-Arousal Estimation Challenge; (ii) the uni-task Expression Classification Challenge (for the 6 basic expressions, plus the neutral state, plus the ’other’ category that denotes expressions/affective states other than the 6 basic ones); (iii) the uni-task Action Unit Detection Challenge (for 12 action units); (iv) the Multi-Task Learning Challenge (for joint learning and prediction of valence-arousal, 8 expressions -6 basic plus neutral plus ’other’- and 12 action units). These Challenges represent a significant step forward compared to previous events. In particular, they use Aff-Wild2 [32, 24, 30, 31, 28, 29, 26, 25, 27, 52, 23], the first comprehensive benchmark for all three affect recognition tasks in-the-wild: the Aff-Wild2 database extends Aff-Wild [27, 52, 23] with more videos and annotations for all behavior tasks.
2 Competition Corpora
The third Affective Behavior Analysis in-the-wild (ABAW) Competition relies on the Aff-Wild2 database, which is the first ever database annotated for all three main behavior tasks: valence-arousal estimation, action unit detection and expression classification. These three tasks constitute the basis of the four Challenges.
The Aff-Wild2 database, in all Challenges, is split into training, validation and test sets. At first, the training and validation sets, along with their corresponding annotations, are made available to the participants, so that they can develop and test their own methodologies. Furthermore, to facilitate training, especially for teams that do not have access to face detection/tracking algorithms, we provide bounding boxes and landmarks for the face(s) in the videos (we also provide the aligned faces). At a later stage, the test set, without annotations, will be given to the participants; again, we will provide bounding boxes, landmarks and aligned faces.
In the following, we provide a short overview of each Challenge’s dataset and refer the reader to the original work for a more complete description. Finally, we describe the pre-processing steps that we carried out for cropping and aligning the images of Aff-Wild2. The cropped and aligned images have been utilized in our baseline experiments.
2.1 Valence-Arousal Estimation Challenge
This Challenge’s corpora include videos in Aff-Wild2 that contain annotations in terms of valence and arousal. Sixteen of these videos display two subjects, both of whom have been annotated. In total, frames, with subjects, of which are male and female, have been annotated by four experts using the method proposed in [5]. Valence and arousal values range continuously in [-1, 1].
2.2 Expression Recognition Challenge
This Challenge’s corpora include videos in Aff-Wild2 that contain annotations in terms of the 6 basic expressions, plus the neutral state, plus a category ’other’ that denotes expressions/affective states other than the 6 basic ones. Seven of these videos display two subjects, both of whom have been annotated. In total, frames, with subjects, of which are male and female, have been annotated by seven experts on a frame-by-frame basis.
2.3 Action Unit Detection Challenge
This Challenge’s corpora include 547 videos that contain annotations in terms of 12 AUs, namely AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU15, AU23, AU24, AU25 and AU26. Seven of these videos display two subjects, both of whom have been annotated. In total, frames, with subjects, of which are male and female, have been annotated in a semi-automatic procedure (that involves manual and automatic annotations). The annotation has been performed on a frame-by-frame basis. Table 1 lists the twelve annotated action units and the actions they are associated with.
Table 1: The twelve annotated action units and their associated actions.

| Action Unit # | Action |
|---|---|
| AU 1 | inner brow raiser |
| AU 2 | outer brow raiser |
| AU 4 | brow lowerer |
| AU 6 | cheek raiser |
| AU 7 | lid tightener |
| AU 10 | upper lip raiser |
| AU 12 | lip corner puller |
| AU 15 | lip corner depressor |
| AU 23 | lip tightener |
| AU 24 | lip pressor |
| AU 25 | lips part |
| AU 26 | jaw drop |
2.4 Multi-Task-Learning Challenge
For this Challenge’s corpora, we have created a static version of the Aff-Wild2 database, named s-Aff-Wild2; it contains specific, selected frames/images from Aff-Wild2. In total, 172,360 images are used, containing annotations in terms of valence-arousal, the 6 basic expressions plus the neutral state plus the ’other’ category, and the 12 action units (as described in the previous subsections).
2.5 Aff-Wild2 Pre-Processing: Cropped & Cropped-Aligned Images
At first, all videos are split into independent frames. These are then passed through the RetinaFace detector [10] so as to extract, for each frame, face bounding boxes and 5 facial landmarks. The images were cropped according to the bounding box locations and were then provided to the participating teams. The 5 facial landmarks (two eyes, nose and two mouth corners) were used to perform a similarity transformation. The resulting cropped and aligned images were additionally provided to the participating teams. Finally, the cropped and aligned images were utilized in our baseline experiments, described in Section 4.
All cropped and cropped-aligned images were resized to a fixed pixel resolution and their intensity values were normalized.
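To make the pre-processing concrete, the following is a minimal sketch of the cropping and 5-point alignment steps described above, not the organizers' exact code; the reference landmark template and the 112x112 output size are assumptions (a commonly used 5-point template), and RetinaFace-style detections are assumed to be already available.

```python
# Sketch of the cropping / alignment pipeline described above (assumptions noted).
# Assumes RetinaFace-style detections are available for each frame:
# a bounding box (x1, y1, x2, y2) and 5 landmarks (eyes, nose, mouth corners).
import numpy as np
import cv2
from skimage.transform import SimilarityTransform

# A commonly used 5-point reference template for 112x112 aligned crops
# (assumed here; the template actually used by the organizers may differ).
REFERENCE_5PTS = np.array([[38.2946, 51.6963],
                           [73.5318, 51.5014],
                           [56.0252, 71.7366],
                           [41.5493, 92.3655],
                           [70.7299, 92.2041]], dtype=np.float32)

def crop_face(frame, bbox):
    """Crop the detected face using its bounding box."""
    x1, y1, x2, y2 = [int(v) for v in bbox]
    return frame[max(y1, 0):y2, max(x1, 0):x2]

def align_face(frame, landmarks_5, size=112):
    """Warp the face so its 5 landmarks match the reference template."""
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks_5, dtype=np.float32), REFERENCE_5PTS)
    # tform.params is a 3x3 matrix; cv2.warpAffine needs the top 2x3 block.
    return cv2.warpAffine(frame, tform.params[:2], (size, size))
```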
3 Evaluation Metrics Per Challenge
Next, we present the metrics that will be used to assess the performance of the methodologies developed by the participating teams in each Challenge.
3.1 Valence-Arousal Estimation Challenge
The performance measure is the average between the Concordance Correlation Coefficient (CCC) of valence and arousal. CCC evaluates the agreement between two time series (e.g., all video annotations and predictions) by scaling their correlation coefficient with their mean square difference. CCC takes values in the range [-1, 1]; high values are desired. CCC is defined as follows:
$$\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2} \qquad (1)$$

where $s_x$ and $s_y$ are the variances of all video valence/arousal annotations and predicted values, respectively, $\bar{x}$ and $\bar{y}$ are their corresponding mean values and $s_{xy}$ is the corresponding covariance value.
Therefore, the evaluation criterion for the Valence-Arousal Estimation Challenge is:
$$\mathcal{P}_{VA} = \frac{\rho_a + \rho_v}{2} \qquad (2)$$

where $\rho_a$ and $\rho_v$ are the CCC of arousal and valence, respectively.
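For clarity, a minimal NumPy sketch of the CCC of Eq. (1) and the criterion of Eq. (2) could look as follows; function and variable names are ours, not part of the official evaluation code.

```python
# Minimal NumPy implementation of Eq. (1) and Eq. (2).
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two 1-D arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def p_va(valence_true, valence_pred, arousal_true, arousal_pred):
    """Challenge criterion: average of the valence and arousal CCCs."""
    return 0.5 * (ccc(valence_true, valence_pred) + ccc(arousal_true, arousal_pred))
```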
3.2 Expression Recognition Challenge
The performance measure is the average F1 Score across all 8 categories (i.e., macro F1 Score). The F1 Score is the harmonic mean of the recall (i.e., the ability of the classifier to find all the positive samples) and the precision (i.e., the ability of the classifier not to label as positive a sample that is negative). The F1 Score takes values in the range [0, 1]; high values are desired. It is defined as:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$
Therefore, the evaluation criterion for the Expression Recognition Challenge is:
$$\mathcal{P}_{EXPR} = \frac{\sum_{expr} F_1^{expr}}{8} \qquad (4)$$
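Assuming the ground-truth and predicted labels are integer class indices in {0, ..., 7}, the macro F1 of Eq. (4) can be sketched with scikit-learn as follows:

```python
# Macro F1 over the 8 expression classes (Eq. 4), sketched with scikit-learn.
from sklearn.metrics import f1_score

def p_expr(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro F1)."""
    return f1_score(y_true, y_pred, labels=list(range(8)), average='macro')
```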
3.3 Action Unit Detection Challenge
The performance measure is the average F1 Score across all 12 AUs (i.e., macro F1 Score). Therefore, the evaluation criterion for the Action Unit Detection Challenge is:
$$\mathcal{P}_{AU} = \frac{\sum_{au} F_1^{au}}{12} \qquad (5)$$
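Since each AU is a binary label, the macro F1 of Eq. (5) is the per-AU F1 averaged over the 12 AUs. A short sketch, assuming (num_frames, 12) arrays of 0/1 annotations and binarized predictions:

```python
# Macro F1 over the 12 AUs (Eq. 5): per-AU binary F1, averaged over AUs.
from sklearn.metrics import f1_score

def p_au(y_true, y_pred):
    """y_true, y_pred: (num_frames, 12) arrays of 0/1 labels."""
    return f1_score(y_true, y_pred, average='macro')
```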
3.4 Multi-Task-Learning Challenge
The performance measure is the sum of: the average CCC of valence and arousal; the average F1 Score of the 8 expression categories; the average F1 Score of the 12 action units (as defined above). Therefore, the evaluation criterion for the Multi-Task-Learning Challenge is:
$$\mathcal{P}_{MTL} = \mathcal{P}_{VA} + \mathcal{P}_{EXPR} + \mathcal{P}_{AU} = \frac{\rho_a + \rho_v}{2} + \frac{\sum_{expr} F_1^{expr}}{8} + \frac{\sum_{au} F_1^{au}}{12} \qquad (6)$$
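Reusing the p_va, p_expr and p_au helpers sketched above, the overall criterion of Eq. (6) can be computed as follows; the (N, 2) valence-arousal array layout (column 0 valence, column 1 arousal) is an assumption for illustration.

```python
# The Multi-Task-Learning criterion of Eq. (6), combining the three per-task
# criteria (p_va, p_expr, p_au are the helper functions defined above).
def p_mtl(va_true, va_pred, expr_true, expr_pred, au_true, au_pred):
    """P_MTL = P_VA + P_EXPR + P_AU; the sum therefore lies in [-1, 3]."""
    return (p_va(va_true[:, 0], va_pred[:, 0], va_true[:, 1], va_pred[:, 1])
            + p_expr(expr_true, expr_pred)
            + p_au(au_true, au_pred))
```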
4 Baseline Networks and Results
All baseline systems rely exclusively on existing open-source machine learning toolkits to ensure the reproducibility of the results. All systems have been implemented in TensorFlow; training time was around six hours on a Titan X GPU, with a learning rate of and with a batch size of 256.
In this Section, we first describe the baseline systems developed for each Challenge and then we report their obtained results.
Table 2: Performance of the baseline networks on the Aff-Wild2 validation set, per Challenge.

| Challenge | Baseline Network | Performance |
|---|---|---|
| Valence-Arousal Estimation | ResNet50 | 0.24 (mean CCC) |
| Expression Recognition | VGGFACE | 0.23 (macro F1) |
| Action Unit Detection | VGGFACE | 0.39 (macro F1) |
| Multi-Task Learning | VGGFACE | 0.30 (criterion of Eq. 6) |
4.1 Baseline Systems
Valence-Arousal Estimation Challenge
The baseline network is a ResNet with 50 layers, pre-trained on ImageNet (ResNet50), with a (linear) output layer that gives the final estimates for valence and arousal.
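A minimal tf.keras sketch of such an architecture is shown below; the input resolution, global pooling, optimizer and loss are illustrative assumptions, not the organizers' exact configuration.

```python
# Sketch of the VA baseline: ImageNet-pretrained ResNet50 trunk with a
# 2-unit linear output (valence, arousal). Hyper-parameters are assumed.
import tensorflow as tf

def build_va_baseline(input_shape=(112, 112, 3)):
    trunk = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                            input_shape=input_shape, pooling='avg')
    outputs = tf.keras.layers.Dense(2, activation='linear',
                                    name='valence_arousal')(trunk.output)
    return tf.keras.Model(trunk.input, outputs)

model = build_va_baseline()
# MSE is used here for simplicity; a CCC-based loss (1 - mean CCC) is another
# natural training objective for this task.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')
```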
Expression Recognition Challenge
The baseline network is a VGG16 network with fixed (i.e., non-trainable) convolutional weights (only the 3 fully connected layers were trainable), pre-trained on the VGGFACE dataset, with an output layer equipped with a softmax activation function that gives the 8 expression predictions.
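A tf.keras sketch of this architecture follows; tf.keras does not ship VGGFACE weights, so the trunk is built without weights here and a VGGFACE checkpoint would have to be loaded into it in practice, and the fully connected layer sizes are assumptions.

```python
# Sketch of the expression baseline: frozen VGG16 convolutional layers,
# 3 trainable fully connected layers, 8-way softmax output. Weights and
# FC sizes are assumptions (VGGFACE weights would be loaded separately).
import tensorflow as tf

def build_expr_baseline(input_shape=(112, 112, 3), num_classes=8):
    trunk = tf.keras.applications.VGG16(include_top=False, weights=None,
                                        input_shape=input_shape)
    trunk.trainable = False                                  # fixed conv weights
    x = tf.keras.layers.Flatten()(trunk.output)
    x = tf.keras.layers.Dense(4096, activation='relu')(x)    # FC sizes assumed
    x = tf.keras.layers.Dense(4096, activation='relu')(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(trunk.input, outputs)

model = build_expr_baseline()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```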
Action Unit Detection Challenge
The baseline network is a VGG16 network with fixed convolutional weights (only the 3 fully connected layers were trained), pre-trained on the VGGFACE dataset, with an output layer equipped with a sigmoid activation function that gives the 12 action unit predictions.
Multi-Task-Learning Challenge
The baseline network is a VGG16 network with fixed convolutional weights (only the 3 fully connected layers were trained), pre-trained on the VGGFACE dataset. The output layer consists of 22 units: 2 linear units that give the valence and arousal predictions; 8 units equipped with a softmax activation function that give the expression predictions; and 12 units equipped with a sigmoid activation function that give the action unit predictions.
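A tf.keras sketch of this multi-head output is given below; as before, the VGGFACE trunk weights would have to be loaded separately, and the fully connected layer sizes, losses and absence of loss weighting are illustrative assumptions.

```python
# Sketch of the multi-task baseline: shared frozen VGG16 trunk, 2 trainable FC
# layers, and a 22-unit output split into linear (VA), softmax (expression)
# and sigmoid (AU) heads. FC sizes and losses are assumptions.
import tensorflow as tf

def build_mtl_baseline(input_shape=(112, 112, 3)):
    trunk = tf.keras.applications.VGG16(include_top=False, weights=None,
                                        input_shape=input_shape)
    trunk.trainable = False                                  # fixed conv weights
    x = tf.keras.layers.Flatten()(trunk.output)
    x = tf.keras.layers.Dense(4096, activation='relu')(x)
    x = tf.keras.layers.Dense(4096, activation='relu')(x)
    va = tf.keras.layers.Dense(2, activation='linear', name='va')(x)
    expr = tf.keras.layers.Dense(8, activation='softmax', name='expr')(x)
    au = tf.keras.layers.Dense(12, activation='sigmoid', name='au')(x)
    return tf.keras.Model(trunk.input, [va, expr, au])

model = build_mtl_baseline()
model.compile(optimizer='adam',
              loss={'va': 'mse',
                    'expr': 'sparse_categorical_crossentropy',
                    'au': 'binary_crossentropy'})
```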
4.2 Results
Table 2 summarizes the performance of the baseline networks on the Aff-Wild2 validation set: the ResNet50 baseline for the Valence-Arousal Estimation Challenge, and the VGGFACE baselines for the Expression Recognition, Action Unit Detection and Multi-Task-Learning Challenges.
5 Conclusion
In this paper we have presented the third Affective Behavior Analysis in-the-wild (ABAW) Competition 2022, to be held in conjunction with IEEE CVPR 2022. This Competition follows the first and second ABAW Competitions, held in conjunction with IEEE FG 2020 and ICCV 2021, respectively. It comprises four Challenges targeting: i) uni-task valence-arousal estimation, ii) uni-task expression classification, iii) uni-task action unit detection and iv) multi-task learning. The database utilized for this Competition has been derived from Aff-Wild2, the first large-scale in-the-wild database annotated for all three of these behavior tasks.
References
- [1] U Rajendra Acharya, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, Hojjat Adeli, and D Puthankattil Subha. Automated eeg-based screening of depression using deep convolutional neural network. Computer methods and programs in biomedicine, 161:103–113, 2018.
- [2] Panagiotis Antoniadis, Ioannis Pikoulis, Panagiotis P Filntisis, and Petros Maragos. An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild. arXiv preprint arXiv:2107.03465, 2021.
- [3] Wei-Yi Chang, Shih-Huan Hsu, and Jen-Hsien Chien. Fatauva-net : An integrated deep learning framework for facial attribute recognition, action unit (au) detection, and valence-arousal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2017.
- [4] Shizhe Chen, Qin Jin, Jinming Zhao, and Shuai Wang. Multimodal multi-task learning for dimensional and continuous emotion recognition. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 19–26. ACM, 2017.
- [5] Roddy Cowie, Ellen Douglas-Cowie, Susie Savvidou, Edelle McMahon, Martin Sawey, and Marc Schröder. ’feeltrace’: An instrument for recording perceived emotion in real time. In ISCA tutorial and research workshop (ITRW) on speech and emotion, 2000.
- [6] Charles Darwin and Phillip Prodger. The expression of the emotions in man and animals. Oxford University Press, USA, 1998.
- [7] Didan Deng, Zhaokang Chen, and Bertram E Shi. Fau, facial expressions, valence and arousal: A multi-task solution. arXiv preprint arXiv:2002.03557, 2020.
- [8] Didan Deng, Zhaokang Chen, and Bertram E Shi. Multitask emotion recognition with incomplete labels. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 592–599. IEEE, 2020.
- [9] Didan Deng, Liang Wu, and Bertram E Shi. Towards better uncertainty: Iterative training of efficient networks for multitask emotion recognition. arXiv preprint arXiv:2108.04228, 2021.
- [10] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5203–5212, 2020.
- [11] Nhu-Tai Do, Tram-Tran Nguyen-Quynh, and Soo-Hyung Kim. Affective expression analysis in-the-wild using multi-task temporal statistical deep learning model. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 624–628. IEEE, 2020.
- [12] Denis Dresvyanskiy, Elena Ryumina, Heysem Kaya, Maxim Markitantov, Alexey Karpov, and Wolfgang Minker. An audio-video deep and transfer learning framework for multimodal emotion recognition in the wild. arXiv preprint arXiv:2010.03692, 2020.
- [13] Paul Ekman. Facial action coding system (facs). A human face, 2002.
- [14] Nico H Frijda et al. The emotions. Cambridge University Press, 1986.
- [15] Darshan Gera and S Balasubramanian. Affect expression behaviour analysis in the wild using spatio-channel attention and complementary context information. arXiv preprint arXiv:2009.14440, 2020.
- [16] Darshan Gera and S Balasubramanian. Affect expression behaviour analysis in the wild using consensual collaborative training. arXiv preprint arXiv:2107.05736, 2021.
- [17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
- [18] Shalom Greene, Himanshu Thapliyal, and Allison Caban-Holt. A survey of affective computing for stress detection: Evaluating technologies in stress detection for better health. IEEE Consumer Electronics Magazine, 5(4):44–56, 2016.
- [19] Shizhong Han, Zibo Meng, Ahmed-Shehab Khan, and Yan Tong. Incremental boosting convolutional neural network for facial action unit recognition. In Advances in neural information processing systems, pages 109–117, 2016.
- [20] Xianpeng Ji, Yu Ding, Lincheng Li, Yu Chen, and Changjie Fan. Multi-label relation modeling in facial action units detection. arXiv preprint arXiv:2002.01105, 2020.
- [21] Yue Jin, Tianqing Zheng, Chao Gao, and Guoqiang Xu. A multi-modal and multi-task learning method for action unit and expression recognition. arXiv preprint arXiv:2107.04187, 2021.
- [22] Junghoe Kim, Vince D Calhoun, Eunsoo Shim, and Jong-Hwan Lee. Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: Evidence from whole-brain resting-state functional connectivity patterns of schizophrenia. Neuroimage, 124:127–146, 2016.
- [23] Dimitrios Kollias, Mihalis A Nicolaou, Irene Kotsia, Guoying Zhao, and Stefanos Zafeiriou. Recognition of affect in the wild using deep neural networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1972–1979. IEEE, 2017.
- [24] Dimitrios Kollias, Attila Schulc, Elnar Hajiyev, and Stefanos Zafeiriou. Analysing affective behavior in the first abaw 2020 competition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG), pages 794–800. IEEE Computer Society, 2020.
- [25] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111, 2019.
- [26] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790, 2021.
- [27] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, 127(6-7):907–929, 2019.
- [28] Dimitrios Kollias and Stefanos Zafeiriou. Aff-wild2: Extending the aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770, 2018.
- [29] Dimitrios Kollias and Stefanos Zafeiriou. A multi-task learning & generation framework: Valence-arousal, action units & primary expressions. arXiv preprint arXiv:1811.07771, 2018.
- [30] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855, 2019.
- [31] Dimitrios Kollias and Stefanos Zafeiriou. Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792, 2021.
- [32] Dimitrios Kollias and Stefanos Zafeiriou. Analysing affective behavior in the second abaw2 competition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3652–3660, 2021.
- [33] Felix Kuhnke, Lars Rumberg, and Jörn Ostermann. Two-stream aural-visual affect analysis in the wild. arXiv preprint arXiv:2002.03399, 2020.
- [34] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [35] I Li et al. Technical report for valence-arousal estimation on affwild2 dataset. arXiv preprint arXiv:2105.01502, 2021.
- [36] Hanyu Liu, Jiabei Zeng, Shiguang Shan, and Xilin Chen. Emotion recognition for in-the-wild videos. arXiv preprint arXiv:2002.05447, 2020.
- [37] Shuyi Mao, Xinqi Fan, and Xiaojiang Peng. Spatial and temporal networks for facial expression recognition in the wild videos. arXiv preprint arXiv:2107.05160, 2021.
- [38] Ibrahim M Nasser, Mohammed O Al-Shawwa, and Samy S Abu-Naser. Artificial neural network for diagnose autism spectrum disorder. 2019.
- [39] Geesung Oh, Euiseok Jeong, and Sejoon Lim. Causal affect prediction model using a facial image sequence. arXiv preprint arXiv:2107.03886, 2021.
- [40] Jaspar Pahl, Ines Rieger, and Dominik Seuss. Multi-label class balancing algorithm for action unit detection. arXiv preprint arXiv:2002.03238, 2020.
- [41] Hai X Pham, Yuting Wang, and Vladimir Pavlovic. Generative adversarial talking head: Bringing portraits to life with a weakly supervised neural network. arXiv preprint arXiv:1803.07716, 2018.
- [42] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2018.
- [43] James A Russell. Evidence of convergent validity on the dimensions of affect. Journal of personality and social psychology, 36(10):1152, 1978.
- [44] Junya Saito, Xiaoyu Mi, Akiyoshi Uchida, Sachihiro Youoku, Takahisa Yamamoto, and Kentaro Murase. Action units recognition using improved pairwise deep architecture. arXiv preprint arXiv:2107.03143, 2021.
- [45] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–2395, 2016.
- [46] Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias Nießner. Headon: real-time reenactment of human portrait videos. ACM Transactions on Graphics (TOG), 37(4):164, 2018.
- [47] Manh Tu Vu and Marie Beurton-Aimar. Multitask multi-database emotion recognition. arXiv preprint arXiv:2107.04127, 2021.
- [48] Lingfeng Wang and Shisen Wang. A multi-task mean teacher for semi-supervised facial affective behavior analysis. arXiv preprint arXiv:2107.04225, 2021.
- [49] C. M. Whissell. The dictionary of affect in language. In R. Plutchik and H. Kellerman, editors, Emotion: Theory, Research, and Experience: Vol. 4, The Measurement of Emotions. Academic, New York, 1989.
- [50] Hong-Xia Xie, I Li, Ling Lo, Hong-Han Shuai, Wen-Huang Cheng, et al. Technical report for valence-arousal estimation in abaw2 challenge. arXiv preprint arXiv:2107.03891, 2021.
- [51] Sachihiro Youoku, Yuushi Toyoda, Takahisa Yamamoto, Junya Saito, Ryosuke Kawamura, Xiaoyu Mi, and Kentaro Murase. A multi-term and multi-task analyzing framework for affective analysis in-the-wild. arXiv preprint arXiv:2009.13885, 2020.
- [52] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. Aff-wild: Valence and arousal’in-the-wild’challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–41, 2017.
- [53] Sebastian Zepf, Javier Hernandez, Alexander Schmitt, Wolfgang Minker, and Rosalind W Picard. Driver emotion recognition for intelligent vehicles: A survey. ACM Computing Surveys (CSUR), 53(3):1–30, 2020.
- [54] Su Zhang, Yi Ding, Ziquan Wei, and Cuntai Guan. Audio-visual attentive fusion for continuous emotion recognition. arXiv preprint arXiv:2107.01175, 2021.
- [55] Wei Zhang, Zunhu Guo, Keyu Chen, Lincheng Li, Zhimeng Zhang, and Yu Ding. Prior aided streaming network for multi-task affective recognition at the 2nd abaw2 competition. arXiv preprint arXiv:2107.03708, 2021.
- [56] Yuan-Hang Zhang, Rulin Huang, Jiabei Zeng, Shiguang Shan, and Xilin Chen. M3T: Multi-modal continuous valence-arousal estimation in the wild. arXiv preprint arXiv:2002.02957, 2020.