
Data Augmentation for Human Behavior Analysis in Multi-Person Conversations

Kun Li (kunli.hfut@gmail.com), Hefei University of Technology, Hefei, Anhui, China, 230601
Dan Guo (guodan@hfut.edu.cn), Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China, 230601
Guoliang Chen (gouliangchen.hfut@gmail.com), Hefei University of Technology, Hefei, Anhui, China, 230601
Feiyang Liu (mffff0120@gmail.com), Hefei University of Technology, Hefei, Anhui, China, 230601
Meng Wang (eric.mengwang@gmail.com), Hefei University of Technology; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China, 230601
(2023)
Abstract.

In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select the Swin Transformer as the baseline and exploit data augmentation strategies to address the three tasks. Specifically, we crop the raw video to remove noise from irrelevant regions. At the same time, we utilize data augmentation to improve the generalization of the model. As a result, our solution achieves the best mean average precision of 0.6262 for bodily behavior recognition and the best accuracy of 0.7771 for eye contact detection on the corresponding test sets. In addition, our approach achieves a comparable unweighted average recall of 0.5281 for next speaker prediction.

Video understanding, bodily behavior recognition, eye contact detection, next speaker prediction, multi-person conversation
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 31st ACM International Conference on Multimedia, October 29–November 3, 2023, Ottawa, ON, Canada; booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), October 29–November 3, 2023, Ottawa, ON, Canada; price: 15.00; doi: 10.1145/3581783.3612856; isbn: 979-8-4007-0108-5/23/10; ccs: Human-centered computing, HCI theory, concepts and models; ccs: Computing methodologies, Artificial intelligence; ccs: Computing methodologies, Computer vision
Figure 1. Video samples from the MPIIGroupInteraction (Müller et al., 2018a) and BBSI (Balazia et al., 2022) datasets. For bodily behavior recognition, we remove the irrelevant background. For eye contact detection, we use OpenFace (Baltrusaitis et al., 2018) to extract the face region. For next speaker prediction, we use the raw frame as input data.

1. Introduction

Figure 2. The architecture overview of our solution for (a) bodily behavior recognition, (b) eye contact detection, and (c) next speaker prediction. For task (a), we remove the irrelevant background for action classification. For task (b), we extract the face region to train the model. For task (c), we directly use the raw frames as input data for the model.

Multi-modal Group Behaviour Analysis for Artificial Mediation (MultiMediate) is a grand challenge at ACM Multimedia 2023. It builds on the MPIIGroupInteraction dataset (Müller et al., 2018a) and the NoXi dataset (Cafaro et al., 2017) and asks researchers to study key problems in group behavior perception and analysis. This year, the challenge comprises six tasks, i.e., bodily behavior recognition, eye contact detection, next speaker prediction, backchannel detection, engagement estimation, and agreement estimation. We participated in the first three. Bodily behavior recognition (Müller et al., 2023) is formulated as a 14-class multi-label classification task, in which participants are required to predict the likelihood of each behavior class from 64-frame video snippets. The eye contact detection task (Müller et al., 2018b) requires predicting the eye contact class of each participant at the end of the 10s context window. The next speaker prediction task (Müller et al., 2021) requires predicting the speaking status of each participant one second after the end of the 10s context window. Our main contributions are summarized as follows:

  • We explore different data augmentation strategies while training the model, which directly lead to accuracy improvements over the baseline model.

  • We explore a data processing strategy for the bodily behavior recognition task that filters out irrelevant background information to improve the model’s classification accuracy.

  • For both the bodily behavior recognition and eye contact detection tasks, our solution achieves the top-1 performance on the leaderboard, with an mAP of 0.6262 and an accuracy of 0.7771, respectively.

2. Methodology

2.1. Bodily Behaviour Recognition

Bodily behavior recognition aims to recognize the category of behavior (e.g., shrug, leg crossed, and fold arms in Figure 1) from videos captured under different views, so it is formulated as a multi-label classification task. As shown in Figure 2, given the raw video sequence $V\in\mathbb{R}^{T\times 1000\times 1000\times 3}$, we first remove the irrelevant background to construct the cropped video $V'\in\mathbb{R}^{T\times 1000\times 600\times 3}$. Subsequently, following the official baseline (Müller et al., 2023), we utilize the Video Swin Transformer (Liu et al., 2022b) as the backbone to build a multi-label classification model. Data augmentation plays a crucial role in deep learning: by expanding and diversifying the training data, it enhances the model’s ability to generalize and learn robust representations. As shown in Figure 2, the participants’ clothing and the background colors in the video are very similar, which easily causes confusion between foreground and background and degrades recognition accuracy. Based on this characteristic, we utilize the “ColorJitter” data augmentation strategy to enhance the model’s sensitivity to the background. For detailed analysis and experimental results, please refer to Section 3.2.
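For illustration, the following is a minimal sketch of how such clip-level augmentation could be applied with torchvision; the jitter strengths, erasing parameters, and per-frame application are illustrative assumptions rather than the exact settings of our training pipeline, which is configured through MMAction2.

```python
import torch
from torchvision import transforms as T

# Illustrative clip-level augmentation for a video tensor of shape (T, C, H, W).
# The jitter strengths and erasing parameters below are assumptions; Section 3.2
# only reports raising the erasing probability from 0.25 to 0.5.
color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)
random_erasing = T.RandomErasing(p=0.5, scale=(0.02, 0.33), value='random')

def augment_clip(clip: torch.Tensor) -> torch.Tensor:
    """Apply ColorJitter and RandomErasing to every frame of a clip.

    Note: sampling the jitter per frame is the simplest option; a clip-level
    pipeline would typically keep the jitter consistent across frames.
    """
    frames = []
    for frame in clip:                    # frame: (C, H, W), float values in [0, 1]
        frame = color_jitter(frame)
        frame = random_erasing(frame)
        frames.append(frame)
    return torch.stack(frames)

# Example: a cropped 64-frame clip V' with frames of height 1000 and width 600.
clip = torch.rand(64, 3, 1000, 600)
aug_clip = augment_clip(clip)             # same shape as the input clip
```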

2.2. Eye Contact Detection

Eye contact detection aims to estimate the gaze direction of each participant. In this sub-challenge, eye contact is defined as a discrete indication of whether a participant is looking at another participant’s face. Thus, the sub-challenge can be formulated as a multi-class (i.e., left, frontal, right, and no eye contact) classification task. Following previous work (Müller et al., 2021; Ma et al., 2022), as shown in Figure 2 (b), we first use OpenFace 2.0 (Baltrusaitis et al., 2018) to detect and align faces in the original video and crop them to 224×224 images. These images are then input to a Swin Transformer (Liu et al., 2021) for multi-class classification. Similar to the bodily behavior recognition task, we apply the “ColorJitter” and “Lighting” data augmentation strategies during training to improve the robustness of the model.
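As a concrete illustration of the “Lighting” strategy (AlexNet-style PCA color jitter) on a 224×224 face crop, the sketch below uses the commonly adopted ImageNet PCA statistics; the jitter scale alpha_std is an assumption rather than the exact value used in our experiments.

```python
import torch

# ImageNet RGB PCA statistics commonly used for AlexNet-style lighting noise.
EIGVAL = torch.tensor([0.2175, 0.0188, 0.0045])
EIGVEC = torch.tensor([[-0.5675,  0.7192,  0.4009],
                       [-0.5808, -0.0045, -0.8140],
                       [-0.5836, -0.6948,  0.4203]])

def pca_lighting(img: torch.Tensor, alpha_std: float = 0.1) -> torch.Tensor:
    """Add PCA-based lighting noise to an RGB image of shape (3, H, W) in [0, 1]."""
    alpha = torch.randn(3) * alpha_std        # random weights for the three components
    rgb_shift = EIGVEC @ (alpha * EIGVAL)     # shift along the principal color directions
    return (img + rgb_shift.view(3, 1, 1)).clamp(0.0, 1.0)

face = torch.rand(3, 224, 224)                # an aligned 224x224 OpenFace crop
face_aug = pca_lighting(face)
```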

2.3. Next Speaker Prediction

Next speaker prediction aims to predict the speaking status of each participant one second after the end of the 10s context window. Thus, this sub-challenge can be formulated as a multi-label classification task. Unlike previous work (Ma et al., 2022), we attempt to predict the next speaker directly from the original video frames without using cropped facial features. As shown in Figure 2 (c), we first extract the video frames and then concatenate them according to the viewpoint order given in the annotation file. The concatenated frames are input to the Video Swin Transformer (Liu et al., 2022b), which performs multi-label classification to predict the next speaker. Due to the limited time available during the challenge, we did not attempt to extract facial features or apply data augmentation to further improve prediction accuracy.
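A minimal sketch of how the per-viewpoint frames could be assembled into the model input is given below; the number of viewpoints, the frame size, and the tensor layout (viewpoints acting as the temporal axis of the video backbone) are illustrative assumptions.

```python
from typing import List

import torch

def build_sample(last_frames: List[torch.Tensor]) -> torch.Tensor:
    """Stack the last frame of each viewpoint (in annotation order) into a
    pseudo-temporal clip of shape (C, T, H, W), where the viewpoints play the
    role of the temporal axis expected by the video backbone."""
    clip = torch.stack(last_frames, dim=0)    # (num_views, C, H, W)
    return clip.permute(1, 0, 2, 3)           # (C, num_views, H, W)

# Example: a group recorded from 4 viewpoints, each last frame resized to 224x224.
frames = [torch.rand(3, 224, 224) for _ in range(4)]
x = build_sample(frames)                      # shape (3, 4, 224, 224)
```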

3. Experiments

3.1. Experiment Setup

3.1.1. Datasets

The MultiMediate’23 challenge uses the MPIIGroupInteraction dataset (Müller et al., 2018a) with the BBSI annotation (Balazia et al., 2022) for bodily behavior recognition, eye contact detection, and next speaker prediction. The dataset contains the annotated bodily behavior categories plus one background category, with a total duration of 26 hours. For the bodily behavior recognition task, there are 43,712 video clips in total, of which 31,221, 11,496, and 995 instances belong to the training, validation, and test sets, respectively. For the eye contact detection and next speaker prediction sub-challenges, the data are taken from the MPIIGroupInteraction dataset, consisting of 22 group conversations and 78 German-speaking participants. Each group consists of three to four participants, with a video camera behind each participant recording the view from that participant’s perspective. Specifically, for the eye contact detection task, there are 4,504 training samples, 1,672 validation samples, and 1,848 test samples. For the next speaker prediction task, there are 7,546 training samples, 4,036 validation samples, and 5,222 test samples. In addition, the challenge also provides the individual viewpoints of each participant, so each sample has 3 to 4 viewpoints. The training set contains 30,184 multi-view videos, while the validation and test sets consist of 16,144 and 20,888 multi-view video files, respectively.

3.1.2. Implementation Details

Data Processing: For the bodily behavior recognition task, the resolution of the officially provided raw video is 1000×1000. However, due to the camera viewpoint, other participants occasionally appeared at the edges of the frame, which interfered with recognizing the actions of the participant in the middle position. Therefore, we made an initial crop of the video by removing 200 pixels from both the left and right sides, obtaining videos with a resolution of 600×1000. For the eye contact detection task, we first utilized OpenFace (Baltrusaitis et al., 2018) to detect and align faces in the original video and cropped them to 224×224 images. For frames where no face could be detected, we used a black image instead. For the next speaker prediction task, we only adopted the last frame of each video. Considering that a group interaction has 3 to 4 different viewpoints, we used the viewpoint order in the annotation to combine the frames into a sequence that is fed into the model.
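The initial side crop can be implemented as a simple array slice; the sketch below assumes frames are loaded as NumPy arrays in H×W×C layout.

```python
import numpy as np

def crop_sides(frame: np.ndarray, margin: int = 200) -> np.ndarray:
    """Remove `margin` pixels from the left and right of an (H, W, C) frame."""
    return frame[:, margin:frame.shape[1] - margin, :]

frame = np.zeros((1000, 1000, 3), dtype=np.uint8)   # a raw 1000x1000 frame
cropped = crop_sides(frame)                          # shape (1000, 600, 3)
```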

Model Parameters: For the Video Swin Transformer (Liu et al., 2022b) used in bodily behavior recognition, we use the base version pre-trained on the Kinetics-400 dataset (Kay et al., 2017). We set the batch size to 2 with an initial learning rate of 3e-4. In addition, the model is trained for 10 epochs with an early stopping strategy. For the bodily behavior recognition and next speaker prediction tasks, our code is implemented with the open-source toolbox MMAction2 (Contributors, 2020). For eye contact detection, we implement the code with the toolbox MMPreTrain (Contributors, 2023).
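For reference, the sketch below outlines the training setup (batch size 2, initial learning rate 3e-4, up to 10 epochs with early stopping); the optimizer type, patience value, and loss function are illustrative assumptions, since in practice these settings are handled by the MMAction2 and MMPreTrain configurations.

```python
import torch

def train(model, train_loader, val_loader, evaluate, epochs=10, patience=3):
    """Train for up to `epochs` epochs; stop early when the validation metric
    (e.g., mAP for bodily behavior recognition) stops improving."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # optimizer type is an assumption
    criterion = torch.nn.BCEWithLogitsLoss()                    # multi-label loss for task (a)
    best_score, bad_epochs = float('-inf'), 0

    for epoch in range(epochs):
        model.train()
        for clips, labels in train_loader:                      # batches of size 2
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()

        score = evaluate(model, val_loader)                     # validation metric
        if score > best_score:
            best_score, bad_epochs = score, 0
            torch.save(model.state_dict(), 'best.pth')          # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                          # early stopping
                break
    return best_score
```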

Evaluation Metrics: Since bodily behavior recognition is a multi-label classification task, the mean average precision (mAP) is adopted as the evaluation metric. Eye contact detection is formulated as single-label classification; thus, we use accuracy to evaluate the model’s performance. For next speaker prediction, we adopt the unweighted average recall (UAR) over all samples to evaluate the model.
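These metrics can be computed with scikit-learn as sketched below, where UAR corresponds to macro-averaged recall; the toy inputs are only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# mAP for multi-label bodily behavior recognition:
# y_true is (N, 14) binary, y_score holds the predicted per-class probabilities.
y_true = np.zeros((8, 14), dtype=int)
y_true[::2] = 1                                   # ensure every class has positives
y_score = np.random.rand(8, 14)
mAP = average_precision_score(y_true, y_score, average='macro')

# Accuracy for single-label eye contact detection (4 classes).
acc = accuracy_score([0, 1, 2, 3], [0, 1, 2, 2])

# UAR (unweighted average recall) for next speaker prediction = macro-averaged recall.
uar = recall_score([0, 1, 0, 1], [0, 1, 1, 1], average='macro')
```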

Table 1. Performance comparison of different methods on bodily behavior recognition in terms of mAP (mean average precision) metric.
Setting | Method | Val | Test
random | Baseline (Müller et al., 2023) | 0.0884 | 0.2355
w/o bg, frontal view | Baseline (Müller et al., 2023) | 0.3974 | 0.5315
w/o bg, mean of all views | Baseline (Müller et al., 2023) | 0.4099 | 0.5628
w/o bg, frontal view | PoseC3D (Duan et al., 2022) | 0.3810 | -
w/o bg, cropped frontal view | Ours | 0.4822 | 0.6262
Table 2. Ablation study of data augmentation in bodily behavior recognition in terms of mAP metric. “CJ” and “RE” denote the “ColorJitter” and “RandomErasing” data augmentation strategies, respectively.
Setting | Method | Val | Test
w/o bg, frontal view | Baseline (Müller et al., 2023) | 0.4679 | -
w/o bg, frontal view | Baseline (Müller et al., 2023) + CJ + RE | 0.4688 | 0.6143
w/o bg, cropped frontal view | Baseline (Müller et al., 2023) | 0.4757 | 0.6139
w/o bg, cropped frontal view | Baseline (Müller et al., 2023) + CJ + RE (Ours) | 0.4822 | 0.6262

3.2. Results for Bodily Behavior Recognition

As shown in Table 1, we report the performance comparison between our approach and other methods. Considering the large amount of data for behavior recognition, we first attempted to use the officially provided skeleton data. Specifically, we trained the PoseC3D model (Duan et al., 2022) with frontal-view skeleton data as input and achieved an mAP of 0.3810 on the validation set. We believe that directly applying PoseC3D cannot achieve good results on this task and that further adjustments to the input features are needed. Considering the limited number of online evaluations, we did not make in-depth improvements to skeleton-based action recognition models. Compared with the baseline using the frontal view as input, our approach achieves the best test-set performance of 0.6262, an improvement of 9.47 percentage points. Due to limitations on training time, we did not use videos from all views to train our model. Observing the third row of Table 1, we can see that using videos from all views significantly improves performance (i.e., 0.5628 vs. 0.5315 on the test set). We believe our method would further improve if videos from all views were used for training. As shown in Table 2, we report the ablation study of our method. The RandomErasing (Zhong et al., 2020) strategy randomly erases a rectangular region with random values and is widely used in image and video classification tasks. Specifically, we increase the probability of “RandomErasing” from the initial value of 0.25 to 0.5. Observing the first and third rows of Table 2, we can see that the cropped videos improve the mAP (0.4757 vs. 0.4679 on the validation set). Observing the third and fourth rows, we can conclude that the “ColorJitter” and “RandomErasing” data augmentation strategies further enhance the mAP of the model (i.e., 0.6262 vs. 0.6139 on the test set).

3.3. Results for Eye Contact Detection

The validation and test set results for eye contact detection are shown in Table 3. Compared with SVM-RBF (Müller et al., 2021), our method achieves a relative improvement of 49.44% in accuracy on the test set. MH (Fu and Ngai, 2021) and the Baseline (Müller et al., 2022) employ different feature sampling strategies. For improved accuracy, the baseline trains four eye contact detection models, one for each seating position, and obtains accuracies of 0.64 and 0.576 on the validation and test sets, respectively. We use the face images extracted by OpenFace 2.0 (Baltrusaitis et al., 2018) as input. Compared with the latest method, TA-CNN (Ma et al., 2022), including its variant trained on both the training and validation sets, our method achieves the best accuracy of 0.8122 on the validation set and 0.7771 on the test set. The significant improvement indicates that our method has superior generalization capability.

The ablation study results for eye contact detection are shown in Table 4. First, we use the face feature of the last frame of each video as input and adopt Swin V1-B (base model) and Swin V1-L (large model) as baseline methods. On the validation set, Swin V1-B achieves 0.8175, i.e., 0.71 percentage points higher than Swin V1-L. We attribute this to the fact that the Swin V1-L model may be difficult to fully converge on this small dataset. Second, we expand the dataset using face images from the last 5 frames of each video and add two data augmentation methods: “ColorJitter” randomly changes the brightness, contrast, and saturation of an image, and “Lighting” adjusts image lighting using AlexNet-style PCA jitter. Both are widely used image transforms, and both contribute improvements on the validation set. Finally, we also experiment with the latest Swin V2 model (Liu et al., 2022a) together with the data augmentation methods mentioned above, reaching 0.8307 on the validation set. Due to the limit on the number of test submissions, we are unable to provide results on the test set.

Table 3. Performance comparison of different methods on eye contact detection in terms of the accuracy metric. For some methods, both training and validation sets were used for training.
Feature | Method | Venue | Val | Test
OpenFace | SVM-RBF (Müller et al., 2021) | MM’21 | - | 0.52
MHI | MH (Fu and Ngai, 2021) | MM’21 | - | 0.56
Gaze | Baseline (Müller et al., 2022) | MM’22 | 0.61 | -
Headpose | Baseline (Müller et al., 2022) | MM’22 | 0.61 | -
Gaze + Headpose | Baseline (Müller et al., 2022) | MM’22 | 0.64 | 0.576
OpenFace | TA-CNN (Ma et al., 2022) | MM’22 | 0.7605 | 0.7148
OpenFace | TA-CNN (Ma et al., 2022) | MM’22 | - | 0.7261
OpenFace | Ours | 2023 | 0.8122 | 0.7771
Table 4. Ablation study of our model in eye contact detection in terms of the accuracy metric. “#1” and “#5” denote that 1 and 5 frames are used, respectively. “CJ”, “L”, and “RA” denote the “ColorJitter”, “Lighting”, and “RandAugment” data augmentation strategies, respectively.
Feature | Backbone | Method | Val
OpenFace (#1) | Swin V1-B (Liu et al., 2021) | Baseline | 0.8175
OpenFace (#1) | Swin V1-L (Liu et al., 2021) | Baseline | 0.8104
OpenFace (#5) | Swin V1-B (Liu et al., 2021) | Baseline | 0.8122
OpenFace (#5) | Swin V1-B (Liu et al., 2021) | +CJ | 0.8188
OpenFace (#5) | Swin V1-B (Liu et al., 2021) | +L | 0.8205
OpenFace (#1) | Swin V2-B (Liu et al., 2022a) | Baseline | 0.8272
OpenFace (#1) | Swin V2-B (Liu et al., 2022a) | +RA | 0.8277
OpenFace (#1) | Swin V2-B (Liu et al., 2022a) | +L | 0.8307

3.4. Results for Next Speaker Prediction

As shown in Table 5, we report the experimental results compared to previous methods. The baseline methods utilize static and dynamic features extracted frame by frame over the complete input video via OpenFace 2.0. GLFVA (Birmingham et al., 2021) combines group-level focus of visual attention features with publicly available audio-video synchronizer models and achieved the best score of 0.632 on the test set in this sub-challenge. TA-CNN (Ma et al., 2022) utilizes group-level features, achieving 0.6538 and 0.5766 on the validation and test sets, respectively. Similar to the sampling strategy of TA-CNN, we extract the last frame of the video from each viewpoint for every sample. The frame sequence, organized according to the viewpoint order, is then used as input to the network. Considering the limited computing resources and evaluation chances, we did not have time to experiment with the facial-feature-based methodology of previous work (Birmingham et al., 2021; Ma et al., 2022). Unlike the above methods, which use individually cropped face images, we attempted to use the complete original frames to capture changes in the physical behavior of different participants and predict the next speaker from them. Although our result on the test set only reaches 0.5281, it represents a new attempt at exploring the next speaker prediction task.

Table 5. Performance comparison of different methods on next speaker prediction in terms of the UAR (unweighted average recall) metric. For some methods, both training and validation sets were used for training.
Feature | Method | Venue | Val | Test
Group S | Baseline (Müller et al., 2021) | MM’21 | 0.60 | -
Group D | Baseline (Müller et al., 2021) | MM’21 | 0.53 | -
Group (S+D) | Baseline (Müller et al., 2021) | MM’21 | 0.60 | -
Video & Audio | GLFVA (Birmingham et al., 2021) | MM’21 | - | 0.632
OpenFace | TA-CNN (Ma et al., 2022) | MM’22 | 0.6538 | 0.5766
OpenFace | TA-CNN (Ma et al., 2022) | MM’22 | - | 0.5965
Raw frame | Ours | 2023 | - | 0.5281

4. Conclusions

In this work, we presented our solution for the MultiMediate grand challenge hosted at ACM Multimedia 2023. For bodily behavior recognition, by combining the Video Swin Transformer with heuristic data processing and data augmentation strategies, our approach achieved first place with an mAP of 0.6262. For eye contact detection, the face regions extracted by OpenFace from the last five frames are input to the Swin Transformer baseline, resulting in first place with an accuracy of 0.7771. For the next speaker prediction task, we utilized raw frames as the input of the Video Swin Transformer and achieved a comparable performance of 0.5281 on the test set.

In the future, we would like to apply skeleton-based action recognition methods (Li et al., 2023) and mask irrelevant backgrounds with segmentation maps (Agrawal et al., 2023) for bodily behavior recognition. In addition, we would also like to exploit the temporal relations (Li et al., 2021; Guo et al., 2019) in the video sequence for eye contact detection. For the next speaker prediction task, we plan to use a model pre-trained on a face recognition dataset to extract face region features and improve the classification accuracy of the model.

Acknowledgements.
This work was supported by the National Key R&D Program of China (2022YFB4500600), the National Natural Science Foundation of China (62272144, 72188101, 62020106007, and U20A20183), and the Major Project of Anhui Province (202203a05020011).

References

  • Agrawal et al. (2023) Tanay Agrawal, Michal Balazia, Philipp Müller, and François Brémond. 2023. Multimodal Vision Transformers with Forced Attention for Behavior Analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3392–3402.
  • Balazia et al. (2022) Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August von Liechtenstein, and Francois Bremond. 2022. Bodily behaviors in social interaction: Novel annotations and state-of-the-art evaluation. In Proceedings of the 30th ACM International Conference on Multimedia. 70–79.
  • Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition. IEEE, 59–66.
  • Birmingham et al. (2021) Chris Birmingham, Kalin Stefanov, and Maja J Mataric. 2021. Group-level focus of visual attention for improved next speaker prediction. In Proceedings of the 29th ACM International Conference on Multimedia. 4838–4842.
  • Cafaro et al. (2017) Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar. 2017. The NoXi database: multimodal recordings of mediated novice-expert interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 350–359.
  • Contributors (2020) MMAction2 Contributors. 2020. OpenMMLab’s Next Generation Video Understanding Toolbox and Benchmark. https://github.com/open-mmlab/mmaction2.
  • Contributors (2023) MMPreTrain Contributors. 2023. OpenMMLab’s Pre-training Toolbox and Benchmark. https://github.com/open-mmlab/mmpretrain.
  • Duan et al. (2022) Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. 2022. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2969–2978.
  • Fu and Ngai (2021) Eugene Yujun Fu and Michael W Ngai. 2021. Using motion histories for eye contact detection in multiperson group conversations. In Proceedings of the 29th ACM International Conference on Multimedia. 4873–4877.
  • Guo et al. (2019) Dan Guo, Shengeng Tang, and Meng Wang. 2019. Connectionist temporal modeling of video and language: a joint model for translation and sign labeling.. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 751–757.
  • Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
  • Li et al. (2023) Kun Li, Dan Guo, Guoliang Chen, Xinge Peng, and Meng Wang. 2023. Joint Skeletal and Semantic Embedding Loss for Micro-gesture Classification. arXiv preprint arXiv:2307.10624 (2023).
  • Li et al. (2021) Kun Li, Dan Guo, and Meng Wang. 2021. Proposal-free video grounding with contextual pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 1902–1910.
  • Liu et al. (2022a) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. 2022a. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12009–12019.
  • Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
  • Liu et al. (2022b) Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022b. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3202–3211.
  • Ma et al. (2022) Fuyan Ma, Ziyu Ma, Bin Sun, and Shutao Li. 2022. TA-CNN: A Unified Network for Human Behavior Analysis in Multi-Person Conversations. In Proceedings of the 30th ACM International Conference on Multimedia. 7099–7103.
  • Müller et al. (2023) Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, and Andreas Bulling. 2023. MultiMediate’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions. In Proceedings of the 31st ACM International Conference on Multimedia.
  • Müller et al. (2022) Philipp Müller, Michael Dietz, Dominik Schiller, Dominike Thomas, Hali Lindsay, Patrick Gebhard, Elisabeth André, and Andreas Bulling. 2022. MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions. In Proceedings of the 30th ACM International Conference on Multimedia. 7109–7114.
  • Müller et al. (2021) Philipp Müller, Michael Dietz, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Patrick Gebhard, Elisabeth André, and Andreas Bulling. 2021. Multimediate: Multi-modal group behaviour analysis for artificial mediation. In Proceedings of the 29th ACM International Conference on Multimedia. 4878–4882.
  • Müller et al. (2018a) Philipp Müller, Michael Xuelin Huang, and Andreas Bulling. 2018a. Detecting low rapport during natural interactions in small groups from non-verbal behaviour. In 23rd International Conference on Intelligent User Interfaces. 153–164.
  • Müller et al. (2018b) Philipp Müller, Michael Xuelin Huang, Xucong Zhang, and Andreas Bulling. 2018b. Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. 1–10.
  • Zhong et al. (2020) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 13001–13008.