AnxietyFaceTrack: A Smartphone-Based Non-Intrusive Approach for Detecting Social Anxiety Using Facial Features
Abstract.
Social Anxiety Disorder (SAD) is a widespread mental health condition, yet its lack of objective markers hinders timely detection and intervention. While previous research has focused on behavioral and non-verbal markers of SAD in structured activities (e.g., speeches or interviews), these settings do not fully replicate real-world, unstructured social interactions. Identifying non-verbal markers in naturalistic, unstaged environments is essential for developing ubiquitous and non-intrusive monitoring solutions. To address this gap, we present AnxietyFaceTrack, a study leveraging facial video analysis to detect anxiety in unstaged social settings. A cohort of 91 participants engaged in a social setting with unfamiliar individuals, and their facial videos were recorded using a low-cost smartphone camera. We examined facial features, including eye movements, head position, facial landmarks, and facial action units, and used self-reported survey data to establish ground truth for multiclass (anxious, neutral, non-anxious) and binary (e.g., anxious vs. neutral) classifications. Our results demonstrate that a Random Forest classifier trained on the top 20% of features achieved the highest accuracy of 91.0% for multiclass classification and an average accuracy of 92.33% across binary classifications. Notably, head position and facial landmarks yielded the best performance among individual facial regions, achieving 85.0% and 88.0% accuracy, respectively, in multiclass classification, and 89.66% and 91.0% accuracy, respectively, across binary classifications. Post-hoc analysis identified head rotation (x-axis), facial edge features, and eye landmarks as key contributors to detecting anxiety. This study introduces a non-intrusive, cost-effective solution that can be seamlessly integrated into everyday smartphones for continuous anxiety monitoring, offering a promising pathway for early detection and intervention.

1. Introduction
Social Anxiety Disorder (SAD) is characterized by excessive fear and worry and can manifest in various ways, including physical symptoms such as a racing heartbeat and sweating, mental symptoms like restlessness and pervasive fear, and behavioral symptoms such as avoidance of social activities (Szuhany and Simon, 2022). One in every eight individuals experiences a mental disorder, with anxiety being the most prevalent (Men, 2022). According to the Global Burden of Disease 2019 study, anxiety disorders rank as the second leading mental health-related contributor to disability-adjusted life-years (DALYs) and years lived with disability (YLDs) globally (Xiong et al., 2022). Furthermore, the COVID-19 pandemic led to a significant 26% increase in the number of individuals suffering from anxiety disorders (Men, 2022).
Currently, SAD diagnosis relies on traditional methods because reliable, objective markers (or measures) of the disorder are lacking. These methods include clinical interviews and clinically validated retrospective self-report questionnaires. However, both have limitations: clinical interviews are prone to human bias and depend on the subject's motivation to attend multiple sessions, while self-reports depend on the subject's willingness to disclose their behaviors and are prone to recall bias. Thus, given the high prevalence of SAD, there is a need for automated, reliable measures that are not susceptible to human bias.
Ongoing research on mental disorder detection has explored unobtrusive objective markers using methods such as wearable sensors, speech analysis, and mobile phone data (Rashid et al., 2020; Salekin et al., 2018). However, the findings remain inconclusive. While facial images and videos have shown promise in detecting depression (Nepal et al., 2024; Ringeval et al., 2019), there is a notable lack of studies focusing on the detection of SAD using these methods. Prior research indicates that individuals with SAD often exhibit behavioral and non-verbal cues (Gilboa-Schechtman and Shachar-Lavie, 2013), including restlessness (Chand and Marwaha, 2024), reduced motion (Chand and Marwaha, 2024), avoidance of eye contact (Schneier et al., 2011), gaze fixation, slumped posture, and closed body language (Weeks et al., 2011).
Existing research on detecting mental disorders or emotional states from video predominantly relies on participants engaging in structured, anxiety-inducing tasks, such as delivering speeches (Harshit et al., 2024), introducing themselves (Shafique et al., 2022; Giannakakis et al., 2017), or watching stressful videos (Giannakakis et al., 2017). These studies often depend on high-end recording equipment (Gavrilescu and Vizireanu, 2019), such as RGB-Depth cameras (Horigome et al., 2020), which may not be feasible for widespread use. However, there is a significant gap in exploring whether anxiety can be detected in natural, unstaged social interactions using low-cost video cameras without requiring participants to engage in specific activities. Addressing this gap is critical for making anxiety detection more accessible and applicable to real-world scenarios.
To address the gap in detecting SAD in natural social settings without requiring participants to perform specific activities, we developed AnxietyFaceTrack. It enables the observation of natural expressions of social anxiety without the influence of artificial tasks or prompts. In the study, we invited participants, unfamiliar with one another, to sit in trios in a simulated social scenario. Each participant was positioned to face unknown individuals seated in front, left, and right, replicating real-life situations of encountering strangers. During the 2-minute session, participants were recorded using a dedicated low-cost smartphone camera (approximately $81), which captured their upper body, including the face, shoulders, and chest. This design ensures accessibility and ecological validity, providing valuable insights into behavioral markers of SAD in naturalistic contexts.
A total of 91 university students participated in the study. Behavioral features, such as head position, gaze direction, and facial expressions, were extracted from the recorded videos and used to train classification models for anxiety detection. We developed a multiclass classification model to categorize participants as anxious, neutral, or non-anxious, using self-reported ground truth labels provided by the participants. This multiclass approach was adopted to account for neutral behaviors, which were observed among both SAD and non-SAD participants, and to reduce potential bias in the classification. We also evaluated binary classification performance, as shown in Figure 1, by excluding one category to focus on the distinction between the remaining two classes. For example, in the “anxious versus non-anxious” model, only participants labeled as anxious or non-anxious were included, and the neutral category was excluded.
The key contributions of our work are as follows:
(1) We present a non-intrusive approach for detecting anxiety in everyday settings using a low-cost smartphone camera that can be integrated into daily-use smartphones for continuous monitoring of anxiety, thus prompting early interventions.
(2) We evaluated several machine learning and deep learning models for anxiety detection using facial features. We tested these models on 669 facial features and their subsets. Our results show that the Random Forest model outperformed others in nearly all classification metrics for both multiclass (Accuracy: 91%, F1 score: 0.90, AUC: 0.98) and binary classification (Anxious vs. Neutral: Accuracy 92%, F1 score 0.91, AUC 0.98; Anxious vs. Non-Anxious: Accuracy 92%, F1 score 0.89, AUC 0.98; Neutral vs. Non-Anxious: Accuracy 93%, F1 score 0.94, AUC 0.98).
(3) We identified key features that helped the Random Forest model correctly identify anxious participants and the other classes. For example, larger head rotation along the X-axis, face edge features (such as jawline points), and eye landmarks were important for anxiety detection in AnxietyFaceTrack.
(4) AnxietyFaceTrack contributes to affective computing by showcasing the use of facial features captured through videos for anxiety detection. Low-cost smartphone cameras combined with machine learning models based on facial cues may offer a practical solution for continuously monitoring mental disorders, thus reducing the treatment gap and prompting early interventions. Furthermore, our findings can serve as a baseline for future research conducted in controlled or uncontrolled settings for anxiety detection using facial features.
The paper is structured as follows: Section 2 presents related work on anxiety detection and studies that use videos for mental disorders. Section 3 explains our AnxietyFaceTrack study, participant demographics, ground truth, and the analysis methods used for anxiety detection. Section 4 discusses our results and the ablation study conducted to draw inferences about anxiety detection. Section 4.3 presents the important features that influenced the detection of anxiety, while Section 4.4 examines the bias in the trained models. Section 5 discusses the study’s findings, implications, and limitations, while Section 6 provides the conclusion.
2. Related Works
2.1. Non-verbal Cues of Anxiety Disorder
Over the last five decades, researchers have studied nonverbal communication in mental disorders (Waxer, 1977; Argyle, 1978; Perez and Riggio, 2003; Foley and Gentile, 2010; Schneier et al., 2011; Gilboa-Schechtman and Shachar-Lavie, 2013; Weeks et al., 2019; Asher et al., 2020; Shatz et al., 2024). These studies have found that nonverbal cues can play a significant role in diagnosing mental disorders and contribute to therapeutic processes. For instance, nonverbal signs such as a patient’s appearance, behavior, and eye contact are routinely assessed during psychiatric evaluations as part of the mental status examination (Foley and Gentile, 2010).
Existing research has analyzed video recordings of individuals with anxiety disorders in various scenarios, such as therapy sessions, task performance, video watching, and dyadic conversations. One of the earliest studies on this topic was conducted by Waxer (Waxer, 1977), who analyzed the nonverbal cues of individuals with anxiety. In this study, anxious and non-anxious participants (20 participants: 5 anxious males, 5 non-anxious males, 5 anxious females, and 5 non-anxious females) were videotaped at different times during the admission period. One-minute silent session videos were then shared with 46 senior psychologists, who rated ten behavior cue areas on a 10-point scale ranging from “not anxious at all” to “highly anxious” and described how these features conveyed anxiety. The ten behavior cue areas were the forehead, eyebrows, eyelids, eyes, mouth, head angle, shoulder posture, arm position, torso position, and hands. Further, using linear regression analysis, the hands, eyes, mouth, and torso were identified as key nonverbal indicators of anxiety.
A major focus of recent studies has been on gaze behavior and its relationship with social anxiety. Schneier et al. (Schneier et al., 2011) explored gaze avoidance in individuals with generalized social anxiety disorder, healthy controls, and undergraduate students. Their findings indicate that avoiding eye contact is associated with social anxiety. Similarly, Weeks et al. (Weeks et al., 2011, 2019) conducted multiple studies on behavioral submissiveness in social anxiety. In one study, participants engaged in a role-play task with unfamiliar individuals (confederates), revealing that body collapse and gaze avoidance are linked to social anxiety (Weeks et al., 2011). In another study, Weeks et al. used eye-tracking systems to examine participants watching positive and negative video clips, further identifying gaze avoidance as a prominent marker of SAD (Weeks et al., 2019).
Nonverbal synchrony is another area of investigation in SAD. Asher et al. (Asher et al., 2020) analyzed dyadic conversations between individuals with SAD and non-anxious individuals, finding impaired nonverbal synchrony among those with SAD. Similarly, Shatz et al. (Shatz et al., 2024) examined nonverbal synchrony during diagnostic interviews, showing that individuals with SAD displayed lower levels of synchrony and reduced pacing compared to non-anxious counterparts. An in-depth review by Gilboa et al. (Gilboa-Schechtman and Shachar-Lavie, 2013) provides a comprehensive understanding of nonverbal social cues in SAD, synthesizing findings from various studies and emphasizing the role of nonverbal behaviors in the disorder. The findings from these studies underscore the role of nonverbal cues, such as gaze behavior and body posture, in understanding and diagnosing SAD. However, these studies relied on recorded videos and human inference, highlighting the need for an automated tool.
2.2. Visual Features of Mental Disorders
To the best of our knowledge, Cohn et al. (Cohn et al., 2009) were the first to explore the use of automated visual features from videos for research in mental health detection. They recorded interviews between clinically depressed participants and interviewers. The study used manual and automated facial action coding system (FACS) features as inputs for machine learning, achieving accuracies of 88% with manual features and 79% with automated features in depression detection. Furthermore, ‘AVEC 2011 – The First International Audio/Visual Emotion Challenge’ introduced automated visual features, calculated using dense local appearance descriptors, for affective computing (Schuller et al., 2011). This was done through a workshop challenge that provided an open dataset to the research community. Later, the development of OpenFace (https://cmusatyalab.github.io/openface/), based on FaceNet (Schroff et al., 2015), an advanced deep learning model, offered a unified system for detecting facial features. Subsequently, the updated version, OpenFace 2.0 (Baltrušaitis et al., 2016), emerged as a state-of-the-art computer vision toolkit. It enabled researchers to analyze facial behavior and study nonverbal communication without requiring comprehensive programming knowledge.
Most studies that use visual features for mental health detection have focused on depression and stress (Cohn et al., 2009; Schuller et al., 2011; Valstar et al., 2014; Ringeval et al., 2019, 2017; Valstar et al., 2016, 2013; Wang et al., 2021; Gavrilescu and Vizireanu, 2019; Grimm et al., 2022; Giannakakis et al., 2017; Sun et al., 2022), with limited attention given to anxiety (Wang et al., 2021; Gavrilescu and Vizireanu, 2019; Grimm et al., 2022; Mo et al., 2024; Giannakakis et al., 2017) and even less to SAD (Harshit et al., 2024; Shafique et al., 2022). Giannakakis et al. (Giannakakis et al., 2017) used an open-source model to detect the face region of interest and applied Active Appearance Models (AAM) for emotion recognition and facial expression analysis. In their study, participants were recorded with a video camera with extra lighting while undergoing three experimental phases: a social exposure phase, an emotion recall phase, and a stress/mental task phase. The computed features were then used to detect emotional states related to stress and anxiety. They achieved an average accuracy of 87.72% in stress detection across these phases. Similarly, Sun et al. (Sun et al., 2022) utilized visual features for remote stress detection. Participants attended an online meeting and self-reported their stress levels on a scale of 1 to 10, which served as the ground truth for a binary stress classifier. The study reported an accuracy of 70.00% using motion features (eye and head movements) and 73.75% using facial expressions (action units). In another study, Grimm et al. (Grimm et al., 2022) analyzed participants’ videos captured while they answered open-ended questions. A classifier was trained using GAD-7 scores as ground truth, achieving an area under the curve (AUC) score of 0.71 for the binary classification of anxiety characteristics. Similarly, Gavrilescu et al. (Gavrilescu and Vizireanu, 2019) predicted depression, anxiety, and stress using videos captured with high-end cameras while participants watched emotion-inducing clips. They achieved accuracies of 87.2% for depression, 77.9% for anxiety, and 90.2% for stress.
Existing studies on SAD have predominantly focused on eye gaze. In these studies, participants typically complete a performance task involving interviews or the Trier Social Stress Test (TSST). For example, Shafique et al. (Shafique et al., 2022) used participants’ eye gaze data captured during a 5-minute general conversation with an examiner, covering topics such as introductions, support, and conflict. Their method achieved an accuracy of 80% in detecting the severity of SAD. In another study, Harshit et al. (Harshit et al., 2024) analyzed participants’ eye gaze while they performed a speech task as part of the TSST. Using an autoencoder, they extracted latent feature representations (deep features) from the eye gaze data, which were used as features for machine learning models, and achieved 100% accuracy in detecting participants’ anxiety.
In summary, most existing studies focus on interview-based videos or externally induced anxiety tasks, limiting their relevance to real-world, everyday scenarios. Additionally, non-verbal cues, such as facial expressions and head movements, remain underexplored in detecting SAD. To address these gaps, we designed a study set in a social environment where participants were surrounded by unfamiliar individuals and instructed to remain idle without engaging in any activity. Using a low-cost smartphone camera, we recorded videos of the participants’ faces, extracted facial features, and analyzed them for insights.
3. Methodology
In this section, we present the study design of AnxietyFaceTrack, participants’ demographics, ground truth collection, feature extraction, and classification models.
3.1. Study Design
Participants were recruited from the home institute through an email advertisement, following approval from the Institutional Review Board. A dedicated email was sent to the student community with information about the study and a Google Form for interested participants to fill out. Participants provided demographic information, such as age, gender, current educational program, location of home residence, and preferred time slot. Additionally, participants’ email addresses and phone numbers were collected so that the research assistant (RA) could contact them on the study day.
The day before the study, the RA sent an email to the participants to confirm their availability for a specific time slot. Further, a text message was sent one hour before the study session, confirming the location and time for participation. Three participants were invited to the lab for each study session. The RA ensured that the three participants were unfamiliar and did not know each other. The purpose of inviting three unfamiliar participants was to create a socially anxious situation during the study.
Upon arrival, the participants were seated around a rectangular table with rounded edges. Each session involved three participants and an RA. The participants were labeled as P1, P2, and P3 for each study session. The seating arrangement is shown in Figure 2: P1 was seated to the left of the RA, P2 was directly opposite the RA, and P3 was to the right of the RA, so that P1 and P3 faced each other, while P2 faced the RA. This arrangement ensured that each participant faced an unfamiliar person, creating a socially anxious scenario. The RA explained the study to the participants and distributed the Participant Information Sheet (PIS) and the Informed Consent Form (ICF). After collecting the signed consent forms, the RA obtained permission to start camera recordings. Three individual smartphones, labeled C1, C2, and C3, each with a 13-megapixel back camera, were used to record P1, P2, and P3, respectively (see Figure 2). The “Background Video Recorder” app was selected for its ability to record video even with the screen off, which was not possible with the smartphone’s default camera app. The app was configured to use the highest possible sampling rate of 30 frames per second (FPS).
The RA instructed participants to remain seated for two minutes without interacting with others. Participants were free to look around but were asked to remain idle. After two minutes, the RA concluded the session, distributed a self-reported survey (discussed later), and recorded the session’s start and end times. Finally, the RA thanked the participants and provided refreshments as a token of appreciation for their time and participation.

3.2. Demographics
The participants in our study were students at the authors’ institute. Most participants were male (#58, 63.74%), followed by female (#33, 36.26%). In terms of education status, 71.43% (#65) were undergraduate students, while the remaining 28.57% (#26) were graduate students. Regarding home location, 76.92% (#70) were from urban areas, and 23.08% (#21) were from rural areas. A detailed breakdown of the participants’ demographic information is provided in Table 1.
Category | Count (#) | Percentage (%) | Age (Mean) | Age (SD) | SR Anxiety (Mean) | SR Anxiety (SD)
---|---|---|---|---|---|---
Gender | | | | | |
Female | 33 | 36.26 | 20.94 | 2.73 | 3.27 | 1.13
Male | 58 | 63.74 | 20.59 | 2.13 | 3.29 | 0.97
Education | | | | | |
Graduate | 26 | 28.57 | 23.69 | 2.04 | 3.42 | 1.06
Undergraduate | 65 | 71.43 | 19.52 | 1.06 | 3.23 | 1.01
Home Location | | | | | |
Rural | 21 | 23.08 | 20.95 | 2.56 | 3.86 | 0.96
Urban | 70 | 76.92 | 20.64 | 2.30 | 3.11 | 0.99
Total | 91 | 100 | 20.71 | 2.35 | 3.28 | 1.02
3.3. Ground Truth
The self-reported survey collected during the study was used as the ground truth. The survey included a single question asking participants about their anxiety level during the study session (i.e., sitting idle for 2 minutes). Participants rated their anxiety on a Likert scale from 1 to 5, where 1: Very nervous, 2: Somewhat nervous, 3: Neither relaxed nor nervous, 4: Somewhat relaxed, and 5: Very relaxed. The distribution of self-reported anxiety ratings is shown in Figure 3.
For the ground truth, participants who rated their anxiety as 1 or 2 were labeled as anxious, while those who rated their anxiety as 4 or 5 were labeled as non-anxious. A considerable number of participants rated their anxiety as 3, so instead of grouping these participants into either the anxious or non-anxious categories, they were labeled as neutral.
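As an illustration, the following minimal Python sketch maps the 1–5 Likert ratings to the three ground-truth classes described above; the column name sr_anxiety and the example data are assumptions for illustration only.

```python
import pandas as pd

def rating_to_label(rating: int) -> str:
    """Map self-reported rating: 1-2 -> anxious, 3 -> neutral, 4-5 -> non-anxious."""
    if rating <= 2:
        return "anxious"
    if rating == 3:
        return "neutral"
    return "non-anxious"

# Hypothetical survey responses (participant IDs and column name are illustrative)
survey = pd.DataFrame({"participant": ["P1", "P2", "P3"], "sr_anxiety": [2, 3, 5]})
survey["label"] = survey["sr_anxiety"].apply(rating_to_label)
print(survey)
```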

3.4. Feature Extraction
During data processing, the recorded videos had an average duration of 124.21 seconds and a mean sampling rate of 20.02 FPS. To ensure uniformity, we selected the first 90 seconds of each video for analysis, as participants generally exhibited reduced anxiety over time. Videos shorter than 90 seconds were excluded, while longer videos were trimmed to the initial 90 seconds. Data from six participants were lost due to issues such as delayed recording initiation by the research assistant or technical problems, including incomplete recordings caused by memory errors or full storage. Ultimately, data from 85 participants were retained for further analysis. This approach minimized data loss while maintaining a consistent dataset.
To extract features from participants’ videos, we used the open-source OpenFace toolkit (https://github.com/TadasBaltrusaitis/OpenFace) (Baltrušaitis et al., 2016). OpenFace is a well-validated tool for facial analysis tasks and has been widely used in various behavioral studies, including depression and anxiety detection. The toolkit processes video inputs and generates time-series data consisting of 714 features. In our analysis, we excluded metadata features (#5) and rigidity features (#40) due to their limited relevance in the existing literature and lack of interpretability. The remaining 669 features used in the analysis are described in Table 2.
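For context, the hedged sketch below shows one way to invoke OpenFace from Python to obtain the per-frame feature CSV for a single recording; the file paths and output layout are illustrative assumptions, and the FeatureExtraction binary must be built and available on the system path.

```python
import subprocess
from pathlib import Path

def extract_openface_features(video_path: str, out_dir: str = "openface_out") -> Path:
    """Run OpenFace FeatureExtraction on one video and return the expected CSV path."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # FeatureExtraction writes a <video_stem>.csv containing gaze, pose,
    # landmark, and action-unit columns for every frame.
    subprocess.run(["FeatureExtraction", "-f", video_path, "-out_dir", out_dir],
                   check=True)
    return Path(out_dir) / (Path(video_path).stem + ".csv")

# Hypothetical usage for one session recording
csv_path = extract_openface_features("P1_session01.mp4")
print(f"OpenFace features written to {csv_path}")
```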
Following the methodologies proposed by Bhatti et al. (Bhatti et al., 2024) and Schmidt et al. (Schmidt et al., 2018), the data was prepared for classification through two key steps: (i) Chunking: the 669 OpenFace features were segmented using non-overlapping 10-second windows to create data chunks. (ii) Flattening: for each chunk, the mean values of all features were computed along the time dimension, yielding a single 669-dimensional feature vector per chunk.
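A minimal sketch of this chunking and flattening step is given below, assuming a per-frame OpenFace DataFrame with a timestamp column in seconds; the column names are illustrative.

```python
import pandas as pd

WINDOW_SEC = 10   # non-overlapping window length
CLIP_SEC = 90     # only the first 90 seconds of each video are used

def chunk_and_flatten(frames: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Average the 669 OpenFace features within each 10-second chunk."""
    frames = frames[frames["timestamp"] < CLIP_SEC]
    chunk_id = (frames["timestamp"] // WINDOW_SEC).astype(int)
    # One 669-dimensional sample (mean over the time dimension) per chunk
    return frames.groupby(chunk_id)[feature_cols].mean().reset_index(drop=True)
```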
This process produced a dataset of 1,173 samples. Ground truth labels were then associated with the prepared dataset, resulting in the following class distribution: anxious (314 samples), neutral (384 samples), and non-anxious (475 samples). Additional details on sample distributions are provided in Table 3.
Feature | Feature Set | # | Description |
---|---|---|---|
Eye | gaze 2D landmarks 3D landmarks | 8 112 168 | Gaze represents the direction in which an individual is looking, using 3D vector world coordinates for both the left (gaze_0_x, gaze_0_y, gaze_0_z) and right (gaze_1_x, gaze_1_y, gaze_1_z) eyes. It also includes the gaze direction angle (gaze_angle_x, gaze_angle_y), averaged for both eyes, indicating whether a person looks left-right or up-down. Eye landmarks (pupil and eyelids) represent the position of landmarks around the eye region in both 2D (eye_lmk_x_0, eye_lmk_x_1,… eye_lmk_x55, eye_lmk_y_1,… eye_lmk_y_55) and 3D (eye_lmk_X_0, eye_lmk_X_1,… eye_lmk_X55, eye_lmk_Y_0,… eye_lmk_Z_55) coordinates. |
Head Pose | location rotation | 3 3 | Pose location (pose_Tx, pose_Ty, pose_Tz) represents the position of the head relative to the camera along the X, Y, and Z axes, while the pose angle (pose_Rx, pose_Ry, pose_Rz) represents the rotation of the head along these axes. The rotations are referred to as pitch (X axis), yaw (Y axis), and roll (Z axis). |
Face Landmarks | 2D landmarks 3D landmarks | 136 204 | Face landmarks represent 68 key positions on the face. These positions include the jawline (i.e., face edge) with 17 points (0 to 16), eyebrows for the left and right eyes with 10 points (17 to 26), the nose bridge and tip with 9 points (27 to 35), eyes (left and right) with 12 points (36 to 47), and the mouth, consisting of the upper lip, lower lip, and corners, with 20 points (48 to 67). The landmarks are present in both 2D (x_0, x_1, … x_66, x_67, y_0,…y_67) and 3D (X_0, … X_67, Y_0,…Y_67, Z_0,…Z_67) coordinates. |
Facial Action Units (AUs) | intensity presence | 17 18 | AUs represent facial expressions using the Facial Action Coding System based on muscle movements. AUs are described in terms of presence (AU#_c) and intensity (AU#_r). Presence indicates whether the AU is visible on the face, while intensity refers to how strong the AU is, measured on a scale of 1 to 5. |
Category | Anxious (#) | Neutral (#) | Non-anxious (#)
---|---|---|---
Gender | |||
Female | 128 | 146 | 166 |
Male | 186 | 238 | 309 |
Education | |||
Graduate | 69 | 114 | 139 |
Undergraduate | 245 | 270 | 336 |
Home Location | |||
Rural | 29 | 65 | 163 |
Urban | 285 | 312 | 319 |
Total | 314 | 384 | 475 |
3.5. Anxiety Classification
3.5.1. Machine Learning (ML)
We used various classification models, each leveraging distinct strengths and methodologies, to identify the most effective model for anxiety classification using facial features. Logistic Regression (LR) (Hosmer Jr et al., 2013) was utilized for its simplicity and effectiveness in binary classification tasks. K-Nearest Neighbors (KNN) (Zhang, 2016) was included to assess the potential of proximity-based classification. Support Vector Machines (SVM) (Brereton and Lloyd, 2010) were selected for their capability to handle high-dimensional feature spaces effectively. Decision Trees (DT) (Song and Ying, 2015) offered interpretability, allowing for a better understanding of feature contributions, while Random Forests (RF) (Breiman, 2001) provided robustness against noise and the ability to capture complex feature interactions.
3.5.2. Deep Learning (DL)
We also applied deep learning models to the extracted facial features for anxiety classification. Specifically, we used a multilayer perceptron (MLP) (Alnuaim et al., 2022) and a one-dimensional convolutional neural network (1D CNN) (Kiranyaz et al., 2021). Deep learning models were chosen for their ability to learn complex patterns. However, we did not use other, more advanced deep learning architectures, as these require large amounts of training data and we had a limited dataset; moreover, such models are computationally expensive, which limits the practicality of our use case. The MLP model used in this paper had an input layer connected to two dense layers with 64 and 32 neurons, followed by an output layer. Similarly, the 1D CNN had four convolutional layers with 64, 128, 256, and 128 filters. The Adam optimizer was used for both models, with categorical cross-entropy as the loss function.
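A hedged sketch of the two architectures is shown below; the layer sizes follow the description above, while details not stated in the text (activation functions, kernel size, pooling) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATURES, N_CLASSES = 669, 3

def build_mlp() -> tf.keras.Model:
    """MLP: input -> Dense(64) -> Dense(32) -> softmax output."""
    model = models.Sequential([
        layers.Input(shape=(N_FEATURES,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_1d_cnn() -> tf.keras.Model:
    """1D CNN: four Conv1D blocks with 64, 128, 256, 128 filters (assumed kernel size 3)."""
    model = models.Sequential([layers.Input(shape=(N_FEATURES, 1))])
    for filters in (64, 128, 256, 128):
        model.add(layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu"))
    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(N_CLASSES, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```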
3.5.3. Ablation Studies
We conducted three ablation studies to analyze the performance and impact of different features and classification tasks: (i) Ablation Study 1 identified the most impactful features for anxiety classification using Random Forest feature importance. (ii) Ablation Study 2 assessed the effectiveness of the various feature categories (see Table 2) to determine which category contributed most significantly to the classification. (iii) Ablation Study 3 evaluated the classification model’s performance on three binary classification tasks: anxious vs. non-anxious, anxious vs. neutral, and neutral vs. non-anxious.
3.5.4. Evaluation
To evaluate the trained classification models, we used a 5-fold cross-validation approach (Kohavi et al., 1995). This approach ensures that the model is trained on different subsets of the data in each iteration and tested on unseen data, providing a more reliable and robust evaluation compared to a single train-test strategy. Further, to assess classification performance, we used evaluation metrics, including accuracy, precision, recall, F1-score, and area under the curve (AUC) (Sokolova and Lapalme, 2009). Accuracy, precision, and recall range from 0 to 100%, while the F1-score and AUC range from 0 to 1. Higher values indicate better performance. These metrics were computed for each fold and then averaged across all five folds.
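The following minimal sketch illustrates this evaluation protocol with scikit-learn, using the Random Forest classifier as an example; the data here are random placeholders standing in for the flattened feature matrix and labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Placeholders for the 1,173 x 669 feature matrix and the three-class labels
X = np.random.rand(1173, 669)
y = np.random.randint(0, 3, size=1173)

scoring = {
    "accuracy": "accuracy",
    "precision": "precision_macro",
    "recall": "recall_macro",
    "f1": "f1_macro",
    "auc": "roc_auc_ovr",
}
cv_results = cross_validate(RandomForestClassifier(random_state=42),
                            X, y, cv=5, scoring=scoring)
for metric in scoring:
    # Average each metric across the five folds
    print(f"{metric}: {cv_results['test_' + metric].mean():.2f}")
```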
4. Results
In this section, we present the outcomes of our analysis, including the predictive capabilities of the ML and DL classification models used for the multiclass problem with facial features. Additionally, we discuss the results of the ablation studies conducted in three scenarios: (i) classification using the most important features selected via Random Forest feature importance, (ii) classification using different feature categories, and (iii) binary classification problems.
4.1. Classification Performance
In our analysis, we used classical ML and DL classification models to evaluate the ability of AnxietyFaceTrack to detect anxiety in a lab setting without introducing additional anxiety-provoking situations. Table 4 presents the performance metrics for all the classification models used. Specifically, Table 4 shows each class’s average precision and recall to assess the model’s performance for each class. In the case of ML, we found that KNN and RF performed well, while LR showed the poorest performance in terms of accuracy. Further inspection of the other metrics revealed that RF and KNN achieved almost identical average precision and recall. However, when focusing on the “anxious” label, RF outperformed KNN, achieving higher average precision for this label.
Moreover, Table 4 provides additional insights. Firstly, RF outperformed DT across all metrics, suggesting that the single-tree structure of DT suffered from overfitting, whereas the ensemble approach of RF (using multiple trees) effectively handled high-dimensional data without overfitting. Secondly, RF outperformed LR on all metrics. This is likely because LR relies on linear decision boundaries, which failed to capture the complex and non-linear patterns in the data, while RF could handle these complexities. The overall best performance of RF in multiclass classification inspired us to conduct an ablation study to assess its effectiveness in binary classification scenarios (see Section 4.2.3), such as distinguishing between “anxious” and “non-anxious” participants.
In the case of DL models, the 1D CNN outperformed the MLP on most evaluation metrics. The 1D CNN achieved an average accuracy of 84%, compared to 80% for the MLP. Although these DL models performed better than most used ML models, they still lagged behind the RF and KNN. This could be due to the ability of machine learning models to learn effectively with smaller datasets, while deep learning models require larger amounts of data.
Given the overall best performance of RF across all evaluation metrics, we will now use the RF model in various ablation studies, which are discussed later in the paper.
Clf. | Acc. | Anxious (Pr., Re.) | Neutral (Pr., Re.) | Non-anx. (Pr., Re.) | F1 Score | AUC
---|---|---|---|---|---|---
LR | 0.65 | (0.64, 0.60) | (0.68, 0.69) | (0.65, 0.67) | 0.65 | 0.83 |
KNN | 0.86 | (0.81, 0.91) | (0.89, 0.84) | (0.89, 0.86) | 0.86 | 0.96 |
SVM | 0.73 | (0.68, 0.61) | (0.76, 0.76) | (0.73, 0.79) | 0.72 | 0.90 |
DT | 0.75 | (0.71, 0.70) | (0.75, 0.74) | (0.77, 0.79) | 0.74 | 0.81 |
RF | 0.88 | (0.86, 0.86) | (0.87, 0.88) | (0.90, 0.88) | 0.88 | 0.97 |
MLP | 0.80 | (0.75, 0.72) | (0.82, 0.87) | (0.81, 0.80) | 0.80 | 0.94 |
1D CNN | 0.84 | (0.81, 0.81) | (0.87, 0.81) | (0.83, 0.87) | 0.83 | 0.95 |
AS | Feature | Feature Set | # | Acc. | Pr. | Re. | F1 | AUC
---|---|---|---|---|---|---|---|---|
AS 1 | ALL | Top 10% | 67 | 0.90 | 0.90 | 0.90 | 0.90 | 0.98 |
Top 20% | 134 | 0.91 | 0.90 | 0.91 | 0.90 | 0.98 | ||
AS 2 | Eye | gaze | 8 | 0.53 | 0.52 | 0.51 | 0.52 | 0.72 |
2D landmarks | 112 | 0.70 | 0.70 | 0.69 | 0.69 | 0.86 | ||
3D landmarks | 168 | 0.74 | 0.73 | 0.73 | 0.73 | 0.9 | ||
Combined | 0.79 | 0.79 | 0.79 | 0.79 | 0.93 | |||
Head Pose | location | 3 | 0.81 | 0.80 | 0.80 | 0.80 | 0.93 | |
rotation | 3 | 0.56 | 0.56 | 0.54 | 0.55 | 0.74 | ||
Combined | 0.85 | 0.85 | 0.85 | 0.85 | 0.96 | |||
Face Landmark | 2D landmarks | 136 | 0.81 | 0.81 | 0.81 | 0.81 | 0.94 | |
3D landmarks | 204 | 0.88 | 0.88 | 0.87 | 0.87 | 0.97 | ||
Combined | 0.87 | 0.87 | 0.87 | 0.87 | 0.97 | |||
Facial Action Units | intensity | 17 | 0.58 | 0.58 | 0.57 | 0.57 | 0.78 | |
presence | 18 | 0.59 | 0.6 | 0.57 | 0.58 | 0.76 | ||
Combined | 0.66 | 0.67 | 0.65 | 0.65 | 0.84 |
4.2. Ablation Studies
4.2.1. Ablation Study 1
In this analysis, we aimed to understand the role of feature importance in anxiety classification. We identified the most important features using the feature importance module of the Scikit-learn Random Forest implementation. We then trained and tested the model using 5-fold cross-validation, incrementally selecting the top 10%, 20%, and up to 90% of the features ranked by importance (see Table 5). Our findings revealed that using only the top 10% of important features resulted in an accuracy of 90%, which was 2% higher than the accuracy achieved using all features. Increasing to the top 20% of important features further improved accuracy to 91%. Notably, the accuracy remained constant when using 30% and 40% of the features but began to decline after that, with a drop of just 1–2%. This analysis demonstrates that a small fraction of the features (the top 67–134, i.e., 10–20% of the total 669) achieves the highest accuracy, significantly reducing the model’s complexity while maintaining good performance (see Table 5). This finding underscores the effectiveness of feature selection in optimizing classification models for anxiety detection.
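A minimal sketch of this procedure is given below; for brevity, feature importances are computed on the full dataset before cross-validation, which simplifies the per-fold protocol and is intended only as an illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def top_fraction_accuracy(X: np.ndarray, y: np.ndarray, fraction: float) -> float:
    """Keep the top `fraction` of features by RF importance, then re-run 5-fold CV."""
    rf = RandomForestClassifier(random_state=42).fit(X, y)
    n_keep = max(1, int(fraction * X.shape[1]))        # e.g. 67 features at 10% of 669
    top_idx = np.argsort(rf.feature_importances_)[::-1][:n_keep]
    return cross_val_score(RandomForestClassifier(random_state=42),
                           X[:, top_idx], y, cv=5).mean()

# Example usage (X, y as in the evaluation sketch above):
# for frac in (0.1, 0.2, 0.3):
#     print(frac, top_fraction_accuracy(X, y, frac))
```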
4.2.2. Ablation Study 2
In this ablation study, we evaluated each feature set listed in Table 2 to assess its effectiveness in the anxiety classification task. We selected RF as the classification model for this ablation study due to its superior performance compared to other models in the previous analyses. The model was first trained and tested on individual feature sets and then on combinations of feature sets within individual facial regions (e.g., combining location and rotation for the head pose). Table 5 summarizes the results obtained from 5-fold cross-validation, with the classification metrics averaged across all folds.
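To make the feature grouping concrete, the sketch below shows one way to select the Table 2 feature sets from the OpenFace output by column-name prefix; the grouping keys are our own illustrative names, and the prefixes follow the OpenFace output format.

```python
import pandas as pd

def select_feature_set(df: pd.DataFrame, set_name: str) -> pd.DataFrame:
    """Return only the columns belonging to one feature set from Table 2."""
    prefixes = {
        "gaze": ("gaze_",),                       # gaze vectors and angles
        "eye_landmarks": ("eye_lmk_",),           # 2D/3D eye-region landmarks
        "head_pose": ("pose_T", "pose_R"),        # head location and rotation
        "face_landmarks": ("x_", "y_", "X_", "Y_", "Z_"),  # 68-point face landmarks
        "action_units": ("AU",),                  # AU intensity and presence
    }[set_name]
    cols = [c for c in df.columns if c.startswith(prefixes)]
    return df[cols]
```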
For individual feature sets, the highest accuracy (88%) was achieved using 3D facial landmarks, while the lowest scores were observed for eye gaze (53%) and pose rotation (56%) features. Among the combined feature sets based on facial regions, the highest accuracy was achieved for the “face landmark” (87%) region, followed by “head pose” (85%) with just a 2% difference in accuracy. In contrast, the lowest performance was recorded for facial action units (66%).
Interestingly, compared to the earlier analysis where all 669 features were used, the 3D landmarks achieved similar performance with just 204 features. Notably, the pose feature set, consisting of only six features, achieved 85% accuracy, which is just 3% lower than the highest-performing set. This highlights the potential of reduced feature sets for maintaining strong classification performance while minimizing computational complexity.
4.2.3. Ablation Study 3
In this analysis, we aimed to understand how the classification model performs in binary classification using facial features for anxiety and non-anxiety detection. We conducted binary classifications between “Anxious” and “Non-Anxious” by excluding the neutral participants. Similarly, we performed classifications for “Anxious” versus “Neutral” and “Neutral” versus “Non-Anxious”. Our results showed that the RF model outperformed other classification models. Table 6 presents the classification metrics obtained from binary classification across different feature sets. From Table 6, we identified several interesting observations that offer insights into comparing multiclass and binary classification and the utility of individual feature categories. First, we found that variations in accuracy and classification metrics in binary classification were consistent with those observed in multiclass classification for different feature sets. For instance, pose rotation features performed poorly across all binary classification cases but showed improved performance when combined with pose location features. This pattern was also observed in multiclass classification. Second, similar to multiclass classification, 3D facial landmarks were the most effective feature set for binary classification. This suggests that facial landmarks are crucial for detecting anxiety and non-anxiety. Third, the binary classification of “Neutral” versus “Non-Anxious” achieved the best performance, with the highest accuracy of 93%.
Feature | Feature Set | Anxious versus Neutral | Anxious versus non-Anxious | Neutral versus non-Anxious | ||||||||||||||
Acc. | Pr. | Re. | F1 | AUC | Acc. | Pr. | Re. | F1 | AUC | Acc. | Pr. | Re. | F1 | AUC | ||||
Eye | gaze | 0.68 | 0.66 | 0.62 | 0.64 | 0.74 | 0.65 | 0.57 | 0.49 | 0.52 | 0.68 | 0.65 | 0.68 | 0.7 | 0.69 | 0.73 | ||
2D landmarks | 0.81 | 0.8 | 0.78 | 0.79 | 0.91 | 0.78 | 0.72 | 0.72 | 0.72 | 0.87 | 0.76 | 0.78 | 0.79 | 0.79 | 0.85 | |||
3D landmarks | 0.83 | 0.8 | 0.81 | 0.81 | 0.91 | 0.86 | 0.84 | 0.79 | 0.81 | 0.93 | 0.82 | 0.83 | 0.85 | 0.84 | 0.9 | |||
Combined | 0.85 | 0.85 | 0.82 | 0.83 | 0.94 | 0.86 | 0.84 | 0.81 | 0.82 | 0.94 | 0.85 | 0.87 | 0.85 | 0.86 | 0.93 | |||
Head Pose | location | 0.88 | 0.87 | 0.85 | 0.86 | 0.95 | 0.87 | 0.85 | 0.84 | 0.84 | 0.94 | 0.86 | 0.87 | 0.88 | 0.88 | 0.93 | ||
rotation | 0.72 | 0.69 | 0.69 | 0.69 | 0.78 | 0.7 | 0.66 | 0.55 | 0.59 | 0.75 | 0.67 | 0.7 | 0.72 | 0.71 | 0.73 | |||
Combined | 0.91 | 0.92 | 0.87 | 0.89 | 0.98 | 0.88 | 0.88 | 0.83 | 0.85 | 0.96 | 0.9 | 0.9 | 0.91 | 0.91 | 0.97 | |||
Face Landmark | 2D landmarks | 0.88 | 0.88 | 0.84 | 0.86 | 0.96 | 0.86 | 0.82 | 0.82 | 0.82 | 0.94 | 0.86 | 0.88 | 0.86 | 0.87 | 0.94 | ||
3D landmarks | 0.91 | 0.9 | 0.89 | 0.9 | 0.97 | 0.9 | 0.89 | 0.87 | 0.88 | 0.97 | 0.92 | 0.91 | 0.94 | 0.93 | 0.97 | |||
Combined | 0.91 | 0.91 | 0.89 | 0.9 | 0.97 | 0.9 | 0.89 | 0.86 | 0.87 | 0.97 | 0.92 | 0.92 | 0.94 | 0.93 | 0.97 | |||
Facial Action Units | intensity | 0.73 | 0.71 | 0.67 | 0.69 | 0.8 | 0.72 | 0.71 | 0.51 | 0.59 | 0.8 | 0.66 | 0.68 | 0.75 | 0.71 | 0.74 | ||
presence | 0.73 | 0.73 | 0.65 | 0.69 | 0.79 | 0.72 | 0.72 | 0.5 | 0.59 | 0.79 | 0.73 | 0.73 | 0.81 | 0.77 | 0.77 | |||
Combined | 0.79 | 0.8 | 0.71 | 0.75 | 0.87 | 0.74 | 0.76 | 0.5 | 0.6 | 0.85 | 0.73 | 0.73 | 0.83 | 0.77 | 0.82 | |||
ALL | Top 10% | 0.91 | 0.92 | 0.87 | 0.9 | 0.98 | 0.9 | 0.89 | 0.86 | 0.87 | 0.97 | 0.92 | 0.93 | 0.93 | 0.93 | 0.97 | ||
Top 20% | 0.92 | 0.93 | 0.9 | 0.91 | 0.98 | 0.92 | 0.91 | 0.88 | 0.89 | 0.98 | 0.93 | 0.94 | 0.94 | 0.94 | 0.98 | |||
Top 20% | 0.92 | 0.92 | 0.9 | 0.91 | 0.98 | 0.91 | 0.9 | 0.89 | 0.89 | 0.98 | 0.93 | 0.94 | 0.94 | 0.94 | 0.98 |
4.3. Feature Importance in Anxiety Classification
In this work, we use facial video features for anxiety detection, so it is important to understand which facial features are linked to anxiety. To explain the results, we conducted a post-hoc analysis of the classification model using SHapley Additive exPlanations (SHAP) (Lundberg, 2017). SHAP quantifies the contribution of each feature to the model’s prediction using game-theoretic Shapley values, providing insight into how each feature affects the model’s decision-making. We used a Random Forest model trained on the full feature set to identify the important features.
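A minimal sketch of this post-hoc step is shown below; the data are random placeholders standing in for the OpenFace feature matrix, and because the shape of the returned SHAP values varies across SHAP versions, the global ranking is computed in a version-agnostic way.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the 669-dimensional OpenFace feature matrix
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 3, size=300)          # anxious / neutral / non-anxious
feature_names = np.array([f"feat_{i}" for i in range(X.shape[1])])

rf = RandomForestClassifier(random_state=42).fit(X, y)
explainer = shap.TreeExplainer(rf)
shap_values = np.asarray(explainer.shap_values(X))   # per-class Shapley values

# Global importance: mean |SHAP| per feature, averaged over samples and classes.
# Axis order differs across SHAP versions, so reduce every axis except the feature axis.
feature_axis = [i for i, s in enumerate(shap_values.shape) if s == X.shape[1]][0]
other_axes = tuple(i for i in range(shap_values.ndim) if i != feature_axis)
importance = np.abs(shap_values).mean(axis=other_axes)
top10 = np.argsort(importance)[::-1][:10]
print(dict(zip(feature_names[top10], np.round(importance[top10], 4))))
```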
4.3.1. Multiclass
Figure 4 shows the top ten important features for multiclass classification. We observe that features like the face edge (Y_1, Y_4, Y_11, Y_15, Y_16, Z_39), right eye (eye_lmk_Y_43, gaze_1_y, gaze_1_z), and head position angle (pose_Rx) are significant for multiclass classification. Further, Figure 5 highlights how these top ten features from Figure 4 influence the classification for each class (Figure 5(a) for anxious, Figure 5(b) for neutral, and Figure 5(c) for non-anxious). From Figure 5(a), we notice that larger values of the face edge (Y_11, Y_16) and larger values of head position pitch (pose_Rx) push the model towards predicting the anxious class.
Figure 6 presents the top ten important features of individual classes that influence the model’s predictions. For instance, we observe that larger values of head position pitch (pose_Rx), face edge (Y_4, Y_16), and left eye landmark (x_3) push the model towards predicting the anxious class. On the other hand, smaller values of the remaining features (see Figure 6(a)) push the model away from predicting the anxious class. Similarly, we find that larger values of right eye gaze direction (gaze_1_z, gaze_1_y) and left cheek face contour (Y_1, Y_4) push the model towards predicting the neutral class (see Figure 6(b)). For the non-anxious class, smaller values of most features (see Figure 6(c)) push the model away from predicting the non-anxious class.







4.3.2. Binary Class
Figures 7(a), 7(b), and 7(c) show the top ten features for binary classification in three scenarios: anxious versus neutral, anxious versus non-anxious, and neutral versus non-anxious, respectively. We observe that face edges contribute the most to classifications involving anxious versus non-anxious and anxious versus neutral. In contrast, eye-related features dominate in the classification of neutral versus non-anxious. Figures 7(d), 7(e), and 7(f) illustrate the direction of influence for the features shown in Figures 7(a), 7(b), and 7(c), respectively. Figure 7(d) shows the influence of features on the anxious class in the anxious versus neutral classification. We observe that larger values of eye landmarks (i.e., eye_lmk_x_38, eye_lmk_x_10) and larger face edge values (i.e., x_16, Y_4, x_4) push the model towards predicting the anxious class. Conversely, smaller values of other facial landmarks steer the model away from the anxious class, thereby favoring the neutral class. Figure 7(e) shows the influence of features for the anxious class in the anxious versus non-anxious classification. Larger values of facial landmarks, such as facial edge features (i.e., Y_0, x_13) and the right eyebrow (i.e., Z_25), push the model towards predicting the anxious class. In contrast, higher values of the remaining facial landmarks influence the model toward predicting the non-anxious class. Figure 7(f) focuses on the influence of features for the non-anxious class in the neutral versus non-anxious classification. Larger values of eye landmarks (i.e., eye_lmk_Y_51) and facial landmarks (i.e., Y_15, x_12) push the model towards predicting the non-anxious class. However, smaller values of eye gaze direction for the left and right eyes (i.e., gaze_0_y, gaze_0_z, gaze_1_y, gaze_1_z), smaller gaze angles (i.e., gaze_angle_y - looking up and down), and smaller left upper cheek values (i.e., Y_1, Y_3) steer the model away from the non-anxious class, favoring the neutral class instead.






4.4. Bias Investigation
The literature suggests that ML models can be biased by factors such as gender and age (Cheong et al., 2023; Chu et al., 2023). Moreover, AnxietyFaceTrack uses facial features, which can vary with gender and age (Bannister et al., 2022). This highlights the need to assess whether the trained Random Forest model is biased with respect to these factors. To investigate potential bias, we split our test data into two gender groups: male and female. For age, we categorized the test data into undergraduate and graduate groups. Additionally, we examined bias based on participants’ home locations by dividing the test data into rural and urban groups.
Figure 8 shows the performance results of our Random Forest model, revealing several key observations. First, for gender (see Figure 8(a)), the classification metrics were higher for females, with a difference of 5–7% compared to males. This suggests that AnxietyFaceTrack works better for females, possibly because females tend to display emotions more prominently through facial expressions than males (Parkins, 2012; Fischer and LaFrance, 2015; Kring and Gordon, 1998). Second, for education level (see Figure 8(b)), we observed about a 10% difference in classification metrics between graduate and undergraduate participants, indicating that the model was better at learning the more subtle behaviors of graduate participants. Lastly, for location (see Figure 8(c)), the model performed similarly for both rural and urban groups. However, precision was higher for the rural group: upon closer inspection of the precision for individual labels, we found that precision for rural participants was above 95%, while for urban participants it was around 85%. Our analysis of these biases aims to increase transparency in ML models for anxiety detection and provides valuable insights for future research.
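As an illustration of this subgroup analysis, the sketch below computes metrics separately for each demographic group in held-out data; the column names (e.g., gender, education, home_location) and the alignment of predictions with metadata are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def subgroup_report(test_meta: pd.DataFrame, y_true: np.ndarray,
                    y_pred: np.ndarray, group_col: str) -> pd.DataFrame:
    """Per-group metrics; test_meta rows are aligned positionally with y_true/y_pred."""
    rows = []
    for group, idx in test_meta.groupby(group_col).indices.items():
        rows.append({
            group_col: group,
            "accuracy": accuracy_score(y_true[idx], y_pred[idx]),
            "precision": precision_score(y_true[idx], y_pred[idx],
                                         average="macro", zero_division=0),
            "recall": recall_score(y_true[idx], y_pred[idx],
                                   average="macro", zero_division=0),
        })
    return pd.DataFrame(rows)

# Example usage: subgroup_report(meta, y_true, y_pred, "gender")
```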



5. Discussion
In this work, we designed a study, AnxietyFaceTrack, to detect anxiety in participants using facial videos recorded during a social scenario. The study aims to contribute to developing unobtrusive mental health assessment tools. The results of our AnxietyFaceTrack study provide valuable insights into anxiety detection using facial features and machine learning models. Our analysis shows that the Random Forest model, trained on either all 669 features or only the 3D facial landmark features (#204), achieved the best overall classification performance, with an accuracy of 88%. Furthermore, in the ablation study, we found that using just the top 20% of important features, determined by Random Forest feature importance, yielded the highest accuracy of 91% for multiclass classification. Interestingly, we observed that using only six head pose features (i.e., location and rotation) achieved an average accuracy of 85%, which was just 6% lower than the best-performing feature set in correctly identifying anxious, neutral, and non-anxious states. For binary classification, the Random Forest model also performed best, achieving average accuracies of 92% for anxious versus neutral, 91% for anxious versus non-anxious, and 93% for neutral versus non-anxious. In summary, the multiclass and binary classification results for anxious, neutral, and non-anxious states are promising, highlighting the potential of using facial features for accurate anxiety detection.
Our ablation analyses yielded several key insights. First, we found that anxiety detection in both multiclass and binary classifications achieved similar performance metrics, suggesting that anxious participants can be effectively distinguished from non-anxious and neutral individuals when placed in socially anxious situations. Second, eye gaze performed the worst in both multiclass and binary classifications, even though socially anxious individuals have been observed to show gaze-avoidance behavior compared to non-anxious individuals (Schneier et al., 2011; Chen et al., 2023). Third, individual feature sets struggled in multiclass classification but performed better in binary classification. For example, the action unit features achieved only 59% accuracy in multiclass classification but an average of 73% accuracy in binary classification. This suggests there may be some overlap in facial expressions or movements between anxious, non-anxious, and neutral participants. Fourth, the 3D facial landmarks and the combined head pose feature set performed the best in both multiclass and binary classification tasks.
Post hoc analysis using SHAP plots provided several insights. For example, facial edges (i.e., facial landmarks 0, 1, 3, 4, 11, 12, 13, 15, 16) and eye landmarks were identified as important features in anxiety detection. Larger values of these features influenced the classification model toward the anxious class. Interestingly, eye gaze features favored the neutral class in both multiclass and binary classifications. Furthermore, similar to the findings of Nepal et al. (Nepal et al., 2024), we found that 3D landmarks and head pose performed the best compared to other feature sets for anxiety detection. This alignment suggests that these feature sets could be used to develop mental disorder assessment tools.
Furthermore, our investigation into biases in the anxiety classification model revealed several insights. Looking at the classification metrics, we found that the model performed better for female and graduate participants, even though females made up only 36.26% of the sample and graduate students 28.57%. Despite these proportions, the model was able to learn discriminating patterns more effectively for female and graduate participants. However, upon closer examination, we found that the precision for the anxious class was low for both male and female participants, while the recall for the female anxious class was higher than for males, suggesting that the model was better at capturing anxious patterns in female participants. Another key finding was that the results were similar for rural and urban participants, even though rural participants made up only 23.08% of the sample. This suggests that, regardless of whether participants grew up in rural or urban areas, the model does not favor one group over the other. These findings on biases provide crucial insights for future research, particularly research involving facial features and machine learning for anxiety detection.
In conclusion, our AnxietyFaceTrack study, which uses facial features extracted from low-cost smartphone camera videos and machine learning for anxiety detection, shows promise as an unobtrusive and continuous approach to mental health assessment, specifically for detecting anxiety.
5.1. Implications
Early detection is crucial for enabling timely interventions and promoting recovery for mental disorders (Colizzi et al., 2020). This study used facial videos captured through a low-cost smartphone camera for anxiety detection. Our promising results highlight the potential of smartphone-recorded facial videos and machine learning for the early detection of anxiety. This innovative approach can complement existing mental health assessments. Although our study was conducted in a controlled setting, the results pave the way for future research to explore our methodology in real-world settings, leading to a better understanding of anxiety and its early detection in fully naturalistic settings.
Furthermore, our use of low-cost smartphone cameras opens the possibility for anxiety detection through facial features to be feasibly integrated into everyday settings. This technology has the potential to be incorporated into smartphones, enabling early detection and monitoring of anxiety. Extending this work could also assist mental health professionals in routine assessments and interventions, helping to reduce the mental healthcare gap, especially in low-income and developing countries.
5.2. Limitations
The AnxietyFaceTrack study provides valuable insights into an unobtrusive mental health assessment tool for anxiety detection. However, it has certain limitations. First, the study was conducted in a controlled setting, which may limit the generalizability of the findings to larger, real-world settings. Nevertheless, the findings can serve as a baseline for future research conducted in either controlled or uncontrolled settings for anxiety detection. Second, the dataset size is limited due to the small number of participants. However, it is worth noting that the machine-learning models were still able to identify patterns associated with anxiety. Finally, our study used a single self-reported survey question to create the ground truth, which may be subject to recall bias. Future studies could incorporate multiple questionnaires to reduce potential bias in participants’ responses.
6. Conclusion
Through the AnxietyFaceTrack study, we demonstrated the potential of leveraging facial videos recorded using low-cost smartphone cameras and machine learning to detect anxiety in unstaged social settings. Our findings contribute to the growing field of non-intrusive mental health assessment, offering a scalable and accessible solution that seamlessly integrates into everyday smartphone usage.
The study involved 91 participants, with facial videos recorded in a controlled environment simulating a social setting. Facial features extracted from these recordings were used to train both multiclass and binary classification models. The results were promising, with the multiclass model achieving an accuracy of 91% for distinguishing between anxious, neutral, and non-anxious states. Similarly, the binary classification models achieved accuracies of 92% for anxious versus neutral, 92% for anxious versus non-anxious, and 93% for neutral versus non-anxious comparisons. These outcomes underscore the feasibility of smartphone-based anxiety detection systems and their potential role in advancing personalized mental health care.
References
- Men (2022) 2022. Mental disorders. https://www.who.int/news-room/fact-sheets/detail/mental-disorders. (Accessed on 12/31/2024).
- Alnuaim et al. (2022) Abeer Ali Alnuaim, Mohammed Zakariah, Prashant Kumar Shukla, Aseel Alhadlaq, Wesam Atef Hatamleh, Hussam Tarazi, R Sureshbabu, and Rajnish Ratna. 2022. Human-computer interaction for recognizing speech emotions using multilayer perceptron classifier. Journal of Healthcare Engineering 2022, 1 (2022), 6005446.
- Argyle (1978) Michael Argyle. 1978. Non-verbal communication and mental disorder. Psychological Medicine 8, 4 (1978), 551–554.
- Asher et al. (2020) Maya Asher, Amitay Kauffmann, and Idan M Aderka. 2020. Out of sync: nonverbal synchrony in social anxiety disorder. Clinical Psychological Science 8, 2 (2020), 280–294.
- Baltrušaitis et al. (2016) Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. Openface: an open source facial behavior analysis toolkit. In 2016 IEEE winter conference on applications of computer vision (WACV). IEEE, 1–10.
- Baltrušaitis (2019) Tadas Baltrušaitis. 2019. OpenFace Output Format. https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format. (Accessed on 12/31/2024).
- Bannister et al. (2022) Jordan J Bannister, Hailey Juszczak, Jose David Aponte, David C Katz, P Daniel Knott, Seth M Weinberg, Benedikt Hallgrímsson, Nils D Forkert, and Rahul Seth. 2022. Sex differences in adult facial three-dimensional morphology: application to gender-affirming facial surgery. Facial Plastic Surgery & Aesthetic Medicine 24, S2 (2022), S–24.
- Bhatti et al. (2024) Anubhav Bhatti, Behnam Behinaein, Paul Hungler, and Ali Etemad. 2024. Attx: Attentive cross-connections for fusion of wearable signals in emotion recognition. ACM Transactions on Computing for Healthcare 5, 3 (2024), 1–24.
- Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45 (2001), 5–32.
- Brereton and Lloyd (2010) Richard G Brereton and Gavin R Lloyd. 2010. Support vector machines for classification and regression. Analyst 135, 2 (2010), 230–267.
- Chand and Marwaha (2024) S. P. Chand and R. Marwaha. 2024. Anxiety (updated 2023 Apr 24 ed.). StatPearls Publishing, Treasure Island (FL). Available from: https://www.ncbi.nlm.nih.gov/books/NBK470361/.
- Chen et al. (2023) Jiemiao Chen, Esther van den Bos, Julian D Karch, and P Michiel Westenberg. 2023. Social anxiety is related to reduced face gaze during a naturalistic social interaction. Anxiety, Stress, & Coping 36, 4 (2023), 460–474.
- Cheong et al. (2023) Jiaee Cheong, Selim Kuzucu, Sinan Kalkan, and Hatice Gunes. 2023. Towards Gender Fairness for Mental Health Prediction. In IJCAI. 5932–5940.
- Chu et al. (2023) Charlene H Chu, Simon Donato-Woodger, Shehroz S Khan, Rune Nyrup, Kathleen Leslie, Alexandra Lyn, Tianyu Shi, Andria Bianchi, Samira Abbasgholizadeh Rahimi, and Amanda Grenier. 2023. Age-related bias and artificial intelligence: a scoping review. Humanities and Social Sciences Communications 10, 1 (2023), 1–17.
- Cohn et al. (2009) Jeffrey F Cohn, Tomas Simon Kruez, Iain Matthews, Ying Yang, Minh Hoai Nguyen, Margara Tejera Padilla, Feng Zhou, and Fernando De la Torre. 2009. Detecting depression from facial actions and vocal prosody. In 2009 3rd international conference on affective computing and intelligent interaction and workshops. IEEE, 1–7.
- Colizzi et al. (2020) Marco Colizzi, Antonio Lasalvia, and Mirella Ruggeri. 2020. Prevention and early intervention in youth mental health: is it time for a multidisciplinary and trans-diagnostic model for care? International journal of mental health systems 14 (2020), 1–14.
- Fischer and LaFrance (2015) Agneta Fischer and Marianne LaFrance. 2015. What drives the smile and the tear: Why women are more emotionally expressive than men. Emotion Review 7, 1 (2015), 22–29.
- Foley and Gentile (2010) Gretchen N Foley and Julie P Gentile. 2010. Nonverbal communication in psychotherapy. Psychiatry (Edgmont) 7, 6 (2010), 38.
- Gavrilescu and Vizireanu (2019) Mihai Gavrilescu and Nicolae Vizireanu. 2019. Predicting depression, anxiety, and stress levels from videos using the facial action coding system. Sensors 19, 17 (2019), 3693.
- Giannakakis et al. (2017) Giorgos Giannakakis, Matthew Pediaditis, Dimitris Manousos, Eleni Kazantzaki, Franco Chiarugi, Panagiotis G Simos, Kostas Marias, and Manolis Tsiknakis. 2017. Stress and anxiety detection using facial cues from videos. Biomedical Signal Processing and Control 31 (2017), 89–101.
- Gilboa-Schechtman and Shachar-Lavie (2013) Eva Gilboa-Schechtman and Iris Shachar-Lavie. 2013. More than a face: a unified theoretical perspective on nonverbal social cue processing in social anxiety. Frontiers in human neuroscience 7 (2013), 904.
- Grimm et al. (2022) Bradley Grimm, Brett Talbot, and Loren Larsen. 2022. PHQ-V/GAD-V: Assessments to Identify Signals of Depression and Anxiety from Patient Video Responses. Applied Sciences 12, 18 (2022), 9150.
- Harshit et al. (2024) Nandigramam Sai Harshit, Nilesh Kumar Sahu, and Haroon R Lone. 2024. Eyes Speak Louder: Harnessing Deep Features From Low-Cost Camera Video for Anxiety Detection. In Proceedings of the Workshop on Body-Centric Computing Systems. 23–28.
- Horigome et al. (2020) Toshiro Horigome, Brian Sumali, Momoko Kitazawa, Michitaka Yoshimura, Kuo-ching Liang, Yuki Tazawa, Takanori Fujita, Masaru Mimura, and Taishiro Kishimoto. 2020. Evaluating the severity of depressive symptoms using upper body motion captured by RGB-depth sensors and machine learning in a clinical interview setting: a preliminary study. Comprehensive psychiatry 98 (2020), 152169.
- Hosmer Jr et al. (2013) David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. 2013. Applied logistic regression. John Wiley & Sons.
- Kiranyaz et al. (2021) Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, and Daniel J Inman. 2021. 1D convolutional neural networks and applications: A survey. Mechanical systems and signal processing 151 (2021), 107398.
- Kohavi et al. (1995) Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Vol. 14. Montreal, Canada, 1137–1145.
- Kring and Gordon (1998) Ann M Kring and Albert H Gordon. 1998. Sex differences in emotion: expression, experience, and physiology. Journal of personality and social psychology 74, 3 (1998), 686.
- Lundberg (2017) Scott Lundberg. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017).
- Mo et al. (2024) Haimiao Mo, Siu Cheung Hui, Xiao Liao, Yuchen Li, Wei Zhang, and Shuai Ding. 2024. A multimodal data-driven framework for anxiety screening. IEEE Transactions on Instrumentation and Measurement (2024).
- Nepal et al. (2024) Subigya Nepal, Arvind Pillai, Weichen Wang, Tess Griffin, Amanda C Collins, Michael Heinz, Damien Lekkas, Shayan Mirjafari, Matthew Nemesure, George Price, et al. 2024. MoodCapture: Depression Detection Using In-the-Wild Smartphone Images. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–18.
- Parkins (2012) Róisín Parkins. 2012. Gender and emotional expressiveness: An analysis of prosodic features in emotional expression. Griffith University Nathan, QLD.
- Perez and Riggio (2003) John E Perez and Ronald E Riggio. 2003. Nonverbal social skills and psychopathology. Nonverbal behavior in clinical settings (2003), 17–44.
- Rashid et al. (2020) Haroon Rashid, Sanjana Mendu, Katharine E Daniel, Miranda L Beltzer, Bethany A Teachman, Mehdi Boukhechba, and Laura E Barnes. 2020. Predicting subjective measures of social anxiety from sparsely collected mobile sensor data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4, 3 (2020), 1–24.
- Ringeval et al. (2019) Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, Leili Tavabi, Maximilian Schmitt, Sina Alisamir, Shahin Amiriparian, Eva-Maria Messner, et al. 2019. AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/visual Emotion Challenge and Workshop. 3–12.
- Ringeval et al. (2017) Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. 2017. Avec 2017: Real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th annual workshop on audio/visual emotion challenge. 3–9.
- Salekin et al. (2018) Asif Salekin, Jeremy W Eberle, Jeffrey J Glenn, Bethany A Teachman, and John A Stankovic. 2018. A weakly supervised learning framework for detecting social anxiety and depression. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies 2, 2 (2018), 1–26.
- Schmidt et al. (2018) Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. 2018. Introducing wesad, a multimodal dataset for wearable stress and affect detection. In Proceedings of the 20th ACM international conference on multimodal interaction. 400–408.
- Schneier et al. (2011) Franklin R Schneier, Thomas L Rodebaugh, Carlos Blanco, Hillary Lewin, and Michael R Liebowitz. 2011. Fear and avoidance of eye contact in social anxiety disorder. Comprehensive psychiatry 52, 1 (2011), 81–87.
- Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
- Schuller et al. (2011) Björn Schuller, Michel Valstar, Florian Eyben, Gary McKeown, Roddy Cowie, and Maja Pantic. 2011. Avec 2011–the first international audio/visual emotion challenge. In Affective Computing and Intelligent Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9–12, 2011, Proceedings, Part II. Springer, 415–424.
- Shafique et al. (2022) Sara Shafique, Iftikhar Ahmed Khan, Sajid Shah, Waqas Jadoon, Rab Nawaz Jadoon, and Mohammed ElAffendi. 2022. Towards Automatic Detection of Social Anxiety Disorder via Gaze Interaction. Applied Sciences 12, 23 (2022), 12298.
- Shatz et al. (2024) Hallel Shatz, Roni Oren-Yagoda, and Idan M Aderka. 2024. Nonverbal synchrony in diagnostic interviews of individuals with social anxiety disorder. Journal of Anxiety Disorders 101 (2024), 102803.
- Sokolova and Lapalme (2009) Marina Sokolova and Guy Lapalme. 2009. A systematic analysis of performance measures for classification tasks. Information processing & management 45, 4 (2009), 427–437.
- Song and Ying (2015) Yan-Yan Song and LU Ying. 2015. Decision tree methods: applications for classification and prediction. Shanghai archives of psychiatry 27, 2 (2015), 130.
- Sun et al. (2022) Zhaodong Sun, Alexander Vedernikov, Virpi-Liisa Kykyri, Mikko Pohjola, Miriam Nokia, and Xiaobai Li. 2022. Estimating stress in online meetings by remote physiological signal and behavioral features. In Adjunct Proceedings of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2022 ACM International Symposium on Wearable Computers. 216–220.
- Szuhany and Simon (2022) Kristin L Szuhany and Naomi M Simon. 2022. Anxiety disorders: a review. JAMA 328, 24 (2022), 2431–2445.
- Valstar et al. (2016) Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. 2016. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th international workshop on audio/visual emotion challenge. 3–10.
- Valstar et al. (2014) Michel Valstar, Björn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek Krajewski, Roddy Cowie, and Maja Pantic. 2014. Avec 2014: 3d dimensional affect and depression recognition challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge. 3–10.
- Valstar et al. (2013) Michel Valstar, Björn Schuller, Kirsty Smith, Florian Eyben, Bihan Jiang, Sanjay Bilakhia, Sebastian Schnieder, Roddy Cowie, and Maja Pantic. 2013. Avec 2013: the continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. 3–10.
- Wang et al. (2021) Chen Wang, Lizhong Liang, Xiaofeng Liu, Yao Lu, Jihong Shen, Hui Luo, and Wanqing Xie. 2021. Multimodal fusion diagnosis of depression and anxiety based on face video. In 2021 IEEE International Conference on Medical Imaging Physics and Engineering (ICMIPE). IEEE, 1–7.
- Waxer (1977) Peter H Waxer. 1977. Nonverbal cues for anxiety: an examination of emotional leakage. Journal of abnormal psychology 86, 3 (1977), 306.
- Weeks et al. (2011) Justin W Weeks, Richard G Heimberg, and Reinhardt Heuer. 2011. Exploring the role of behavioral submissiveness in social anxiety. Journal of Social and Clinical Psychology 30, 3 (2011), 217–249.
- Weeks et al. (2019) Justin W Weeks, Ashley N Howell, Akanksha Srivastav, and Philippe R Goldin. 2019. “Fear guides the eyes of the beholder”: Assessing gaze avoidance in social anxiety disorder via covert eye tracking of dynamic social stimuli. Journal of anxiety disorders 65 (2019), 56–63.
- Xiong et al. (2022) Peng Xiong, Min Liu, Bo Liu, and Brian J Hall. 2022. Trends in the incidence and DALYs of anxiety disorders at the global, regional, and national levels: Estimates from the Global Burden of Disease Study 2019. Journal of Affective Disorders 297 (2022), 83–93.
- Zhang (2016) Zhongheng Zhang. 2016. Introduction to machine learning: k-nearest neighbors. Annals of translational medicine 4, 11 (2016).