Gaze Perception in Humans and CNN-Based Model
Abstract
Making accurate inferences about other individuals’ locus of attention is essential for human social interactions and will be important for AI systems that interact effectively with humans. In this study, we compare how a convolutional neural network (CNN)-based gaze model and humans infer the locus of attention in images of real-world scenes in which several individuals look at a common location. We show that, compared to the model, humans’ estimates of the locus of attention are more influenced by the context of the scene, such as the presence of the attended target and the number of individuals in the image.
1 Introduction
Humans develop the ability to make inferences about others’ visual attention during infancy (Farroni et al., 2002). Most studies on social attention present the observer with a single face providing a gaze cue (Bayliss et al., 2004; Driver et al., 1999; Friesen & Kingstone, 1998; Hietanen, 2002). The process might seem deceptively simple, but in real-life scenarios, humans need to integrate eye gaze, head orientation, and body posture to make these judgments (Azarian et al., 2017; Frischen et al., 2007; Hietanen, 1999).
Estimating humans’ visual focus of attention is also important for enabling more natural human-computer interaction and has applications in the early diagnosis of conditions that result in atypical social attention, such as autism spectrum disorder (Senju & Johnson, 2009; Chawarska et al., 2003). Recently, there have been numerous developments in computer estimation of gaze or visual focus of attention from images or videos (Recasens, 2016; Kellnhofer et al., 2019; Chong et al., 2020; Fischer et al., 2018; Zhang et al., 2015).
Recent improvements in machine vision have yielded performance comparable to humans in tasks such as visual recognition. In addition, the similar hierarchical processing of visual information between CNNs and the human ventral visual cortex in tasks such as object and scene recognition has further motivated comparisons between machines and humans (Devereux et al., 2018; Cichy et al., 2016; Serre, 2019). For example, studies have compared humans and deep neural networks across a variety of tasks, including contour detection and visual reasoning with synthetic tasks (Firestone, 2020), perception of object silhouettes (Pramod & Arun, 2016), abstraction and reasoning (Chollet, 2019), and visual search with real-world scenes (Eckstein et al., 2017) or medical images (Lago et al., 2021). However, little is known about how current gaze estimation models perform compared to humans and what their limitations are in real-life scenarios.
Here we use a lab-created video dataset with ground truth about the locus of attention to compare a state-of-the-art convolutional neural network-based model (Chong et al., 2020) to human performance. We assess how human and model errors vary with various image properties to understand the similarities and differences between human and model perception of the locus of attention. Specifically, we ask the following questions: 1. How accurate are humans and the model in their estimations? 2. Are human and model errors correlated across images? 3. How does the number of individuals looking at a common spatial location affect human and model estimations? 4. How does the presence of the attended target affect humans and the model? 5. Are humans more consistent than the model in making these estimations?
2 Methods
The stimuli used for evaluating human and model performance were image frames extracted from videos originally recorded at the University of California, Santa Barbara. The videos included indoor and outdoor scenes such as classrooms, areas outside campus buildings, and dining halls. In each video, multiple students followed an instruction to simultaneously look toward a designated target person. We refer to the individuals in the video looking toward the designated target person as gaze-orienting individuals.
For each video clip, we extracted the image frame in which all the gaze-orienting individuals were looking at the designated target person. We manually segmented each individual’s body outline. We randomly selected some gaze-orienting individuals to be deleted from the image, replacing the RGB values of the pixels contained within their body outlines with the RGB values of the immediate background, which could be extracted from other frames of the video. By varying which gaze-orienting individuals were deleted, we created multiple (2-4) versions of the same image with different gaze-orienting individuals. In addition, we manipulated the number of gaze-orienting people present in the images (one, two, or three gaze-orienting individuals). For each image stimulus, we also created a version with the designated target person present in the image (see Figure 1).
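The image editing step can be summarized with a minimal sketch, assuming binary body masks and a clean background frame are available (the names below are illustrative, not part of our tooling):

```python
import numpy as np

def remove_person(frame, background_frame, body_mask):
    """Delete one gaze-orienting individual from a frame by replacing the
    pixels inside their segmented body outline with the corresponding
    background pixels taken from another frame of the same video.

    frame, background_frame: H x W x 3 uint8 RGB images
    body_mask: H x W boolean array, True inside the body outline
    """
    edited = frame.copy()
    edited[body_mask] = background_frame[body_mask]
    return edited

# Example: create a two-person version of a three-person image
# (frames and masks are hypothetical placeholders)
# img_two = remove_person(img_three, clean_background, mask_person_3)
```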

2.1 Human behavioral study
Two hundred participants (at least 18 years old) were recruited through a Human Intelligence Task (HIT) posted on Amazon Mechanical Turk. One hundred participants were assigned to the target-present condition and the other hundred to the target-absent condition. All participants were asked to make an explicit perceptual estimate of the locus of gaze by clicking on the location in the image where they thought all the gaze-orienting individuals were looking. Sixty images extracted from 60 videos were presented in random order to each subject (18 images had one gaze-orienting person, 18 had two, and 24 had three). Subjects had unlimited viewing time and could adjust their selections before moving to the next image. Importantly, each participant viewed only one version of the image from a given video, to prevent memory from interfering with their judgments.
2.2 Model Implementation
We used a state-of-the-art model pre-trained to predict attended targets in videos (Chong et al., 2020). The model takes the whole scene image and a cropped region of one person’s head as input. These images are passed through two convolutional pathways to extract features. For the head region, the model also creates an attention map. After combining the scene features, head features, and attention map, the model uses an encoder, a Conv-LSTM, and deconvolution layers to produce a probability map indicating the predicted attended location in the image. Note that the current version of the model only makes a prediction for a single gaze-orienting person at a time.
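As a rough sketch of how per-person inference proceeds, the loop below runs the network once per gaze-orienting individual and collects one probability map per person. The call signature of `model` is a hypothetical wrapper, not the released interface of Chong et al. (2020):

```python
def predict_attention_maps(model, scene_image, head_boxes):
    """Run the gaze model once per gaze-orienting person.

    model: hypothetical wrapper mapping (scene image, head crop, head box)
           to an H x W probability map over the scene (assumption, not the
           released interface of Chong et al., 2020).
    scene_image: H x W x 3 RGB image of the full scene.
    head_boxes: list of (x0, y0, x1, y1) head bounding boxes, one per person.
    """
    prob_maps = []
    for (x0, y0, x1, y1) in head_boxes:
        head_crop = scene_image[y0:y1, x0:x1]
        prob_maps.append(model(scene_image, head_crop, (x0, y0, x1, y1)))
    return prob_maps
```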
3 Results
3.1 Data Analysis
For the human estimate of the locus of attention, we used the mean x, y coordinates of all participants’ location selections. For the model’s estimate, we used the peak location of the probability map for each gaze-orienting individual in the image and computed the mean of these peak locations. Note that the current model takes as input the original image and only one individual’s head region at a time, rather than multiple head regions simultaneously. This is a common limitation of current models and could contribute to the difference between machines and humans in their ability to integrate contextual information to perform the task (Chong et al., 2020). We also tested other methods of integrating the model’s estimates across gazing individuals, including the median and the intersection of the gaze direction vectors pointing to the maxima of the prediction maps. The mean estimate resulted in the highest accuracy.
To calculate human and model estimation error, we computed the Euclidean pixel distance between the human or model estimate and the designated target person’s head location. Sections 3.2-3.3 present data based on target-absent images, while Section 3.4 presents data based on both target-absent and target-present images.
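The estimates and errors described above reduce to a few lines of array arithmetic; a minimal sketch (array shapes are assumptions for illustration, not our exact pipeline):

```python
import numpy as np

def human_estimate(clicks):
    """Mean (x, y) of all participants' click locations for one image."""
    return np.mean(np.asarray(clicks, dtype=float), axis=0)

def model_estimate(prob_maps):
    """Peak (x, y) of each gaze-orienting person's probability map,
    averaged across persons."""
    peaks = []
    for prob_map in prob_maps:
        y, x = np.unravel_index(np.argmax(prob_map), prob_map.shape)
        peaks.append((x, y))
    return np.mean(np.asarray(peaks, dtype=float), axis=0)

def estimation_error(estimate, target_head_xy):
    """Euclidean pixel distance between an estimate and the designated
    target person's head location."""
    return float(np.linalg.norm(np.asarray(estimate) - np.asarray(target_head_xy)))
```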
3.2 Correlation between Humans and Model Perception
We assessed whether human estimates agree more with those of other humans than with those of the model. Because most of the variance in the attended target location across images is in the horizontal direction (compared to the vertical direction; ratio of variances = 18), we calculated the correlation of x-coordinate estimates between pairs of humans and between human-model pairings. We sampled 10,000 random human pairs and all possible human-model pairings. The mean correlation between humans and the model was comparable to the mean correlation between pairs of humans (Appendix Figure 4).
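A minimal sketch of the pairwise-correlation analysis, assuming x-coordinate estimates are stored with one row per subject and one column per image (the array layout is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_human_pair_correlation(x_estimates, n_pairs=10_000):
    """Average Pearson correlation of x-coordinate estimates across images
    for randomly sampled pairs of human subjects.
    x_estimates: (n_subjects, n_images) array."""
    n_subjects = x_estimates.shape[0]
    correlations = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_subjects, size=2, replace=False)
        correlations.append(np.corrcoef(x_estimates[i], x_estimates[j])[0, 1])
    return float(np.mean(correlations))

def mean_human_model_correlation(x_estimates, model_x):
    """Average correlation between each subject's estimates and the model's
    estimates (model_x: length n_images array)."""
    correlations = [np.corrcoef(subj, model_x)[0, 1] for subj in x_estimates]
    return float(np.mean(correlations))
```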
We also evaluated whether humans tend to be more consistent than the model in their overall estimates across images. With 10,000 bootstrapped samples of human and model errors, we found a significantly smaller variance of human errors compared to model errors, indicating that humans make more consistent estimates across different images.
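The variance comparison can be sketched with a simple bootstrap over per-image errors (a sketch, assuming errors are stored as one value per image):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_error_variance(errors, n_boot=10_000):
    """Bootstrap distribution of the variance of per-image errors."""
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    return np.array([np.var(rng.choice(errors, size=n, replace=True))
                     for _ in range(n_boot)])

# One way to compare consistency: the fraction of bootstrap samples in which
# the human error variance is at least as large as the model error variance.
# p = np.mean(bootstrap_error_variance(human_errors) >=
#             bootstrap_error_variance(model_errors))
```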
3.3 Effect of Number of Gaze Orienting Individuals on Estimation Accuracy
First, we conducted a one-way ANOVA to test the effect of the number of looking individuals present in the image on both human and model estimation error. We found a significant main effect of the number of looking individuals on human estimation error; a Tukey post hoc test showed a significantly larger error for images with only one looking individual than for images with two or three individuals. In contrast, varying the number of looking individuals in the images did not significantly influence the model’s estimation error (Figure 2). A 3 (number of gaze-orienting people) × 2 (human vs. model) two-way ANOVA also confirmed a significant interaction between the number of gaze-orienting people and estimation source. These results indicate that humans are able to integrate multiple gaze cues (individuals) to improve estimation accuracy, while the model does not benefit from the additional information.
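A minimal sketch of these tests with scipy/statsmodels, assuming a hypothetical long-format table with one error value per image and estimation source (the column names are illustrative):

```python
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_looker_effect(df: pd.DataFrame):
    """df columns (assumed): error (pixels), n_lookers (1, 2, or 3),
    source ("human" or "model")."""
    human = df[df["source"] == "human"]

    # One-way ANOVA: effect of the number of looking individuals on human error
    f_stat, p_val = f_oneway(
        *(human.loc[human["n_lookers"] == k, "error"] for k in (1, 2, 3))
    )

    # Tukey HSD post hoc comparisons between looker counts
    tukey = pairwise_tukeyhsd(human["error"], human["n_lookers"])

    # 3 (number of lookers) x 2 (human vs. model) two-way ANOVA
    two_way = smf.ols("error ~ C(n_lookers) * C(source)", data=df).fit()
    anova_table = anova_lm(two_way, typ=2)

    return f_stat, p_val, tukey, anova_table
```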

3.4 Effect of the Presence of Target
We evaluated human and model estimation error when the designated “looked-at” target was present vs. absent. Figures 3a and 3b show the co-registered locus-of-attention estimates relative to the location of the target person (origin), for humans (3a) and the model (3b). A two-way ANOVA with factors estimation source (human vs. model) × target presence (present vs. absent) showed a significant main effect of target presence and a significant interaction between estimation source and target presence. A Tukey post hoc test showed that: 1. Both humans and the model had significantly lower errors when the target was present than when it was absent. 2. Humans had significantly higher errors than the model when the target was absent, but significantly lower errors when the target was present. This indicates that the presence of the attended target in the image helps both humans and the model make more accurate judgments about the locus of attention, but with a much larger benefit for humans (Figure 3c).
The results also show a significantly larger amplitude of human estimation errors relative to the model in the vertical direction, regardless of the presence of the attended target (Figure 3d). For errors in the horizontal direction, there was no significant difference between humans and the model when the target was absent (Figure 3e). However, when the attended target was present, human errors in the horizontal direction were significantly smaller than the model’s, indicating a strong influence of the presence of the attended target on humans that the model did not show.
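The co-registration and the split into horizontal and vertical error components can be sketched as follows (array shapes are assumptions for illustration):

```python
import numpy as np

def coregistered_errors(estimates, target_xy):
    """Express each (x, y) estimate relative to the target person's head
    location, so that the target sits at the origin (as in Figures 3a-b).

    estimates: (n_images, 2) array of estimated (x, y) locations
    target_xy: (n_images, 2) array of target head (x, y) locations
    Returns the signed horizontal and vertical error components and the
    Euclidean error for each image."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(target_xy, dtype=float)
    horizontal, vertical = diff[:, 0], diff[:, 1]
    euclidean = np.linalg.norm(diff, axis=1)
    return horizontal, vertical, euclidean
```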

4 Conclusion
We found a significant correlation between the estimates of the locus of attention made by humans and by a state-of-the-art CNN-based model. Yet we also found some fundamental differences. Humans are more sensitive than the CNN-based model to scene context: the presence of multiple individuals directing their gaze toward a common location and the presence of the attended target in the scene reduced estimation error much more for humans than for the model. Together, the results suggest that humans rely more than current CNN models on multiple sources of scene information to infer social attention. The discrepancy might arise because most current models use only the whole image and a single head region to estimate gaze. Our findings suggest the need to develop models that integrate more contextual information and multiple sources of cues to achieve better performance in gaze estimation.
Acknowledgments
The research was sponsored by the U.S. Army Research Office and was accomplished under Contract Number W911NF-19-D-0001 for the Institute for Collaborative Biotechnologies. MPE was supported by a Guggenheim Foundation Fellowship. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References
- Azarian et al. (2017) Bobby Azarian, George A. Buzzell, Elizabeth G. Esser, Alexander Dornstauder, and Matthew S. Peterson. Averted body postures facilitate orienting of the eyes. Acta Psychologica, 175:28–32, April 2017. ISSN 1873-6297. doi: 10.1016/j.actpsy.2017.02.006.
- Bayliss et al. (2004) Andrew P. Bayliss, Giuseppe di Pellegrino, and Steven P. Tipper. Orienting of attention via observed eye gaze is head-centred. Cognition, 94(1):B1–10, November 2004. ISSN 0010-0277. doi: 10.1016/j.cognition.2004.05.002.
- Chawarska et al. (2003) Katarzyna Chawarska, Ami Klin, and Fred Volkmar. Automatic attention cueing through eye movement in 2-year-old children with autism. Child Development, 74(4):1108–1122, August 2003. ISSN 0009-3920. doi: 10.1111/1467-8624.00595.
- Chollet (2019) François Chollet. On the Measure of Intelligence. arXiv:1911.01547 [cs], November 2019. URL http://arxiv.org/abs/1911.01547. arXiv: 1911.01547.
- Chong et al. (2020) Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg. Detecting Attended Visual Targets in Video. arXiv:2003.02501 [cs], March 2020. URL http://arxiv.org/abs/2003.02501. arXiv: 2003.02501.
- Cichy et al. (2016) Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, June 2016. ISSN 2045-2322. doi: 10.1038/srep27755. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4901271/.
- Devereux et al. (2018) Barry J. Devereux, Alex Clarke, and Lorraine K. Tyler. Integrated deep visual and semantic attractor neural networks predict fMRI pattern-information along the ventral object processing pathway. Scientific Reports, 8(1):10636, July 2018. ISSN 2045-2322. doi: 10.1038/s41598-018-28865-1. URL https://www.nature.com/articles/s41598-018-28865-1.
- Driver et al. (1999) Jon Driver, Greg Davis, Paola Ricciardelli, Polly Kidd, Emma Maxwell, and Simon Baron-cohen. Gaze perception triggers reflexive visuospatial orienting. Visual Cognition, pp. 509–540, 1999.
- Eckstein et al. (2017) Miguel P. Eckstein, Kathryn Koehler, Lauren E. Welbourne, and Emre Akbas. Humans, but Not Deep Neural Networks, Often Miss Giant Targets in Scenes. Current Biology, 27(18):2827–2832.e3, September 2017. ISSN 0960-9822. doi: 10.1016/j.cub.2017.07.068. URL https://www.sciencedirect.com/science/article/pii/S0960982217309727.
- Farroni et al. (2002) Teresa Farroni, Gergely Csibra, Francesca Simion, and Mark H. Johnson. Eye contact detection in humans from birth. Proceedings of the National Academy of Sciences, 99(14):9602 –9605, July 2002. doi: 10.1073/pnas.152159999. URL http://www.pnas.org/content/99/14/9602.abstract.
- Firestone (2020) Chaz Firestone. Performance vs. competence in human–machine comparisons. Proceedings of the National Academy of Sciences, 117(43):26562–26571, October 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1905334117. URL https://www.pnas.org/content/117/43/26562.
- Fischer et al. (2018) Tobias Fischer, Hyung Jin Chang, and Yiannis Demiris. RT-GENE: Real-Time Eye Gaze Estimation in Natural Environments. pp. 334–352, 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/html/Tobias_Fischer_RT-GENE_Real-Time_Eye_ECCV_2018_paper.html.
- Friesen & Kingstone (1998) Chris Kelland Friesen and Alan Kingstone. The eyes have it! Reflexive orienting is triggered by nonpredictive gaze. Psychonomic Bulletin & Review, 5(3):490–495, 1998. ISSN 1531-5320 (Electronic), 1069-9384 (Print). doi: 10.3758/BF03208827.
- Frischen et al. (2007) Alexandra Frischen, Andrew P. Bayliss, and Steven P. Tipper. Gaze cueing of attention: visual attention, social cognition, and individual differences. Psychological Bulletin, 133(4):694–724, July 2007. ISSN 0033-2909. doi: 10.1037/0033-2909.133.4.694.
- Hietanen (1999) J. K. Hietanen. Does your gaze direction and head orientation shift my visual attention? Neuroreport, 10(16):3443–3447, November 1999. ISSN 0959-4965. doi: 10.1097/00001756-199911080-00033.
- Hietanen (2002) Jari K. Hietanen. Social attention orienting integrates visual information from head and body orientation. Psychological Research, 66(3):174–179, August 2002. ISSN 0340-0727. doi: 10.1007/s00426-002-0091-8.
- Kellnhofer et al. (2019) Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically Unconstrained Gaze Estimation in the Wild. pp. 6912–6921, 2019. URL https://openaccess.thecvf.com/content_ICCV_2019/html/Kellnhofer_Gaze360_Physically_Unconstrained_Gaze_Estimation_in_the_Wild_ICCV_2019_paper.html.
- Lago et al. (2021) Miguel A. Lago, Aditya Jonnalagadda, Craig K. Abbey, Bruno B. Barufaldi, Predrag R. Bakic, Andrew D. A. Maidment, Winifred K. Leung, Susan P. Weinstein, Brian S. Englander, and Miguel P. Eckstein. Under-exploration of Three-Dimensional Images Leads to Search Errors for Small Salient Targets. Current Biology, January 2021. ISSN 0960-9822. doi: 10.1016/j.cub.2020.12.029. URL https://www.sciencedirect.com/science/article/pii/S0960982220318819.
- Pramod & Arun (2016) R. T. Pramod and S. P. Arun. Do computational models differ systematically from human object perception? In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1601–1609, 2016. doi: 10.1109/CVPR.2016.177. ISSN: 1063-6919.
- Recasens (2016) Adrià Recasens Continente. Where are they looking? Thesis, Massachusetts Institute of Technology, 2016. URL https://dspace.mit.edu/handle/1721.1/106078.
- Senju & Johnson (2009) Atsushi Senju and Mark H. Johnson. Atypical eye contact in autism: Models, mechanisms and development. Neuroscience & Biobehavioral Reviews, 33(8):1204–1214, September 2009. ISSN 0149-7634. doi: 10.1016/j.neubiorev.2009.06.001. URL https://www.sciencedirect.com/science/article/pii/S0149763409000827.
- Serre (2019) Thomas Serre. Deep Learning: The Good, the Bad, and the Ugly. Annual Review of Vision Science, 5(1):399–426, 2019. doi: 10.1146/annurev-vision-091718-014951. URL https://doi.org/10.1146/annurev-vision-091718-014951.
- Zhang et al. (2015) X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearance-based gaze estimation in the wild. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4511–4520, June 2015. doi: 10.1109/CVPR.2015.7299081. ISSN: 1063-6919.
Appendix A Appendix
