Semi-Supervised Learning for Eye Image Segmentation
Abstract.
Recent advances in appearance-based models have improved eye-tracking performance in difficult scenarios such as occlusion due to eyelashes, eyelids, or camera placement, and environmental reflections on the cornea and glasses. The key reason for the improvement is the accurate and robust identification of eye parts (pupil, iris, and sclera regions). This improved accuracy often comes at the cost of labeling an enormous dataset, which is complex and time-consuming. This work presents two semi-supervised learning frameworks that identify eye parts by taking advantage of unlabeled images where labeled datasets are scarce. Leveraging domain-specific augmentations and novel spatially varying transformations for image segmentation, these frameworks show improved performance on various test cases. For instance, for a model trained on just 48 labeled images, the two frameworks achieve improvements of 0.38% and 0.65% in segmentation performance over a baseline model trained only with the labeled dataset.
1. Introduction
Effective gaze tracking can improve human-machine interaction by giving clues to users' behaviors and intentions. It can also enhance the experience of virtual, augmented, and mixed reality devices with efficient and realistic rendering by supporting foveated rendering and multi-focal displays that minimize vergence-accommodation conflicts (Wu et al., 2019; Garbin et al., 2019). Recent advances in appearance-based gaze estimation (Wu et al., 2019; Palmero et al., 2020a; Park et al., 2020) have shown robustness in person-independent scenarios and under the effects of environmental reflections on the cornea and eyeglasses, occlusion due to eyelashes and camera placement, and heavy eye makeup. A backbone of most appearance-based models is the proper identification/segmentation of different parts of the eye. Recently, utilizing advances in deep learning, Yiu et al. (2019); Chaudhary et al. (2019); Kothari et al. (2020a); Wu et al. (2019) proposed models that segment various eye parts accurately in challenging cases.
The success of these methods is often predicated on the availability of large, curated training datasets of eye images with well-annotated labels. Such requirements are, however, difficult to satisfy in general. The acquisition process for an eye image segmentation dataset requires a highly sophisticated and costly environment and significant (often tedious) effort by experts. Even then, the dataset's quality may be limited by factors like labeler bias and inconsistency (Kothari et al., 2020b). As such, the acquisition process is difficult, if not impossible, for academic labs, and only a handful of commercially funded labs have attempted to curate large datasets for eye image segmentation (Garbin et al., 2019; Palmero et al., 2020b). In this work, we utilize semi-supervised learning (SSL), which greatly diminishes the requirement for labeled data by leveraging relatively easy-to-acquire unlabeled datasets.
SSL leverages unlabeled data to improve learning from a small labeled dataset (Zhou et al., 2003). The primary goal of SSL algorithms is to avoid over-fitting of the model's parameters to a small amount of labeled data. Formally, in SSL, a dataset $\{x_1, \ldots, x_n\}$ is given, among which only the first $l$ instances (images) are annotated with labels $\{y_1, \ldots, y_l\}$, and the remaining $n - l$ instances are unlabeled. While learning the function $f: \mathcal{X} \rightarrow \mathcal{Y}$, SSL exploits the hidden relationships within the data to predict the labels of the unlabeled data points. One of the common inductive biases used to regularize SSL algorithms is the assumption of consistency of the network function: data points, or their representations, should receive the same label predictions even after perturbation. Various deep learning works have used this idea of consistency to perturb either the data (Laine and Aila, 2017) or their hidden representations (Gyawali et al., 2019) and constrain the label predictions to be similar. These algorithms have demonstrated improved generalization in various domains, such as image classification in computer vision and medical imaging. In semantic segmentation, however, this simple yet effective assumption of consistency is violated: the output (label) space accounts for the spatial position of pixels in the input space. Thus, even with a small perceptual perturbation in the input space, we cannot naively enforce consistency in the output space. Consequently, recent progress in the SSL literature has been mostly confined to classification, and very few works have explored SSL for semantic segmentation (Ouali et al., 2020). For the application of eye segmentation, the use of SSL is further limited.
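To make this concrete, the following minimal PyTorch sketch (a toy one-layer network and random inputs, purely illustrative) contrasts a photometric perturbation, which preserves pixel positions and therefore admits a pixel-wise consistency target, with a spatial perturbation, which does not:

```python
import torch
import torch.nn.functional as F

# Toy per-pixel classifier: input (B, 1, H, W) -> class scores (B, 3, H, W).
net = torch.nn.Conv2d(1, 3, kernel_size=3, padding=1)
x = torch.rand(1, 1, 8, 8)

# Photometric perturbation (brightness scaling) keeps pixels in place,
# so forcing the two per-pixel predictions to agree is a valid constraint.
x_photo = (1.1 * x).clamp(0, 1)
ok = F.mse_loss(net(x).softmax(1), net(x_photo).softmax(1))

# A spatial perturbation moves pixels: the correct prediction for the
# rotated image is the rotated mask, so naive pixel-wise consistency
# between net(x) and net(x_rot) enforces the wrong target.
x_rot = torch.rot90(x, k=1, dims=(2, 3))
wrong = F.mse_loss(net(x).softmax(1), net(x_rot).softmax(1))
```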
This work first presents an SSL approach that utilizes consistency training for semantic segmentation of eye images. We use domain-specific augmentations that do not affect pixel positions to perturb the input eye images and enforce consistency on the resulting model predictions. Following this, we present a novel SSL method that uses an idea from self-supervised learning (Kolesnikov et al., 2019), formulating a new learning task, to strengthen the effect of regularization and improve generalization. This method also allows us to use commonly prescribed spatially varying augmentations, like image rotation and translation, while training the SSL method. We test these two SSL frameworks on a publicly available real eye segmentation dataset (OpenEDS-2019 (Garbin et al., 2019)). We compare the presented methods' performance against a baseline trained only with the labeled dataset. We further investigate the quality of the presented methods for segmenting different eye parts, including the iris and pupil. Finally, we present comparison studies on training the model with labeled data from a single individual against a group of individuals on a fixed test set. In summary, the contributions of this work include:
• Adaptation of an SSL setup for segmenting eye regions with domain-specific augmentations.
• A novel SSL framework to leverage spatially varying augmentations for semantic segmentation.
• Demonstration that a small number of labeled images and a large number of unlabeled images can significantly improve eye image segmentation.
2. Related Work
A number of recent efforts have been made towards semantic segmentation of eye images to obtain the pupil, iris, and sclera regions (Wu et al., 2019; Kothari et al., 2020a; Chaudhary et al., 2019). Wu et al. (2019) consider multiple heterogeneous tasks related to gaze estimation, including semantic segmentation, and propose an end-to-end deep learning method. Similarly, Kothari et al. (2020a) proposed a technique to improve pupil/iris center detection using ellipse segmentation instead of segmenting only the visible eye parts, and Chaudhary et al. (2019) proposed a computationally efficient architecture to segment eye regions. Unlike those works, which used large annotated image datasets to guide the learning process for eye image segmentation, our work considers only a small amount of labeled data.
Semi-supervised learning (SSL) is one of the most widely studied topics in machine learning. Consistency regularization is commonly used in SSL algorithms to exploit the hidden relationship between labeled and unlabeled data (Zhou et al., 2003). Laine and Aila (2017) and Gyawali et al. (2019) used this idea of consistency in deep learning and demonstrated improved performance on image classification tasks. However, these works cannot be applied directly to semantic segmentation, as they enforce consistency in the output space while perturbing pixels in the input space. In recent years, limited attempts have been made toward semi-supervised semantic segmentation (Ouali et al., 2020; Kalluri et al., 2019). For instance, Kalluri et al. (2019) used pixel-level entropy regularization to train their semantic segmentation architecture. While some other related work has explored SSL for eye image segmentation, those studies (e.g., (Shen et al., 2020)) considered extra information in the form of domain labels or used a large number of labeled samples.

3. Methods
This section formulates the problem for the SSL framework. Following this, we present two methods: SSL with domain-specific augmentation (referred to hereafter as $SSL_{DA}$) and a novel SSL with self-supervised learning (referred to hereafter as $SSL_{SS}$).
3.1. Problem formulation
We consider a dataset with $N_L$ labeled training examples $X_L = \{x_1, \ldots, x_{N_L}\}$ with corresponding labels $Y_L = \{y_1, \ldots, y_{N_L}\}$, and $N_U$ unlabeled training samples $X_U = \{x_1, \ldots, x_{N_U}\}$, where $N_U \gg N_L$. Since our task is segmentation, the data space is represented by $\mathcal{X} \subset \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the spatial dimension and $C$ is the number of input channels. Similarly, the label space is represented by $\mathcal{Y} \subset \mathbb{R}^{H \times W \times K}$, where $K$ is the number of classes. We aim to learn parameters $\theta$ for the mapping function $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$, approximated via a deep neural network.
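For concreteness, the shapes implied by this formulation can be sketched as follows (the numeric values are placeholders, not the dimensions used in our experiments):

```python
import torch

H, W, C, K = 64, 64, 1, 4  # spatial size, input channels, classes (illustrative)

x = torch.rand(1, C, H, W)       # a point in the data space X
logits = torch.rand(1, K, H, W)  # raw network output standing in for f_theta(x)
y_hat = logits.softmax(dim=1)    # per-pixel class probabilities in Y
```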
3.2. SSL with domain-specific augmentation
Here, we present the SSL paradigm with domain-specific augmentation ($SSL_{DA}$). To ensure effective use of consistency regularization, we employ augmentations that do not change the image's spatial pixel positions. Prior SSL work utilizes variation in noise perturbations (Ouali et al., 2020). In our observation, eye images from the same hardware setup vary mostly in eye shape, skin/iris pigmentation, lighting conditions, gaze direction, position of the eye with respect to the camera, and blinks. Among these, variations in the contrast/luminance of eye images are referred to as domain-specific augmentation in this paper. Techniques like Contrast Limited Adaptive Histogram Equalization (CLAHE) (Zuiderveld, 1994) and gamma correction have been shown to be advantageous for eye part segmentation (Chaudhary et al., 2019). We leverage these domain-specific augmentations to perform label guessing in our SSL paradigm, as sketched below.
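A minimal sketch of these two domain-specific augmentations using OpenCV (the helper names are ours, and the parameter values anticipate Section 4.2):

```python
import cv2
import numpy as np

def gamma_correct(img: np.ndarray, gamma: float) -> np.ndarray:
    """Gamma correction of an 8-bit grayscale eye image via a lookup table."""
    lut = (255.0 * (np.arange(256) / 255.0) ** gamma).astype(np.uint8)
    return cv2.LUT(img, lut)

def apply_clahe(img: np.ndarray, clip: float, grid: int) -> np.ndarray:
    """Contrast Limited Adaptive Histogram Equalization (Zuiderveld, 1994)."""
    return cv2.createCLAHE(clipLimit=clip, tileGridSize=(grid, grid)).apply(img)

img = np.random.randint(0, 256, (640, 400), dtype=np.uint8)  # stand-in eye image
aug_a = gamma_correct(img, gamma=np.random.uniform(0.8, 1.2))
aug_b = apply_clahe(img, clip=1.5, grid=8)
# Crucially, neither operation moves pixels, so segmentation labels are unchanged.
```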
3.2.1. Guessing Labels
Using the domain augmentation strategies, we create $A$ separate augmented copies $\hat{x}_1, \ldots, \hat{x}_A$ of each data point $x$ and estimate their labels, as shown schematically in Fig. 1 (a). We compute the average of the model's predictions (softmax probabilities) over the $A$ augmented copies of each data point as:
(1) $\bar{y} = \frac{1}{A} \sum_{a=1}^{A} f_\theta(\hat{x}_a)$
Although label guessing in this manner is commonly applied to unlabeled datasets only (Berthelot et al., 2019), we find combining both the labeled and unlabeled datasets to be beneficial. As such, in our case, any unlabeled data point is a random sample from the combination of both datasets $X_L$ and $X_U$. This way of guessing the label encourages the model to be consistent across different augmentations (Gyawali et al., 2020). Following standard practice, we do not propagate gradients through the computation of the guessed labels; note, however, that the guessed labels may change as the segmentation network is updated over training.
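A minimal PyTorch sketch of this label-guessing step (assuming the network returns per-pixel logits; `augment` and `num_copies` are placeholders for the domain-specific augmentation and the number of copies $A$):

```python
import torch

@torch.no_grad()  # gradients are not propagated through label guessing
def guess_labels(model, x, augment, num_copies=2):
    """Average softmax predictions over augmented copies of x (Eq. 1).

    x: image batch (B, C, H, W); augment: spatially invariant augmentation.
    Returns the guessed per-pixel label (B, K, H, W).
    """
    probs = torch.stack(
        [model(augment(x)).softmax(dim=1) for _ in range(num_copies)])
    return probs.mean(dim=0)
```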
3.2.2. SSL Objective
We now have guessed labels $\bar{y}$ for the unlabeled data and true labels $y$ for $X_L$. From $X_L$, we randomly sample a data-target pair $(x, y)$ to calculate the supervised loss. The objective for the supervised setup is $\mathcal{L}_{sup} = \mathcal{L}_{CEL} + \mathcal{L}_{BAL} + \mathcal{L}_{SL}$, where $\mathcal{L}_{CEL}$, $\mathcal{L}_{BAL}$, and $\mathcal{L}_{SL}$ are, respectively, the cross-entropy loss, the boundary aware loss (Ronneberger et al., 2015), and the surface loss (Kervadec et al., 2018). For the unlabeled case, we consider an unsupervised loss (referred to hereafter as $\mathcal{L}_{unsup}$), which is the L2 loss computed between the predicted softmax probabilities $f_\theta(\hat{x}_a)$ and the guessed label $\bar{y}$, as the L2 loss is less susceptible to wrong predictions (Berthelot et al., 2019; Laine and Aila, 2017). As shown in Fig. 1 (c), we sample from the combination of the labeled and unlabeled datasets and consider all the augmented copies (Fig. 1 (c) illustrates one such case) to calculate $\mathcal{L}_{unsup}$. Since all augmented copies $\hat{x}_a$ of a data point share the same guessed label $\bar{y}$, minimizing the unsupervised loss acts as consistency regularization. The overall SSL loss is the combination of $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$, with $\lambda$ as the weighting term between the two loss terms: $\mathcal{L} = \mathcal{L}_{sup} + \lambda \mathcal{L}_{unsup}$.
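A sketch of the resulting objective (for brevity, the supervised term here keeps only the cross-entropy component; the boundary aware and surface losses enter analogously):

```python
import torch.nn.functional as F

def ssl_da_loss(model, x_lab, y_lab, x_aug, y_guess, lam):
    """Supervised cross-entropy plus L2 consistency to the guessed label.

    y_lab: integer masks (B, H, W); x_aug: an augmented copy of a sampled
    image; y_guess: its guessed label from Eq. 1.
    """
    l_sup = F.cross_entropy(model(x_lab), y_lab)
    l_unsup = F.mse_loss(model(x_aug).softmax(dim=1), y_guess)
    return l_sup + lam * l_unsup
```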
3.3. SSL with self-supervised learning
In this section, we propose a novel SSL method ($SSL_{SS}$) for semantic segmentation that utilizes an idea from self-supervised learning (Kolesnikov et al., 2019). Self-supervised learning uses unlabeled data to formulate a pretext learning task, such as predicting context, for which a target objective can be computed without supervision (Kolesnikov et al., 2019). In our case, the pretext task is to predict labels by inverting the model's prediction on a spatially transformed image (Fig. 1 (b)). This also lets us utilize widely used image transformations, including rotation and translation, which we were unable to use in Section 3.2. Note that these image transformations are important to account for variations in the relative position of the eye with respect to the camera. We now present how we perform label guessing, which is the primary difference compared to $SSL_{DA}$.
3.3.1. Guessing Labels
We first apply the same domain augmentation strategies used in Section 3.2.1 to create $A$ separate copies $\hat{x}_1, \ldots, \hat{x}_A$ of each data point $x$. To each of these augmented copies, with probabilities $p_{rot}$ and $p_{trans}$, we further apply the image transformation-based augmentations of rotation and translation, respectively. Collectively, we use $T$ to denote these image transformations. We pass the transformed copies into our segmentation network to obtain the corresponding label predictions and then calculate the inverse transform of these predictions to bring them back into the spatial space where the initial data points reside. Finally, we compute the average of the inverse-transformed model predictions $T^{-1}(f_\theta(T(\hat{x}_a)))$ as the guessed label for all $A$ copies of data point $x$:
(2) $\bar{y} = \frac{1}{A} \sum_{a=1}^{A} T_a^{-1}\left(f_\theta(T_a(\hat{x}_a))\right)$, where $T_a$ denotes the transformation applied to the $a$-th copy.
This is represented schematically in Fig. 1 (b). Since the model must now be consistent under both domain-specific augmentations and image transformations, this method of guessing labels adds a more potent form of regularization than the one presented in Section 3.2.1.
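A sketch of this transform/inverse-transform label guessing, using torchvision rotations as the stand-in for $T$ (translation is handled analogously; border pixels are only approximately recovered after the inverse rotation):

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def guess_labels_tf(model, x, num_copies=2, max_angle=5.0):
    """Guess labels under spatial transforms (Eq. 2): predict on the
    transformed image, then invert the transform on the prediction."""
    probs = []
    for _ in range(num_copies):
        angle = float(torch.empty(1).uniform_(-max_angle, max_angle))
        p_t = model(TF.rotate(x, angle)).softmax(dim=1)  # f(T(x))
        probs.append(TF.rotate(p_t, -angle))             # T^{-1}(f(T(x)))
    return torch.stack(probs).mean(dim=0)
```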
3.3.2. SSL Objective
Similar to Section 3.2.2 and as shown schematically in Fig. 1 (c), we use a combination of supervised ($\mathcal{L}_{sup}$) and unsupervised ($\mathcal{L}_{unsup1}$, $\mathcal{L}_{unsup2}$) losses for our objective function. $\mathcal{L}_{unsup1}$ is the L2 loss computed between the predicted softmax probabilities of the domain-augmented data and the guessed label from Section 3.2.1. $\mathcal{L}_{unsup2}$ is the L2 loss computed between the same predictions and the guessed label from Section 3.3.1. The overall objective function is $\mathcal{L} = \mathcal{L}_{sup} + \lambda_1 \mathcal{L}_{unsup1} + \lambda_2 \mathcal{L}_{unsup2}$, where $\lambda_1$ and $\lambda_2$ are the corresponding weighting terms.
4. Experiments
4.1. Datasets and Evaluation
We evaluated our proposed pipeline on the OpenEDS-2019 dataset. OpenEDS-2019 has well-defined train (8916), validation (2403), and test (1440) sets, where labels for the test set are not publicly available. We therefore evaluate our models on the validation set; qualitative results on the test set are provided as a supplementary video. Note that OpenEDS-2019 contains eye images from 94/28 (train/validation) participants, with multiple images per participant. Further descriptions of the splits and test cases are given in Section 5. All images were resized to a lower resolution for faster computation. In each experimental configuration, models are tested on the fixed test set. Segmentation performance is evaluated with the standard mean Intersection over Union (mIoU) score (Everingham et al., 2005).
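The evaluation metric can be sketched as follows (OpenEDS-2019 has four classes: background, sclera, iris, and pupil):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 4) -> float:
    """Mean Intersection over Union across classes, from integer masks."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```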
4.2. Implementation Details
As the primary purpose of this paper is designing an SSL paradigm that leverages unlabeled data, not designing a model architecture, we use the publicly available and computationally efficient RITnet (Chaudhary et al., 2019). Models are trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of eight images (four labeled and four unlabeled) for 250 epochs. The model with the best validation score is reported.
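An outline of the training loop under this setup (a hypothetical sketch: `guess_labels` and `ssl_da_loss` refer to the helpers sketched in Section 3, the loaders are assumed to yield four labeled and four unlabeled images per step, and the cap on the ramped weight is our assumption):

```python
import itertools
import torch

def train(model, labeled_loader, unlabeled_loader, augment, epochs=250):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(epochs):
        lam = min(0.02 * epoch, 1.0)  # ramped consistency weight (see below)
        for (x_lab, y_lab), x_unlab in zip(labeled_loader,
                                           itertools.cycle(unlabeled_loader)):
            y_guess = guess_labels(model, x_unlab, augment)  # no gradients
            loss = ssl_da_loss(model, x_lab, y_lab,
                               augment(x_unlab), y_guess, lam)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```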
In our experiments, the two unsupervised loss hyper-parameters $\lambda_1$ and $\lambda_2$ are linearly increasing weights with slopes of 0.02 and 0.002 per epoch, respectively. This loss-scheduling scheme initially gives prominence to the supervised loss; after a few epochs, the contribution of the unsupervised losses starts to increase.
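The schedule is a simple linear ramp per epoch; a minimal sketch (the cap value is our assumption, as no maximum is stated):

```python
def ramp(epoch: int, slope: float, cap: float = 1.0) -> float:
    """Linearly increasing loss weight, capped at `cap`."""
    return min(slope * epoch, cap)

lam1 = ramp(epoch=50, slope=0.02)   # weight for L_unsup1
lam2 = ramp(epoch=50, slope=0.002)  # weight for L_unsup2
```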
For domain-specific augmentation, we varied the contrast and luminance of the images. We selected random gamma-correction values in [0.8, 1.2] with a step size of 0.05, and, for CLAHE, a random clip parameter from (1.0, 1.2, 1.5, 1.5, 1.5, 2.0) and grid size from (2, 4, 8, 8, 8, 16). For the transform $T$, random rotation in [-5°, 5°] and translation in [-20, 20] pixels were applied with 50% and 80% probability, respectively. Note that images in the dataset varied mostly by translation, so greater importance was given to translation. Each image had a 50% chance of being transformed with $T$. Additionally, we used the basic augmentations proposed in (Chaudhary et al., 2019) for all images.
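Putting the sampling together, a sketch of how the augmentation parameters above could be drawn per image (pairing the CLAHE clip and grid lists by index is our assumption, suggested by their equal lengths; repeated entries weight the mid-range values):

```python
import random

gamma = random.choice([round(0.8 + 0.05 * i, 2) for i in range(9)])  # [0.8, 1.2]

clips = [1.0, 1.2, 1.5, 1.5, 1.5, 2.0]
grids = [2, 4, 8, 8, 8, 16]
idx = random.randrange(len(clips))
clip, grid = clips[idx], grids[idx]

if random.random() < 0.5:                 # 50% chance of applying T at all
    do_rotate = random.random() < 0.5     # rotation in [-5, 5] degrees
    do_translate = random.random() < 0.8  # translation in [-20, 20] pixels
```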
5. Results and Discussion
Our analysis addresses two essential questions. First, does the availability of a large unlabeled dataset assist learning from limited labeled data? Within this, we further ask: what is the minimum number of labeled images per subject needed to significantly improve performance? Second, does the model perform better when the labeled images come from a single person, or when they are drawn from multiple subjects? To answer these questions, we split our analysis into two parts: training with a varying number of labeled samples from multiple subjects, and training with labeled samples from a single subject. We also compare segmentation performance for essential eye features, such as the pupil and iris, as these are important for reliable gaze estimation.
5.1. Training with Multiple Subjects
In this setup, the models were trained on varying numbers of labeled images ($N_L$) and a fixed number of unlabeled images ($N_U$ = 8916) from multiple subjects. The models were evaluated on the fixed test set (2403 images). An equal number of images is selected from each subject when $N_L \geq 94$, and a random sample is chosen from different subjects when $N_L < 94$. In Fig. 2, accuracy increases (88.28% → 94.71%) as $N_L$ increases (4 → 940) when models are trained on labeled images only (hereafter the baseline $SL$). Similar behavior is observed with the $SSL_{DA}$ framework (92.42% → 94.67%). Moreover, in all cases with $N_L \leq 188$, $SSL_{DA}$ achieves improved performance compared to $SL$ (e.g., 4.69% for $N_L$ = 4 and 0.21% for $N_L$ = 94). This demonstrates that the SSL framework can be helpful for eye image segmentation. Furthermore, $SSL_{SS}$ improves performance over $SL$ (e.g., 4.83% for $N_L$ = 4 and 0.02% for $N_L$ = 94) and over $SSL_{DA}$ by up to a further 0.33%. This improvement demonstrates effective utilization of the unlabeled images. No clear benefit of SSL was observed for $N_L > 188$, which signifies that our proposed SSL frameworks are most suited to settings where only a small amount of labeled data is available.
Most of the variation in model performance is expected when $N_L$ is small. To verify this, we trained our model with different randomly selected subjects for $N_L$ = 4 and $N_L$ = 12. The variation in segmentation performance does not affect the incremental gains we observed from $SL$ to $SSL_{DA}$ and from $SSL_{DA}$ to $SSL_{SS}$ (Table 1).
| Subset | 4 images: $SL$ | $SSL_{DA}$ | $SSL_{SS}$ | % change | 12 images: $SL$ | $SSL_{DA}$ | $SSL_{SS}$ | % change |
|---|---|---|---|---|---|---|---|---|
| I | 87.42 | 91.59 | 92.04 | 5.28 | 91.78 | 93.21 | 93.36 | 1.72 |
| II | 88.28 | 92.42 | 92.54 | 4.83 | 91.52 | 92.81 | 93.05 | 1.67 |

Table 1. mIoU for two randomly selected subject subsets (I, II) with $N_L$ = 4 and 12; "% change" is the relative improvement of $SSL_{SS}$ over $SL$.

5.2. Training with a Single Subject
Here, we consider the effect of training the model using labeled images from a single subject and a fixed $N_U$ from multiple subjects. We randomly selected two subjects (P1 and P2) for this analysis. For each subject, we use all of their available labeled samples to train our model. The results are shown in Fig. 3 (left and right) as the percentage improvement from $SL$ to $SSL_{DA}$ and from $SSL_{DA}$ to $SSL_{SS}$, respectively. For P1, we further experiment with varying subsets of labeled images ($N_L$ = 4 and 12) from the single subject. Note that the addition of unlabeled images provided an additional boost of up to 1.31% compared to $SL$. The only cases where SSL did not outperform $SL$ were those trained on a large number of images of a single person (as shown in Fig. 3).
Compared to Section 5.1, where we trained with multiple subjects, training with a single subject severely degrades performance for $SL$ (e.g., 88.28% → 83.32% when trained on four labeled images). We believe the inherent variation among subjects helped when training with multiple subjects. However, in both SSL frameworks, the addition of unlabeled images reduces this gap (e.g., 92.54% → 92.20% for $SSL_{SS}$ when trained on four labeled images).

5.3. Eye part segmentation
In Table 2, we present the IoU scores separately for the pupil and iris classes. Behavior similar to Sections 5.1 and 5.2 is observed, suggesting that the accuracy of every class improves with these frameworks. Note that higher segmentation accuracy means a better chance of good ellipse fits and thus accurate gaze estimation. In Fig. 4, we present predicted segmentation masks for various test cases for qualitative comparison; spurious patches start to disappear as the number of labeled images increases or as unlabeled images are introduced with the SSL framework.
| $N_L$ | 4 | 12 | 24 | 48 | 4 (P1) | 12 (P1) | 61 (P1) |
|---|---|---|---|---|---|---|---|
| $SL$ | 88.25 (89.06) | 91.84 (92.87) | 92.31 (93.25) | 92.60 (93.55) | 88.00 (86.98) | 87.83 (88.22) | 90.92 (90.24) |
| $SSL_{DA}$ | 91.97 (92.75) | 92.50 (94.00) | 92.69 (94.01) | 92.69 (94.02) | 91.73 (92.84) | 91.63 (93.17) | 92.67 (93.77) |
| $SSL_{SS}$ | 91.85 (93.05) | 92.62 (94.14) | 92.81 (94.17) | 93.01 (94.28) | 91.87 (93.35) | 91.78 (93.62) | 92.86 (93.93) |
| % change | 4.08 (4.48) | 0.85 (1.37) | 0.54 (0.99) | 0.44 (0.78) | 4.40 (7.32) | 4.50 (6.12) | 2.13 (4.09) |

Table 2. Per-class IoU, reported as pupil (iris), for models trained on $N_L$ labeled images from multiple subjects and from a single subject (P1); "% change" is the relative improvement of $SSL_{SS}$ over $SL$.

6. Conclusion and Future Work
We present two SSL frameworks that leverage domain-specific augmentations and a pretext learning task that accounts for spatially varying transformations. With these, we demonstrate a substantial increase in segmentation performance from a small number of labeled images by exploiting the hidden relationships present in a large number of unlabeled images. The efficacy of the two frameworks is demonstrated on the OpenEDS-2019 dataset. This work opens an exciting direction for the eye-tracking community: focusing on variability across subjects rather than labeling a large number of images of a particular subject. In the future, we would like to investigate the effect of curating labeled datasets (e.g., considering eye position and blinks) instead of selecting them randomly in the SSL framework. Pre-trained models and source code will be made available.
References
- Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems. 5049–5059.
- Chaudhary et al. (2019) Aayush K Chaudhary, Rakshit Kothari, Manoj Acharya, Shusil Dangi, Nitinraj Nair, Reynold Bailey, Christopher Kanan, Gabriel Diaz, and Jeff B Pelz. 2019. RITnet: real-time semantic segmentation of the eye for gaze tracking. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 3698–3702.
- Everingham et al. (2005) Mark Everingham, Andrew Zisserman, Christopher KI Williams, Luc Van Gool, Moray Allan, Christopher M Bishop, Olivier Chapelle, Navneet Dalal, Thomas Deselaers, Gyuri Dorkó, et al. 2005. The 2005 pascal visual object classes challenge. In Machine Learning Challenges Workshop. Springer, 117–176.
- Garbin et al. (2019) Stephan J Garbin, Yiru Shen, Immo Schuetz, Robert Cavin, Gregory Hughes, and Sachin S Talathi. 2019. Openeds: Open eye dataset. arXiv preprint arXiv:1905.03702 (2019).
- Gyawali et al. (2020) Prashnna Kumar Gyawali, Sandesh Ghimire, Pradeep Bajracharya, Zhiyuan Li, and Linwei Wang. 2020. Semi-supervised Medical Image Classification with Global Latent Mixing. arXiv preprint arXiv:2005.11217 (2020).
- Gyawali et al. (2019) Prashnna Kumar Gyawali, Zhiyuan Li, Sandesh Ghimire, and Linwei Wang. 2019. Semi-supervised learning by disentangling and self-ensembling over stochastic latent space. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 766–774.
- Kalluri et al. (2019) Tarun Kalluri, Girish Varma, Manmohan Chandraker, and CV Jawahar. 2019. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 5259–5270.
- Kervadec et al. (2018) Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Éric Granger, Jose Dolz, and Ismail Ben Ayed. 2018. Boundary loss for highly unbalanced segmentation. arXiv preprint arXiv:1812.07032 (2018).
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
- Kolesnikov et al. (2019) Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1920–1929.
- Kothari et al. (2020b) Rakshit Kothari, Zhizhuo Yang, Christopher Kanan, Reynold Bailey, Jeff B Pelz, and Gabriel J Diaz. 2020b. Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities. Scientific reports 10, 1 (2020), 1–18.
- Kothari et al. (2020a) Rakshit S Kothari, Aayush K Chaudhary, Reynold J Bailey, Jeff B Pelz, and Gabriel J Diaz. 2020a. EllSeg: An Ellipse Segmentation Framework for Robust Gaze Tracking. arXiv preprint arXiv:2007.09600 (2020).
- Laine and Aila (2017) Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In ICLR.
- Ouali et al. (2020) Yassine Ouali, Céline Hudelot, and Myriam Tami. 2020. Semi-Supervised Semantic Segmentation with Cross-Consistency Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12674–12684.
- Palmero et al. (2020a) Cristina Palmero, Oleg V Komogortsev, and Sachin S Talathi. 2020a. Benefits of temporal information for appearance-based gaze estimation. arXiv preprint arXiv:2005.11670 (2020).
- Palmero et al. (2020b) Cristina Palmero, Abhishek Sharma, Karsten Behrendt, Kapil Krishnakumar, Oleg V Komogortsev, and Sachin S Talathi. 2020b. OpenEDS2020: Open Eyes Dataset. arXiv preprint arXiv:2005.03876 (2020).
- Park et al. (2020) Seonwook Park, Emre Aksan, Xucong Zhang, and Otmar Hilliges. 2020. Towards End-to-end Video-based Eye-Tracking. In European Conference on Computer Vision. Springer, 747–763.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.
- Shen et al. (2020) Yiru Shen, Oleg Komogortsev, and Sachin S Talathi. 2020. Domain Adaptation for Eye Segmentation. In European Conference on Computer Vision. Springer, 555–569.
- Wu et al. (2019) Zhengyang Wu, Srivignesh Rajendran, Tarrence Van As, Vijay Badrinarayanan, and Andrew Rabinovich. 2019. EyeNet: A Multi-Task Deep Network for Off-Axis Eye Gaze Estimation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 3683–3687.
- Yiu et al. (2019) Yuk-Hoi Yiu, Moustafa Aboulatta, Theresa Raiser, Leoni Ophey, Virginia L Flanagin, Peter zu Eulenburg, and Seyed-Ahmad Ahmadi. 2019. DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning. Journal of neuroscience methods 324 (2019), 108307.
- Zhou et al. (2003) Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with local and global consistency. Advances in neural information processing systems 16 (2003), 321–328.
- Zuiderveld (1994) Karel Zuiderveld. 1994. Contrast Limited Adaptive Histogram Equalization. In Graphics Gems. https://doi.org/10.1016/b978-0-12-336156-1.50061-6