
Semi-Supervised Learning for Eye Image Segmentation

Aayush K. Chaudhary (akc5959@rit.edu), Prashnna K. Gyawali (pkg2182@rit.edu), Linwei Wang, and Jeff B. Pelz
Rochester Institute of Technology, 54 Lomb Memorial Drive, Rochester, New York, USA 14623
Abstract.

Recent advances in appearance-based models have improved eye-tracking performance in difficult scenarios such as occlusion due to eyelashes, eyelids, or camera placement, and environmental reflections on the cornea and glasses. A key reason for the improvement is the accurate and robust identification of eye parts (pupil, iris, and sclera regions). This improved accuracy often comes at the cost of labeling an enormous dataset, which is complex and time-consuming. This work presents two semi-supervised learning frameworks that identify eye parts by taking advantage of unlabeled images when labeled data are scarce. Leveraging domain-specific augmentations and novel spatially varying transformations for image segmentation, these frameworks show improved performance on various test cases. For instance, for a model trained on just 48 labeled images, they achieve improvements of 0.38% and 0.65% in segmentation performance over the baseline model trained only with the labeled dataset.

semi-supervised learning, segmentation, eye-segmentation, eye-tracking, gaze-tracking, AR/VR
CCS Concepts: Computing methodologies → Semi-supervised learning settings; Image segmentation; Unsupervised learning

1. Introduction

Effective gaze tracking can improve human-machine interaction by giving clues to users’ behaviors and intentions, and it can enhance the experience of virtual, augmented, and mixed reality devices by supporting efficient and realistic rendering, such as foveated rendering and multi-focal displays that minimize vergence-accommodation conflicts (Wu et al., 2019; Garbin et al., 2019). Recent advances in appearance-based gaze estimation (Wu et al., 2019; Palmero et al., 2020a; Park et al., 2020) have shown robustness in person-independent scenarios and under environmental reflections on the cornea and eyeglasses, occlusion due to eyelashes and camera placement, and heavy eye makeup. A backbone of most appearance-based models is the proper identification/segmentation of different parts of the eye. Recently, utilizing advances in deep learning, Yiu et al. (2019); Chaudhary et al. (2019); Kothari et al. (2020a); Wu et al. (2019) proposed different models to segment various eye parts accurately in challenging cases.

The success of these methods is often predicated on the availability of large, curated training datasets of eye images with well-annotated labels. Such requirements are, however, difficult to satisfy in general. Acquiring a dataset for eye image segmentation requires a highly sophisticated and costly environment along with significant (and often tedious) effort by experts. Even then, the dataset’s quality may be limited by factors such as labeler bias and inconsistency (Kothari et al., 2020b). As such, the acquisition process is difficult, if not impossible, for academic labs, and only a handful of commercially funded labs have attempted to curate large datasets for eye image segmentation (Garbin et al., 2019; Palmero et al., 2020b). In this work, we utilize semi-supervised learning (SSL), which greatly diminishes the requirement for labeled data by leveraging relatively easy-to-acquire unlabeled datasets.

SSL leverages unlabeled data to improve the learning from a small labeled dataset (Zhou et al., 2003). The primary goal of SSL algorithms is to avoid over-fitting of the model’s parameters to a small amount of labeled data. Formally, in SSL, a dataset $\mathcal{X}=\{x_1, x_2, \ldots, x_n\}$ is given, among which only the first $k$ instances (images) have annotations $\{y_1, \ldots, y_k\} \in \mathcal{Y}$, and the remaining instances are unlabeled. While learning the function $f:\mathcal{X}\rightarrow\mathcal{Y}$, SSL exploits the hidden relationships within the data to predict the labels of the unlabeled data points. One of the common inductive biases used to regularize SSL algorithms is the assumption of consistency of the network function. Consistency refers to the requirement that data points, or their representations, should receive the same label predictions even after perturbation. Various deep learning-based works use this idea to perturb either the data (Laine and Aila, 2017) or their hidden representations (Gyawali et al., 2019) and constrain the label predictions to be similar. These algorithms have demonstrated improved generalization in various domains, such as image classification in computer vision and medical imaging. In semantic segmentation, however, this simple yet effective assumption of consistency is violated: the output or label space accounts for the spatial position of pixels in the input space, so even a small perceptual perturbation of the input can invalidate consistency in the output space. Thus, recent progress in the SSL literature has been mostly confined to classification, and very few works have explored SSL for semantic segmentation (Ouali et al., 2020). For the application of eye segmentation, the use of SSL is even more limited.

This work first presents an SSL approach that utilizes consistency training for semantic segmentation of eye images. We use domain-specific augmentations that do not affect pixel positions to perturb the input eye images and enforce consistency on the resulting model predictions. Following this, we present a novel SSL method that borrows an idea from self-supervised learning (Kolesnikov et al., 2019), formulating a new learning task, to strengthen the effect of regularization and improve generalization. This method also allows us to use commonly prescribed spatially varying augmentations, such as image rotation and translation, while training the SSL method. We test these two SSL frameworks on a publicly available real eye segmentation dataset (OpenEDS-2019 (Garbin et al., 2019)). We compare the presented methods’ performance against a baseline trained only with the labeled dataset. We further investigate the quality of the presented methods for segmenting different eye parts, including the iris and pupil. Finally, we present comparison studies on training the model with labeled data from a single individual versus a group of individuals, evaluated on a fixed test set. In summary, the contributions of this work include:

  • Adaptation of SSL setup for segmenting eye regions with domain-specific augmentations.

  • A novel SSL framework to leverage spatially varying augmentations for semantic segmentation.

  • Demonstration that a small number of labeled images and a large number of unlabeled images can significantly improve eye image segmentation.

2. Related Work

A number of recent efforts have been made toward semantic segmentation of eye images to obtain the pupil, iris, and sclera regions (Wu et al., 2019; Kothari et al., 2020a; Chaudhary et al., 2019). Wu et al. (2019) consider multiple heterogeneous tasks related to gaze estimation, including semantic segmentation, and propose an end-to-end deep learning method. Similarly, Kothari et al. (2020a) proposed a technique to improve pupil/iris center detection using ellipse segmentation instead of segmenting only the visible eye parts, and Chaudhary et al. (2019) proposed a computationally efficient architecture to segment eye regions. Unlike those works, which used large annotated image datasets to guide their learning process for segmentation of eye images, our work considers only a small amount of labeled data.

Semi-supervised learning (SSL) is one of the widely studied topics in machine learning. Consistency regularization is commonly used in SSL algorithms to exploit the hidden relationship between labeled and unlabeled data (Zhou et al., 2003). Laine and Aila (2017) and Gyawali et al. (2019) used this idea of consistency in deep learning-based work and demonstrated improved performance in image classification tasks. However, these works cannot be applied directly to semantic segmentation, as they enforce consistency in the output space while perturbing pixels in the input space. In recent years, limited attempts have been made toward semi-supervised semantic segmentation (Ouali et al., 2020; Kalluri et al., 2019). For instance, Kalluri et al. (2019) used pixel-level entropy regularization to train their semantic segmentation architecture. While some other related work has explored SSL for eye image segmentation, those studies (e.g., (Shen et al., 2020)) considered extra information in the form of domain labels or used a large number of labeled samples.

Figure 1. Schematic diagram of the proposed SSL methods. For each unlabeled data point, labels are guessed for $A$ separate copies generated via (a) SSL with domain-specific augmentation and (b) SSL with a self-supervised approach. In (c), the supervised loss is computed for labeled data and the unsupervised loss for a combination of labeled and unlabeled data in both types of SSL methods.

3. Methods

This section formulates the problem for the SSL framework. Following this, we present two methods: SSL with domain-specific augmentation and a novel SSL with self-supervised learning.

3.1. Problem formulation

We consider a dataset with $k$ labeled training examples $\mathcal{X}_l$ with corresponding labels $\mathcal{Y}_l$, and $m$ unlabeled training samples $\mathcal{X}_u$, where $k < m$. Since our task is segmentation, the data space is represented by $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$, where $H \times W$ is the spatial dimension and $C$ is the number of input channels. Similarly, the label space is represented by $\mathcal{Y} \in \mathbb{R}^{P \times H \times W}$, where $P$ is the number of classes. We aim to learn parameters $\theta$ for the mapping function $f:\mathcal{X}\rightarrow\mathcal{Y}$, approximated via a deep neural network.

3.2. $\mathcal{SSL}_D$: SSL with domain-specific augmentation

Here, we present the SSL paradigm with domain-specific augmentation. To ensure effective utilization of consistency regularization, we employ augmentations that do not change the image’s spatial positions. Prior SSL works utilize variations in noise perturbations (Ouali et al., 2020). In our observation, eye images from the same hardware setup vary mostly in eye shape, skin/iris pigmentation, lighting conditions, gaze direction, eye position with respect to the camera, and blinks. Among these, variations in the contrast/luminance of eye images are referred to as domain-specific augmentations in this paper. Techniques like Contrast Limited Adaptive Histogram Equalization (CLAHE) (Zuiderveld, 1994) and gamma correction have been shown to be advantageous for eye part segmentation (Chaudhary et al., 2019). We leverage these domain-specific augmentations to perform label guessing in our SSL paradigm, as sketched below.
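
For concreteness, the following is a minimal sketch (not the authors’ released code) of such a domain-specific augmentation, combining gamma correction and CLAHE via OpenCV; the parameter choices loosely follow the values reported later in Section 4.2.

```python
# A minimal sketch of domain-specific augmentation: gamma correction and
# CLAHE perturb contrast/luminance without moving any pixel positions.
import random
import numpy as np
import cv2


def domain_augment(img_u8: np.ndarray) -> np.ndarray:
    """img_u8: single-channel uint8 eye image of shape (H, W)."""
    # Gamma correction with a randomly chosen exponent in [0.8, 1.2].
    gamma = random.choice(np.arange(0.8, 1.25, 0.05))
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    out = cv2.LUT(img_u8, lut)

    # CLAHE with a randomly chosen (clip limit, tile grid size) pair.
    clip, grid = random.choice([(1.0, 2), (1.2, 4), (1.5, 8), (2.0, 16)])
    clahe = cv2.createCLAHE(clipLimit=clip, tileGridSize=(grid, grid))
    return clahe.apply(out)
```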

3.2.1. Guessing Labels

Using the domain augmentation strategies, we create $A$ separate copies of each data point and estimate their labels, as shown schematically in Fig. 1 (a). We compute the guessed label $y_u$ as the average of the model’s predictions (softmax probabilities) over the augmented copies of each data point $x_u$:

(1) $y_u = \frac{1}{A}\sum_{a=1}^{A} f(x_{u,a};\theta)$

Although label guessing in this manner is commonly applied to unlabeled datasets only (Berthelot et al., 2019), we find combining both labeled and unlabeled datasets to be beneficial. As such, in our case, any unlabeled data point $x_u$ is a random sample from the combination of both $\mathcal{X}_l$ and $\mathcal{X}_u$. This way of guessing the label encourages the model to be consistent across different augmentations (Gyawali et al., 2020). Following standard practice, we do not propagate gradients through the computation of the guessed labels; note, however, that the guessed labels may change as the segmentation network $f$ is updated over training.
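
A sketch of this label-guessing step is given below, assuming a PyTorch segmentation network `model` that returns per-pixel logits of shape (B, P, H, W) and a hypothetical `augment` function that applies the domain-specific augmentation above to a batch of image tensors.

```python
# Label guessing per Eq. (1): average the softmax maps over A augmented copies.
import torch


@torch.no_grad()  # no gradients flow through the guessed labels
def guess_labels(model, x_u, augment, A=2):
    """x_u: batch of (labeled + unlabeled) images, shape (B, C, H, W)."""
    probs = []
    for _ in range(A):
        x_aug = augment(x_u)                       # pixel positions unchanged
        probs.append(torch.softmax(model(x_aug), dim=1))
    return torch.stack(probs).mean(dim=0)          # y_u of shape (B, P, H, W)
```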

3.2.2. SSL Objective

We now have guessed labels for $\mathcal{X}_u$ and true labels for $\mathcal{X}_l$. From $\mathcal{X}_l$, we randomly sample a data-target pair $(x_{l1}, y_{l1})$ to calculate the supervised loss. The supervised objective is $\mathcal{L}_{sup}=\mathcal{L}_{CEL}(\lambda_1+\lambda_2\mathcal{L}_{BAL})+\lambda_3\mathcal{L}_{SL}$, where $\mathcal{L}_{CEL}$, $\mathcal{L}_{BAL}$, and $\mathcal{L}_{SL}$ are, respectively, the cross-entropy loss, the boundary aware loss (Ronneberger et al., 2015), and the surface loss (Kervadec et al., 2018). For $\mathcal{X}_u$, we consider an unsupervised loss (referred to hereafter as $\mathcal{L}_u$), the L2 loss computed between the predicted softmax probabilities and the guessed label $y_u$, as it is less susceptible to wrong predictions (Berthelot et al., 2019; Laine and Aila, 2017). As shown in Fig. 1 (c), we sample from the combination of labeled and unlabeled datasets and consider all the augmented copies (Fig. 1 (c) depicts the case $A=2$) to calculate $\mathcal{L}_u$. Since both data points $x_2$ and $x_2^{\prime}$ share the same guessed label $y_2$, minimizing the unsupervised loss acts as consistency regularization. The overall SSL loss combines $\mathcal{L}_{sup}$ and $\mathcal{L}_u$, with $\lambda_u$ weighting the two loss terms: $\mathcal{L}=\mathcal{L}_{sup}+\lambda_u\mathcal{L}_u$.
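
As an illustration only, the combined objective could be assembled roughly as follows; `boundary_aware_loss` and `surface_loss` are hypothetical stand-ins for the cited loss terms (assumed to return per-pixel weights and a scalar, respectively), and F.cross_entropy plays the role of $\mathcal{L}_{CEL}$.

```python
# Sketch of the SSL_D objective: supervised loss on labeled data plus an
# L2 consistency loss against the guessed labels on augmented copies.
import torch
import torch.nn.functional as F


def ssl_d_loss(model, x_l, y_l, x_mix_aug, y_guessed,
               lambdas=(1.0, 20.0, 1.0), lambda_u=1.0):
    """boundary_aware_loss and surface_loss are hypothetical helpers."""
    l1, l2, l3 = lambdas
    logits_l = model(x_l)
    # L_sup = L_CEL * (lambda1 + lambda2 * L_BAL) + lambda3 * L_SL
    cel = F.cross_entropy(logits_l, y_l, reduction="none")            # per-pixel CEL
    sup = (cel * (l1 + l2 * boundary_aware_loss(logits_l, y_l))).mean() \
          + l3 * surface_loss(logits_l, y_l)

    # L_u: consistency between predictions on augmented copies and guessed labels.
    probs = torch.softmax(model(x_mix_aug), dim=1)
    unsup = F.mse_loss(probs, y_guessed)

    return sup + lambda_u * unsup
```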

3.3. $\mathcal{SSL}_{ss}$: SSL with self-supervised learning

In this section, we propose a novel SSL method for semantic segmentation that utilizes an idea from self-supervised learning (Kolesnikov et al., 2019). Self-supervised learning uses unlabeled data to formulate a pretext learning task, such as predicting context, for which a target objective can be computed without supervision (Kolesnikov et al., 2019). In our case, the pretext task is predicting labels by inverting the model’s prediction on a spatially transformed image (Fig. 1 (b)). This also lets us use widely adopted image transformations, including rotation and translation, which we could not use in Section 3.2. Note that these image transformations are important to account for variations in the relative motion of the eye with respect to the camera. We now describe how we perform label guessing, which is the primary difference from $\mathcal{SSL}_D$.

3.3.1. Guessing Labels

We first use the same domain augmentation strategies as in 3.2.1 to create $A$ separate copies of each data point $x_u$. To each of these augmented data points, with probabilities $p_1$ and $p_2$, we further apply image transformation-based augmentations of rotation and translation, respectively. Collectively, we use $\mathcal{T}$ to denote these image transformations. We pass the transformed copies through our segmentation network to obtain the corresponding labels and then apply the inverse transform $\mathcal{T}^{-1}$ to bring them back into the spatial space of the original data points. Finally, we compute the guessed label $y_u$ as the average of the inverse-transformed model predictions over all $A$ copies of data point $x_u$:

(2) $y_u = \frac{1}{A}\sum_{a=1}^{A}\mathcal{T}^{-1}\left(f(\mathcal{T}(x_{u,a});\theta)\right)$

This is represented schematically in Fig. 1 (b). Since the model’s predictions must now be invariant to both the domain-specific augmentations and the image transformations, this method of guessing the labels adds a more potent form of regularization than the one presented in 3.2.1.
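
Assuming rotation and translation are realized with torchvision’s functional affine transform (recent versions accept batched image tensors), the label guessing in Eq. (2) could be sketched as follows; the angles, offsets, and probabilities are illustrative and mirror the values reported in Section 4.2.

```python
# Label guessing per Eq. (2): transform, predict, invert the transform, average.
import random
import torch
import torchvision.transforms.functional as TF


@torch.no_grad()
def guess_labels_ss(model, x_u, augment, A=2, p_rot=0.5, p_trans=0.8):
    probs = []
    for _ in range(A):
        x_aug = augment(x_u)                           # domain-specific augmentation
        angle = random.uniform(-5, 5) if random.random() < p_rot else 0.0
        dx = random.randint(-20, 20) if random.random() < p_trans else 0
        dy = random.randint(-20, 20) if random.random() < p_trans else 0

        # Apply T: rotation about the image center, then translation.
        x_t = TF.affine(x_aug, angle=angle, translate=[0, 0], scale=1.0, shear=0.0)
        x_t = TF.affine(x_t, angle=0.0, translate=[dx, dy], scale=1.0, shear=0.0)
        p = torch.softmax(model(x_t), dim=1)

        # Apply T^{-1} to the prediction: undo translation, then undo rotation.
        p = TF.affine(p, angle=0.0, translate=[-dx, -dy], scale=1.0, shear=0.0)
        p = TF.affine(p, angle=-angle, translate=[0, 0], scale=1.0, shear=0.0)
        probs.append(p)
    return torch.stack(probs).mean(dim=0)
```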

3.3.2. SSL Objective

Similar to 3.2.2 and as shown schematically in Fig. 1 (c), the objective combines the supervised loss ($\mathcal{L}_{sup}$) and the unsupervised losses ($\mathcal{L}_u$, $\mathcal{L}_{ss}$). $\mathcal{L}_u$ is the L2 loss between the predicted softmax probabilities of the domain-augmented data and the guessed label from 3.2.1, and $\mathcal{L}_{ss}$ is the L2 loss between the same predictions and the guessed label from 3.3.1. The overall objective is $\mathcal{L}=\mathcal{L}_{sup}+\lambda_u\mathcal{L}_u+\lambda_{ss}\mathcal{L}_{ss}$, where $\lambda_u$ and $\lambda_{ss}$ are the corresponding weighting terms.
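
Reusing the hypothetical helpers from the sketches above (here `supervised_loss` stands in for the supervised terms of Section 3.2.2), the $\mathcal{SSL}_{ss}$ objective differs from $\mathcal{SSL}_D$ only by the extra consistency term against the transform-inverted guesses.

```python
# Sketch of the SSL_ss objective: supervised loss plus two consistency terms.
import torch
import torch.nn.functional as F


def ssl_ss_loss(model, x_l, y_l, x_mix_aug, y_guess_d, y_guess_ss,
                lambda_u=1.0, lambda_ss=1.0):
    sup = supervised_loss(model(x_l), y_l)             # hypothetical helper (Sec. 3.2.2)
    probs = torch.softmax(model(x_mix_aug), dim=1)     # predictions on domain-augmented copies
    l_u = F.mse_loss(probs, y_guess_d)                 # vs. guesses from Eq. (1)
    l_ss = F.mse_loss(probs, y_guess_ss)               # vs. guesses from Eq. (2)
    return sup + lambda_u * l_u + lambda_ss * l_ss
```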

4. Experiments

4.1. Datasets and Evaluation

We evaluated our proposed pipeline on the OpenEDS-2019 dataset. OpenEDS-2019 has well-defined train ($n=8916$), validation (2403), and test (1440) sets, where labels for the test set are not publicly available. We therefore evaluate our models on the validation set; qualitative results on the test set are provided in the supplementary video. Note that OpenEDS-2019 contains eye images from 94/28 (train/validation) different participants, with multiple images per participant. The splits and test cases are described further in Section 5. All images were resized to $240\times 320$ for faster computation. Models are tested on the fixed test set. Segmentation performance is evaluated with the standard mean Intersection over Union (IoU) score (Everingham et al., 2005).
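
For reference, a minimal sketch of the per-image mean IoU over the segmentation classes (assuming the four OpenEDS classes: background, sclera, iris, pupil) might look like this, computed from integer prediction and ground-truth masks.

```python
# Minimal mean-IoU sketch: per-class intersection over union, averaged.
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 4) -> float:
    """pred, gt: integer class masks of shape (H, W)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```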

4.2. Implementation Details

As the primary purpose of this paper is designing an SSL paradigm to leverage an unlabeled dataset rather than designing a model architecture, we adopt the publicly available, computationally efficient RITnet (Chaudhary et al., 2019). Models are trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of eight images (four labeled and four unlabeled) for 250 epochs. The model with the best validation score is reported.
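
A skeletal training loop under this setup, wired together with hypothetical data loaders and the loss/label-guessing sketches from Section 3, could look roughly as follows; this illustrates the flow only and is not the authors’ training code.

```python
# Rough SSL_D training-loop sketch: Adam, lr 1e-3, 4 labeled + 4 unlabeled
# images per batch, 250 epochs; model construction details are omitted.
import itertools
import torch

model = build_segmentation_model()   # e.g., RITnet; constructor details omitted
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(250):
    lambda_u = 0.02 * epoch   # linearly increasing unsupervised weight (Sec. 4.2)
    # One plausible pairing of loaders: cycle the small labeled loader.
    for x_u, (x_l, y_l) in zip(unlabeled_loader, itertools.cycle(labeled_loader)):
        x_mix = torch.cat([x_l, x_u])                    # labeled + unlabeled batch
        y_guess = guess_labels(model, x_mix, augment, A=2)   # augment: tensor-level
        loss = ssl_d_loss(model, x_l, y_l, augment(x_mix), y_guess,
                          lambda_u=lambda_u)
        opt.zero_grad()
        loss.backward()
        opt.step()
```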

In our experiments, we used $\lambda_1=1$, $\lambda_2=20$, and $\lambda_3=1$. The two unsupervised loss weights $\lambda_u$ and $\lambda_{ss}$ increase linearly with slopes of 0.02 and 0.002 per epoch, respectively. This loss scheduling scheme initially gives prominence to the supervised loss, and after a few epochs the contribution of the unsupervised losses starts to increase.
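
The schedule can be expressed as a small helper; only the slopes are stated above, so any cap on the weights would be an additional assumption.

```python
# Linearly increasing unsupervised loss weights (slopes per Sec. 4.2);
# in practice one might cap these, but no cap is stated in the text.
def unsup_weights(epoch: int):
    return 0.02 * epoch, 0.002 * epoch    # (lambda_u, lambda_ss)
```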

For domain-specific augmentation, we varied the contrast and luminance of the images. We selected random gamma-correction values in [0.8, 1.2] with a step size of 0.05, and for CLAHE we drew the clip parameter from (1.0, 1.2, 1.5, 1.5, 1.5, 2.0) with corresponding grid sizes (2, 4, 8, 8, 8, 16). For the $\mathcal{T}$ transform, random rotations in [-5, 5] degrees and translations in [-20, 20] pixels were applied with 50% and 80% probability, respectively. Note that the images in the dataset varied mostly by translation, so translation was given higher importance. Each image had a 50% chance of being transformed with $\mathcal{T}$. Additionally, we used the basic augmentations proposed in (Chaudhary et al., 2019) for all images.

5. Results and Discussion

Our analysis of the results addresses two essential questions. First, does the availability of a large unlabeled dataset assist learning from limited labeled data? Within this, we further ask what the minimum number of labeled images per subject is that our pipeline needs to significantly improve performance. Second, does the model perform better when the labeled images come from a single person or when they are drawn from multiple subjects? To answer these questions, we split our analysis into two parts: training with a varying number of labeled samples from multiple subjects and from a single subject. We also compare the segmentation performance for essential eye features such as the pupil and iris, as these are important for reliable gaze estimation.

5.1. Training with Multiple Subjects

In this setup, the models were trained on varying numbers of labeled images ($\mathcal{X}_l$) and a fixed number of unlabeled images ($\mathcal{X}_u = 8916$) from multiple subjects. The models were evaluated on a fixed test set (2403 images). An equal number of images is selected from each subject when $\mathcal{X}_l \geq 94$, and a random sample is chosen from different subjects when $\mathcal{X}_l \leq 94$. In Fig. 2, the accuracy increases (88.28% $\rightarrow$ 94.71%) as $\mathcal{X}_l$ increases (4 $\rightarrow$ 940) when models are trained on labeled images only ($\mathcal{S}_L$). Similar behavior is observed with the $\mathcal{SSL}_D$ framework (92.42% $\rightarrow$ 94.67%). However, in all cases with $\mathcal{X}_l \leq 188$, $\mathcal{SSL}_D$ achieves improved performance compared to training with $\mathcal{S}_L$ (e.g., 4.69% for $\mathcal{X}_l=4$ and 0.21% for $\mathcal{X}_l=94$). This demonstrates that the SSL framework can be helpful for eye image segmentation. Furthermore, $\mathcal{SSL}_{SS}$ improves the performance over $\mathcal{S}_L$ (e.g., 4.83% for $\mathcal{X}_l=4$ and 0.02% for $\mathcal{X}_l=94$) and over $\mathcal{SSL}_D$ by up to a further 0.33%. This improvement demonstrates effective utilization of unlabeled images. No clear benefit of SSL was observed for $\mathcal{X}_l \geq 188$, which indicates that our proposed SSL frameworks are best suited when only a small amount of labeled data is available.

Most of the variation in model performance is expected when the number of labeled images $\mathcal{X}_l$ is small. To verify this, we trained our model with different randomly selected subjects for $\mathcal{X}_l$ of 4 and 12. The variation in segmentation performance does not affect the incremental gain we observed from $\mathcal{S}_L$ to $\mathcal{SSL}_D$ and from $\mathcal{SSL}_D$ to $\mathcal{SSL}_{SS}$ (Table 1).

Table 1. Model performance for two subsets (along rows) when $\mathcal{X}_l$ is 4 and 12, shown for the three frameworks ($\mathcal{S}_L$, $\mathcal{SSL}_D$, and $\mathcal{SSL}_{SS}$). For each subset, % change represents the improvement from $\mathcal{S}_L$ to $\mathcal{SSL}_{SS}$.
                          4 images                                12 images
Subset    S_L      SSL_D    SSL_SS   % change       S_L      SSL_D    SSL_SS   % change
I         87.42    91.59    92.04    5.28           91.78    93.21    93.36    1.72
II        88.28    92.42    92.54    4.83           91.52    92.81    93.05    1.67
Figure 2. IoU scores for the three cases ($\mathcal{S}_L$: blue, $\mathcal{SSL}_D$: green, and $\mathcal{SSL}_{SS}$: red) with varying numbers of $\mathcal{X}_l$ and fixed $\mathcal{X}_u$. The numbers alongside the arrows indicate the respective improvement (in %) over $\mathcal{S}_L$.

5.2. Training with a Single Subject

Here, we consider the effect of training the model using $\mathcal{X}_l$ from a single subject and a fixed $\mathcal{X}_u$ from multiple subjects. We randomly selected two subjects (P1 and P2) for this analysis. For each subject, we use all the available samples as $\mathcal{X}_l$ to train our model. The results are shown in Fig. 3 (left and right) as the percentage improvement going from $\mathcal{S}_L$ to $\mathcal{SSL}_D$ and from $\mathcal{SSL}_D$ to $\mathcal{SSL}_{SS}$, respectively. For P1, we further experiment with varying subsets of $\mathcal{X}_l$ (4 and 12) from the single subject. Note that $\mathcal{SSL}_{SS}$ provided an additional boost of up to 1.31% over $\mathcal{SSL}_D$. The only cases where $\mathcal{SSL}_{SS}$ did not outperform $\mathcal{SSL}_D$ were when the model was trained on a large number of images of a single person (as shown in Fig. 3).

Compared to Section 5.1, where we trained with multiple subjects, training with a single subject degrades the performance of $\mathcal{S}_L$ severely (e.g., 88.28% $\rightarrow$ 83.32% when trained on $\mathcal{X}_l=4$). We believe the inherent variation among subjects helped when training with multiple subjects. However, with the addition of $\mathcal{X}_u$, both SSL frameworks reduce this gap (e.g., 92.54% $\rightarrow$ 92.20% when trained on $\mathcal{X}_l=4$ for $\mathcal{SSL}_{SS}$).

Figure 3. Improvement (in %) from $\mathcal{S}_L$ to $\mathcal{SSL}_D$ (left) and from $\mathcal{SSL}_D$ to $\mathcal{SSL}_{SS}$ (right) when models are trained on two subjects (red and green). For P1 (green), we further test the change in model performance for various subsets of $\mathcal{X}_l$.

5.3. Eye part segmentation

In Table 2, we present the IoU scores separately for the pupil and iris classes. Behavior similar to Sections 5.1 and 5.2 is observed, suggesting that the accuracy of every class is improved with these frameworks. Note that higher segmentation accuracy implies a better chance of good ellipse fits and thus more accurate gaze estimation. In Fig. 4, we present predicted segmentation masks for various test cases for qualitative comparison. The spurious patches start to disappear as the number of labeled images increases or when unlabeled images are introduced through the SSL frameworks.

Table 2. Comparison of pupil and iris (in parentheses) class IoU scores for the three frameworks ($\mathcal{S}_L$, $\mathcal{SSL}_D$, and $\mathcal{SSL}_{SS}$) for varying numbers of $\mathcal{X}_l$ and a fixed number of $\mathcal{X}_u$. P1 indicates samples from a single subject. Bold values indicate the best performance within each test case (along the column). % change represents the improvement from $\mathcal{S}_L$ to $\mathcal{SSL}_{SS}$.

X_l        4              12             24             48             4 (P1)         12 (P1)        61 (P1)
S_L        88.25 (89.06)  91.84 (92.87)  92.31 (93.25)  92.60 (93.55)  88.00 (86.98)  87.83 (88.22)  90.92 (90.24)
SSL_D      91.97 (92.75)  92.50 (94.00)  92.69 (94.01)  92.69 (94.02)  91.73 (92.84)  91.63 (93.17)  92.67 (93.77)
SSL_SS     91.85 (93.05)  92.62 (94.14)  92.81 (94.17)  93.01 (94.28)  91.87 (93.35)  91.78 (93.62)  92.86 (93.93)
% change   4.08 (4.48)    0.85 (1.37)    0.54 (0.99)    0.44 (0.78)    4.40 (7.32)    4.50 (6.12)    2.13 (4.09)

Figure 4. Four samples from the test set with their corresponding ground truth and network predictions for a number of cases, shown in adjacent blocks. When models are trained with $\mathcal{S}_L$, prediction confidence improves and unwanted spurious patches are reduced as the number of labeled images increases. For the SSL approaches, the prediction confidence is high even when a small number of $\mathcal{X}_l$ is used. No significant difference is visible between the two SSL approaches, which mostly vary in finer details.

6. Conclusion and Future Work

We present two SSL frameworks that leverage domain-specific augmentations and a pretext learning task that accounts for spatially varying transformations. With these, we demonstrate a substantial increase in segmentation performance from a small number of labeled images by exploiting the hidden relationships present in a large number of unlabeled images. The efficacy of the two frameworks is demonstrated on the OpenEDS-2019 dataset. This paper opens an exciting direction for the eye-tracking community: focusing on the variability across subjects rather than labeling a large number of images of a particular subject. In the future, we would like to investigate the effect of curating labeled datasets (e.g., considering eye position and blinks) instead of random selection in the SSL framework. Pre-trained models and source code will be made available.

References

  • Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems. 5049–5059.
  • Chaudhary et al. (2019) Aayush K Chaudhary, Rakshit Kothari, Manoj Acharya, Shusil Dangi, Nitinraj Nair, Reynold Bailey, Christopher Kanan, Gabriel Diaz, and Jeff B Pelz. 2019. RITnet: real-time semantic segmentation of the eye for gaze tracking. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 3698–3702.
  • Everingham et al. (2005) Mark Everingham, Andrew Zisserman, Christopher KI Williams, Luc Van Gool, Moray Allan, Christopher M Bishop, Olivier Chapelle, Navneet Dalal, Thomas Deselaers, Gyuri Dorkó, et al. 2005. The 2005 pascal visual object classes challenge. In Machine Learning Challenges Workshop. Springer, 117–176.
  • Garbin et al. (2019) Stephan J Garbin, Yiru Shen, Immo Schuetz, Robert Cavin, Gregory Hughes, and Sachin S Talathi. 2019. Openeds: Open eye dataset. arXiv preprint arXiv:1905.03702 (2019).
  • Gyawali et al. (2020) Prashnna Kumar Gyawali, Sandesh Ghimire, Pradeep Bajracharya, Zhiyuan Li, and Linwei Wang. 2020. Semi-supervised Medical Image Classification with Global Latent Mixing. arXiv preprint arXiv:2005.11217 (2020).
  • Gyawali et al. (2019) Prashnna Kumar Gyawali, Zhiyuan Li, Sandesh Ghimire, and Linwei Wang. 2019. Semi-supervised learning by disentangling and self-ensembling over stochastic latent space. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 766–774.
  • Kalluri et al. (2019) Tarun Kalluri, Girish Varma, Manmohan Chandraker, and CV Jawahar. 2019. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 5259–5270.
  • Kervadec et al. (2018) Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Éric Granger, Jose Dolz, and Ismail Ben Ayed. 2018. Boundary loss for highly unbalanced segmentation. arXiv preprint arXiv:1812.07032 (2018).
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kolesnikov et al. (2019) Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 1920–1929.
  • Kothari et al. (2020b) Rakshit Kothari, Zhizhuo Yang, Christopher Kanan, Reynold Bailey, Jeff B Pelz, and Gabriel J Diaz. 2020b. Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities. Scientific reports 10, 1 (2020), 1–18.
  • Kothari et al. (2020a) Rakshit S Kothari, Aayush K Chaudhary, Reynold J Bailey, Jeff B Pelz, and Gabriel J Diaz. 2020a. EllSeg: An Ellipse Segmentation Framework for Robust Gaze Tracking. arXiv preprint arXiv:2007.09600 (2020).
  • Laine and Aila (2017) Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In ICLR.
  • Ouali et al. (2020) Yassine Ouali, Céline Hudelot, and Myriam Tami. 2020. Semi-Supervised Semantic Segmentation with Cross-Consistency Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12674–12684.
  • Palmero et al. (2020a) Cristina Palmero, Oleg V Komogortsev, and Sachin S Talathi. 2020a. Benefits of temporal information for appearance-based gaze estimation. arXiv preprint arXiv:2005.11670 (2020).
  • Palmero et al. (2020b) Cristina Palmero, Abhishek Sharma, Karsten Behrendt, Kapil Krishnakumar, Oleg V Komogortsev, and Sachin S Talathi. 2020b. OpenEDS2020: Open Eyes Dataset. arXiv preprint arXiv:2005.03876 (2020).
  • Park et al. (2020) Seonwook Park, Emre Aksan, Xucong Zhang, and Otmar Hilliges. 2020. Towards End-to-end Video-based Eye-Tracking. In European Conference on Computer Vision. Springer, 747–763.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
  • Shen et al. (2020) Yiru Shen, Oleg Komogortsev, and Sachin S Talathi. 2020. Domain Adaptation for Eye Segmentation. In European Conference on Computer Vision. Springer, 555–569.
  • Wu et al. (2019) Zhengyang Wu, Srivignesh Rajendran, Tarrence Van As, Vijay Badrinarayanan, and Andrew Rabinovich. 2019. EyeNet: A Multi-Task Deep Network for Off-Axis Eye Gaze Estimation. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 3683–3687.
  • Yiu et al. (2019) Yuk-Hoi Yiu, Moustafa Aboulatta, Theresa Raiser, Leoni Ophey, Virginia L Flanagin, Peter zu Eulenburg, and Seyed-Ahmad Ahmadi. 2019. DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning. Journal of neuroscience methods 324 (2019), 108307.
  • Zhou et al. (2003) Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Schölkopf. 2003. Learning with local and global consistency. Advances in neural information processing systems 16 (2003), 321–328.
  • Zuiderveld (1994) Karel Zuiderveld. 1994. Contrast Limited Adaptive Histogram Equalization. In Graphics Gems. https://doi.org/10.1016/b978-0-12-336156-1.50061-6