
A Fine-Grained Visual Attention Approach for Fingerspelling Recognition in the Wild

Kamala Gajurel, Cuncong Zhong, Guanghui Wang
Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence KS, USA
Department of Computer Science, Ryerson University, Toronto ON, Canada, M5B 2K3
Abstract

Fingerspelling in sign language is the means of communicating technical terms and proper nouns that do not have dedicated sign language gestures. Automatic recognition of fingerspelling can help resolve communication barriers when interacting with deaf people. The main challenges in fingerspelling recognition are the ambiguity of the gestures and the strong articulation of the hands. An automatic recognition model must address the high inter-class visual similarity and high intra-class variation of the gestures. Most existing research in fingerspelling recognition has focused on datasets collected in controlled environments. The recent collection of a large-scale annotated fingerspelling dataset in the wild, from social media and online platforms, captures the challenges of a real-world scenario. In this work, we propose a fine-grained visual attention mechanism using the Transformer model for the sequence-to-sequence prediction task on this wild dataset. The fine-grained attention is achieved by utilizing the change in motion across video frames (optical flow) in sequential context-based attention, together with a Transformer encoder. The model is trained on the unsegmented continuous video dataset by jointly balancing the Connectionist Temporal Classification (CTC) loss and the maximum-entropy loss. The proposed approach captures better fine-grained attention in a single iteration. Experimental evaluations show that it outperforms the state-of-the-art approaches.

Index Terms:
Fingerspelling recognition; visual attention; Transformer network.

I INTRODUCTION

Sign language is the most structured form of non-verbal communication for people with hearing and speech impairments. It incorporates symbols or gestures made by the hands, together with facial expressions and body postures. According to the World Federation of the Deaf, there are over 300 different sign languages and about 70 million deaf people around the world [27]. Sign language dialects vary from place to place, although some of the gestures may overlap. One family of sign languages is American Sign Language (ASL). It is the natural language of around 500,000 people in the US and Canada [7]. ASL is also the third most widely studied language other than English in the United States, according to the Modern Language Association's study of US colleges and universities [12]. ASL has its own distinct syntax and grammar, which change and develop over time.

One of the most important elements of sign language is fingerspelling. Fingerspelling is a practical but constrained component of sign language [20]. It is the representation of individual letters and numbers using standard finger positions. Fingerspelling is generally used when there is no distinct sign for a word [24]. Such words are either technical terms or proper nouns, such as personal names, place names, brands, etc. Even when a sign exists for a word, fingerspelling can be employed to emphasize it in focus constructions [14]. Fingerspelling accounts for up to 35% of ASL and is often used in social interaction or conversations involving technical content [15]. Fig. 1 shows the alphabet symbols used in ASL.

Figure 1: The symbols of the alphabet in ASL fingerspelling [25]

Sign language recognition is the task of classifying the symbols or gestures made by the hands along with the facial expressions and postures of the body. Two individuals can easily understand each other's signs if both have a good understanding of the language. However, the same symbols are not comprehensible when one of the communicators is a person with limited knowledge of the language, or a machine. Sign language recognition has therefore been an important means of reducing the communication barrier for deaf individuals [24]. Furthermore, sign language recognition benefits many applications, such as assistive technologies for the deaf, interpreting services, translation services, human-computer interaction, and robotics [20].

Fingerspelling recognition, a subtask of sign language recognition, is essential for communication in specialized areas and in online social media for the deaf [24]. Fingerspelling recognition can also assist in the teaching and learning process for beginners. Fingerspelling recognition in ASL is relatively simple compared to the general sign language recognition task, since fingerspelling mostly involves only one hand [24]. This work focuses on the fingerspelling recognition task.

Although fingerspelling recognition only involves one hand and a limited vocabulary, it remains a very challenging task. The main challenge is the large inter-class similarity and large intra-class variation among letters, which leads to incorrect classification. The difficulty of recognition also depends on the setting in which the dataset is prepared. The recent growth of deaf online media has motivated recognition on wild datasets, which reflect real-world scenarios. In such data, additional difficulty arises from the motion blur of low-quality video [24]. Other challenges in the recognition task are the variability in the appearance of signers' bodies and the large number of degrees of freedom [20], diverse viewpoints and orientations of the hand [30], and the difficulty of capturing the spatial and temporal context during fingerspelling.

It is essential to obtain fine-grained features of the hand to handle the challenges associated with fingerspelling recognition. The signing hand should be localized to identify the region of interest by focusing on a specific spatial region. The region of interest should then be further refined to capture the temporal context as well as the subtle differences between gestures.

The state of the art for capturing fine-grained features in fingerspelling recognition is based on spatial attention with a recurrent neural network [24]. That work implements iterative visual attention, which helps obtain regions of interest at high resolution. Our proposed model instead introduces a context-based attention mechanism to refine the attention weights, followed by a Transformer encoder. It also introduces the maximum entropy loss to fingerspelling recognition as a regularizer. Our model outperforms the state of the art by around 3% in a single iteration. Moreover, the computation time for training and inference is significantly reduced due to the use of the Transformer model and a single iteration. The source code of the proposed model will be available on the author's homepage (https://github.com/rucv/Sequence-Fingerspelling).

The rest of the paper is organized as follows. In the next section, we describe related work in the domain of fingerspelling recognition. Section III formally introduces the sequence-to-sequence prediction problem in fingerspelling recognition. Section IV describes the proposed model. The experiments conducted and the results are described in Sections V and VI, respectively. Finally, we conclude the paper in Section VII.

II Related Work

With the fast development of deep learning, convolutional neural networks have been successfully applied to various classification and recognition tasks [2, 9, 19, 21, 28, 29]. With the availability of large-scale image/video datasets from online media such as YouTube and sign language websites, many sign language recognition approaches using RGB data have been proposed. In this work, we focus on the sequence-to-sequence prediction task with fingerspelling videos. A number of prior works have addressed this problem, taking videos as input and letter sequences as output [8, 23, 22, 11, 16, 24].

Our work is closely related to two studies, [22] and [24]. The paper [22] introduces the naturally occurring video dataset ChicagoFSWild, collected from YouTube, aslized.org, and deafvideo.tv. This work first segments the signing hand from the image frames using a hand detector. The resulting images are passed through CNN layers for feature extraction, and the features are then fed through a long short-term memory (LSTM) network. The letter sequence is predicted using either connectionist temporal classification (CTC) or an attention-based LSTM decoder. The results show that the CTC-based prediction model performs better than the decoder-based one.

The paper [24] works on the same dataset and introduces another, larger dataset, ChicagoFSWild+, for further experimentation. Its main idea is an end-to-end model based on an attention mechanism that does not explicitly localize the hand as done in [22]. Using an iterative attention mechanism followed by the CTC model, this paper improves the performance on ChicagoFSWild. In addition, training on both the Wild and Wild+ data improves the test accuracy by a significant amount. In contrast to [24], our proposed work uses a Transformer-based attention mechanism instead of an RNN-based spatial attention mechanism.

Another work on ChicagoFSWild is [17], which uses 3D hand pose estimation. Its evaluation shows that the performance on the test data increases slightly compared to [24]. The paper [1] proposes a Transformer-based encoder-decoder architecture for sign language recognition. Our proposed model differs from these previous studies in that we focus on fingerspelling recognition using context-based attention followed by a Transformer encoder only.

III Problem Formulation

Sequence-to-sequence prediction with videos is a challenging task in fingerspelling recognition since a single video contains multiple continuous signs. Given a set of images representing video frames $I_{1}, I_{2}, \ldots, I_{T}$, the task is to predict the target letter sequence $l_{1}, l_{2}, \ldots, l_{K}$, where $K \leq T$. The labeling in this task is unsegmented, with no alignment between frames and labels. Connectionist Temporal Classification (CTC) [5] is used to find the predicted letter sequence in such a setting.

For given video frames $I^{1:T}$, the first step for the CTC model is to find a function $f$ that learns the probability distribution of labels for each frame, $y^{1:T}$:

$y^{1:T}=\sigma(f(I^{1:T},\theta))$ (1)

where $\sigma$ is the softmax function, $\theta$ denotes the model parameters, and $y\in[0,1]^{C^{\prime}}$ contains the probabilities of the $C$ classes plus one extra blank class indicating none of the classes. These output probabilities define the different possible alignments of label sequences with the input sequence. The probability of an alignment $\pi$ for a given sequence is defined as follows:

$p(\pi|\mathbf{x}^{1:T})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$ (2)

where $\pi\in C^{\prime T}$ and $\mathbf{x}$ denotes the visual features obtained from $f$.

Next, a many-to-one map $\mathbf{B}:C^{\prime T}\mapsto C^{\leq T}$ is defined by removing all repeated labels and blanks from an alignment. For example, $\mathbf{B}(aa\text{-}ss\text{-}l)=\mathbf{B}(\text{-}a\text{-}sll)=asl$. A path with the same character in adjacent positions is handled by inserting a blank between the repeated characters; for example, $\mathbf{B}(ee\text{-}ggg\text{-}g)=egg$. The map $\mathbf{B}$ is used to define the conditional probability of a letter sequence $\mathbf{l}=l_{1},l_{2},\ldots,l_{K}$ given the visual features $\mathbf{x}$:

$p(\mathbf{l}|\mathbf{x})=\sum_{\pi\in\mathbf{B}^{-1}(\mathbf{l})}p(\pi|\mathbf{x})$ (3)
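To make the map $\mathbf{B}$ concrete, the following is a minimal Python sketch that collapses a CTC alignment by merging adjacent repeated labels and then dropping blanks; the function name and blank symbol are illustrative only.

```python
def ctc_collapse(path, blank="-"):
    """Collapse a CTC alignment: merge adjacent repeats, then drop blanks."""
    collapsed = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:  # keep a symbol only when it starts a new run and is not blank
            collapsed.append(symbol)
        prev = symbol
    return "".join(collapsed)


# Examples from the text:
assert ctc_collapse("aa-ss-l") == "asl"
assert ctc_collapse("-a-sll") == "asl"
assert ctc_collapse("ee-ggg-g") == "egg"
```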

The CTC model optimizes this probability for the ground-truth label sequences via the CTC loss. In addition to the CTC loss, the maximum entropy loss (MEL) [3] is applied as regularization to avoid overfitting:

$\text{MEL}=\log_{2}(C^{\prime})-\frac{1}{T}\sum_{t=1}^{T}\sum_{k=1}^{C^{\prime}}-p(y^{t}_{k})\log_{2}\big(p(y^{t}_{k})\big)$ (4)

where $T$ is the total number of video frames and $y_{k}^{t}$ is the probability of class $k$ for frame $t$. The goal of using MEL is to encourage probability distributions that do not spike toward a single class.
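The combined objective can be sketched as follows in PyTorch. This is a minimal sketch, not the authors' released code: the log-probabilities are assumed to follow the (T, N, C') layout expected by torch.nn.functional.ctc_loss, and mel_weight is a hypothetical hyper-parameter balancing the two terms.

```python
import math
import torch
import torch.nn.functional as F

def ctc_with_entropy_regularization(log_probs, targets, input_lengths,
                                    target_lengths, mel_weight=0.1):
    """CTC loss plus a maximum-entropy penalty on the per-frame distributions (Eq. 4)."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)

    probs = log_probs.exp()
    num_classes = log_probs.size(-1)                      # C' = letters + blank
    # Per-frame entropy in bits, averaged over frames and batch.
    entropy = -(probs * log_probs).sum(dim=-1) / math.log(2.0)
    mel = math.log2(num_classes) - entropy.mean()

    return ctc + mel_weight * mel
```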

IV Methodology

The proposed approach for sequence-to-sequence prediction is depicted in Fig. 2. The video is first converted to a sequence of frames, and the frames are processed to a dimension of 224×224×3. The frames are passed through a pre-trained ResNet18 in parallel to extract a feature map of dimension 256×14×14 for each frame. The output of ResNet18 is passed through the context-based attention module, described in Section IV-A, to obtain the spatial attention weights for each frame.

The context-based attention weights are multiplied with the feature maps obtained from ResNet18 to produce the attended feature maps. The attended feature map of each frame is then passed through adaptive average pooling, which converts each frame to a size of 256×9×9. The output of the pooling layer is used by a fully connected layer to summarize the features into a 256-dimensional image embedding for each frame. The summarized embedding, together with the positional information from the positional encoder [26], is passed through the Transformer encoder layer, which generates the context and relevant information from the input. After the encoder layer, the 256-dimensional encoding is passed to the classifier, a fully connected layer that produces an output of C+1 dimensions, where C is the number of classes and the extra class is the blank character representing the transition between signs. The log probabilities of each class are then passed to the Connectionist Temporal Classification (CTC) decoder to predict the letter sequence of the given video.
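The overall pipeline can be condensed into the following structural sketch. It is an assumption-laden reading of Fig. 2, not the authors' implementation: the class names, the truncation point of ResNet18, and the single-clip (batch size 1) layout are our choices, and ContextBasedAttention refers to the module sketched in Section IV-A below.

```python
import math
import torch
import torch.nn as nn
import torchvision

class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding [26]."""
    def __init__(self, d_model, max_len=500):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (T, d_model)
        return x + self.pe[: x.size(0)]

class FingerspellingModel(nn.Module):
    """Sketch of the pipeline in Fig. 2 (names and details are assumptions)."""
    def __init__(self, num_classes, d_model=256):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)  # torchvision >= 0.13: weights="IMAGENET1K_V1"
        # Keep the stages that yield 256x14x14 feature maps for a 224x224 input.
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        self.attention = ContextBasedAttention(d_model)        # sketched in Sec. IV-A
        self.pool = nn.AdaptiveAvgPool2d((9, 9))
        self.embed = nn.Linear(256 * 9 * 9, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=2,
                                                   dim_feedforward=1024, dropout=0.3)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes + 1)  # +1 for the CTC blank

    def forward(self, frames, priors):
        # frames: (T, 3, 224, 224) for one clip; priors: (T, 14, 14) optical-flow priors.
        feats = self.backbone(frames)                           # (T, 256, 14, 14)
        attended = self.attention(feats, priors)                # attended feature maps
        pooled = self.pool(attended).flatten(1)                 # (T, 256*9*9)
        emb = self.pos_enc(self.embed(pooled)).unsqueeze(1)     # (T, 1, 256)
        enc = self.encoder(emb)                                 # (T, 1, 256)
        return self.classifier(enc).log_softmax(-1)             # log-probs for the CTC decoder
```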

Figure 2: Block diagram for the sequence-to-sequence prediction of the video dataset.
Figure 3: Block diagram for the context-based attention module.

IV-A Context-based Attention Module

The context-based attention module, shown in Fig. 3, comprises a spatial attention layer, a positional encoder, and a self-attention layer, together with the priors ($P$). First, the feature map of each frame is processed to obtain the corresponding spatial attention map $\mathbf{A_{i}}$:

$\mathbf{A_{i}}=\delta[\delta(\mathbf{F_{i}}\times W_{a})\times W_{v}], \quad \forall i\in\{1,\ldots,T\}$ (5)

where $\mathbf{F_{i}}\in\mathbb{R}^{14\times 14\times 256}$ is the transposed feature map, $W_{a}\in\mathbb{R}^{256\times 512}$ and $W_{v}\in\mathbb{R}^{512\times 1}$ are the attention weights learned using back-propagation, and $\delta$ is the ReLU activation function.

The spatial attention layer does not take into account the context of prior frames, so a positional encoder followed by a self-attention layer is used. The positional encoder provides the relative positional information from all the frames. The self-attention layer [26] is then applied to the attention maps ($\mathbf{A}\in\mathbb{R}^{L\times 14\times 14}$) from the spatial attention layer to obtain the refined maps $\mathbf{A^{s}}$.

Next, the final attention weight of dimension $14\times 14$ is computed as the weighted sum of the prior and the softmax of the per-frame attention weights obtained from the self-attention layer:

$\mathbf{A^{p}_{i}}=w_{p}\times P_{i}+(1-w_{p})\times\sigma(\mathbf{A^{s}_{i}})$ (6)

where $\mathbf{A^{p}_{i}}$ is the final attention weight and $w_{p}$ is the weight for the prior learned by the model. The prior is generated from optical flow [4] following the same process as [24].

The final attention map is then multiplied element-wise with the feature map from ResNet18 to obtain the attended feature map:

$\mathbf{F_{i}}=\mathbf{F_{i}}\odot\mathbf{A^{p}_{i}}$ (7)

where $\mathbf{F_{i}}$ is the feature map of frame $i$.
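A minimal sketch of this module is given below, assuming 2 attention heads and a sigmoid on the learned mixing weight $w_p$ to keep it in $[0,1]$; the positional encoder is omitted for brevity, and the class name matches the placeholder used in the pipeline sketch of Section IV.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextBasedAttention(nn.Module):
    """Sketch of the context-based attention module (Fig. 3); an assumption-laden reading of Eqs. (5)-(7)."""
    def __init__(self, d_feat=256, d_hidden=512, n_heads=2):
        super().__init__()
        self.w_a = nn.Linear(d_feat, d_hidden, bias=False)         # W_a in Eq. (5)
        self.w_v = nn.Linear(d_hidden, 1, bias=False)              # W_v in Eq. (5)
        self.self_attn = nn.MultiheadAttention(14 * 14, n_heads)   # refines maps across frames
        self.w_p = nn.Parameter(torch.tensor(0.0))                 # learned prior weight w_p (before sigmoid)

    def forward(self, feats, priors):
        # feats: (T, 256, 14, 14); priors: (T, 14, 14) optical-flow priors.
        T = feats.size(0)
        f = feats.permute(0, 2, 3, 1)                              # transposed maps F_i: (T, 14, 14, 256)
        a = F.relu(self.w_v(F.relu(self.w_a(f)))).squeeze(-1)      # spatial attention A_i (Eq. 5): (T, 14, 14)

        # Self-attention over frames (positional encoding omitted for brevity).
        seq = a.flatten(1).unsqueeze(1)                            # (T, 1, 196)
        refined, _ = self.self_attn(seq, seq, seq)                 # refined maps A^s
        soft = torch.softmax(refined.squeeze(1), dim=-1).view(T, 14, 14)

        wp = torch.sigmoid(self.w_p)                               # keep the mixing weight in [0, 1]
        a_p = wp * priors + (1 - wp) * soft                        # Eq. (6)
        return feats * a_p.unsqueeze(1)                            # Eq. (7): F_i ⊙ A^p_i
```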

IV-B Transformer Encoder

The Transformer encoder [26] consists of a multi-head attention layer and a feed-forward network, each followed by an add-and-normalize layer. The multi-head attention layer helps the model jointly attend to information from different positions of the embeddings. Here, masked multi-head attention is used so that each embedding can take positional information only from previous frame embeddings when refining the attention for each frame (a sketch of such a mask is given below). The add-and-normalize layer acts as a residual connection between the two sub-layers. The feed-forward network in the Transformer encoder processes the attended embeddings from the multi-head attention layer independently and identically.
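One way to realize this masking, assuming the window of 5 prior frames mentioned in Section V-C and the nn.TransformerEncoder interface, is a banded causal mask; this is a hypothetical helper, not code from the paper.

```python
import torch

def banded_causal_mask(num_frames, window=5):
    """Boolean attention mask: True marks positions that may NOT be attended.

    Each frame may attend only to itself and up to `window` prior frames."""
    idx = torch.arange(num_frames)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # rel[i, j] = j - i for query i and key j
    return (rel > 0) | (rel < -window)          # mask future frames and frames too far back


# Usage with the encoder from the pipeline sketch in Section IV:
# mask = banded_causal_mask(T, window=5)
# enc = encoder(emb, mask=mask)
```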

IV-C CTC Decoder

We consider the following three types of CTC decoding techniques.

  • Greedy decoding. Greedy decoding is the method in which the decoder chooses the letter with maximum probability for each frame. Mathematically,

    $\hat{l}_{t}=\operatorname{argmax}_{k}(y^{t}_{k}), \quad \forall k\in\{1,\ldots,C+1\}$ (8)

    where $\hat{l}_{t}$ is the predicted letter and $y^{t}_{k}$ is the probability of class $k$ at frame $t$.

    Although greedy decoding is fast, it chooses the single alignment path with the highest probability, whereas many alignment paths may merge into the same sequence label. For example, let 'cat' be the true sequence for a 5-frame clip, with alignments -oat-, -ccat, --cat, cc-at, and c-aat having probabilities 0.4, 0.2, 0.2, 0.1, and 0.1, respectively. Greedy decoding predicts oat, yet the other alignments collapse to cat with a total probability of 0.6. It is therefore better to use a beam search method that mitigates this problem (a minimal decoding sketch is given after this list).

  • Beam search decoding. Beam search decoding in CTC expands all possible next states and retains the BW (beam width) most likely states as the sequence is constructed. The retained hypotheses are obtained after collapsing the alignments not ending with a blank, and the probability scores of alignments that collapse to the same sequence are merged. BW is a hyper-parameter that balances decoding speed against accuracy.

  • Beam Search with Language Model. A character-level language model aims to predict the next character given the previous characters in the sequence. Integrating a language model into the CTC decoder can boost performance by capturing the relationships between characters in the given domain. For the first time step, the language model uses the observed marginal probabilities of the characters. For subsequent characters, it considers the conditional probability of a character given the previous $k$ characters, where $k$ is known as the order of the language model.

    The language model can be integrated with beam search decoding by modifying the scores of the sequences obtained after merging identical states. Let $s_{b}$ be the beam search score and $\alpha$ be the weight of the language model. Then the score after applying the language model is

    $s=(1-\alpha)\times s_{b}+\alpha\times P(\hat{l}_{t}|\hat{l}_{t-1},\ldots,\hat{l}_{t-k})$ (9)

    where $\hat{l}_{t}$ is the predicted character at position $t$.
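A minimal sketch of greedy decoding (Eq. 8 followed by the collapse of Section III) and of the LM-weighted score of Eq. (9) is given below; the alphabet ordering with the blank at index 0 is a hypothetical choice, not the paper's.

```python
import torch

ALPHABET = "-abcdefghijklmnopqrstuvwxyz"   # index 0 is the CTC blank (hypothetical ordering)

def greedy_decode(log_probs):
    """Pick the argmax class per frame (Eq. 8), then collapse repeats and drop blanks."""
    best = log_probs.argmax(dim=-1).tolist()   # log_probs: (T, C'); frame-wise predictions
    out, prev = [], None
    for k in best:
        if k != prev and k != 0:               # merge adjacent repeats, skip the blank
            out.append(ALPHABET[k])
        prev = k
    return "".join(out)

def lm_rescored(beam_score, lm_prob, alpha=0.2):
    """Combine a beam-search score with a character-LM probability as in Eq. (9)."""
    return (1 - alpha) * beam_score + alpha * lm_prob
```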

V Experiments

V-A Dataset

The proposed model is tested on the ChicagoFSWild [22] dataset, which consists of fingerspelling clips from ASL videos available on online media such as YouTube, aslized.org, and deafvideo.tv. There are 7,304 clip sequences in total, split into 5,455 training, 981 validation (development), and 868 test sequences. The sequences are mostly right-handed (6,782 sequences), with a few left-handed (522 sequences) and rarely both-handed (121 sequences). The dataset contains 192 unique signers, among which 91 are male and 77 are female. Each unique signer is assigned to only one of the data partitions for a signer-independent setting. The vocabulary consists of the 26 fingerspelling letters (a-z) and 5 special characters (SPACE, &, ', ., @); the special characters occur very rarely.

V-B Data Preprocessing

  • Whole frame. The original raw video frames in the wild dataset have different dimensions since the videos are collected from different online sites. To obtain a uniform dimension, the frames are resized to 224 × 224 and normalized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. The priors obtained from optical flow are resized to the feature-map dimension of 14 × 14.

  • Face ROI. In general, fingerspelling is carried out with the signing hand near the face region, which motivates using existing face detection approaches to obtain the region of interest (ROI). The data preprocessing for Face ROI follows the paper [24]. After obtaining the frames with the region of interest, they are also resized to 224 × 224 and normalized using the same mean and standard deviation as in the Whole frame case, and the priors are again resized to 14 × 14 (a minimal preprocessing sketch follows this list).
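The frame and prior preprocessing described above can be sketched with standard torchvision transforms; this is a minimal sketch under the stated mean/std and sizes, not the authors' preprocessing code.

```python
import torch
from torchvision import transforms

# Resize to 224x224 and normalize with the ImageNet statistics stated above.
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),               # applied to PIL frames
    transforms.ToTensor(),                       # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def resize_prior(prior):
    """Resize a single optical-flow prior map (H, W) to the 14x14 feature-map resolution."""
    prior = prior.unsqueeze(0).unsqueeze(0)      # (1, 1, H, W) for interpolate
    resized = torch.nn.functional.interpolate(prior, size=(14, 14),
                                              mode="bilinear", align_corners=False)
    return resized.squeeze(0).squeeze(0)         # back to (14, 14)
```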

V-C Models and Setup

For our proposed model, the Transformer encoder uses 2 layers with masked 2-head attention. The masking is done in such a way that only 5 prior frames are considered to refine the attention in each frame. The number of hidden units for the feed-forward network is 1024. Dropouts of 0.3 and 0.1 are used for the Transformer layers and context-based attention layer respectively. The following three ablation studies have been conducted.

  • CTC loss: In this approach, we use only CTC loss while training the model. The purpose of this approach is to observe how well the proposed model performs without regularization and augmentation.

  • Maximum entropy loss: This approach uses CTC loss together with maximum entropy loss as regularization during training. CTC loss generally yields peaky (spiked) output distributions, which leads to overfitting [10]. To mitigate this problem, the maximum entropy loss described in equation 4 is used as regularization.

  • Horizontal flipping: The input dataset contains both right and left signing hands, but the distribution is imbalanced, with left-handed signers making up about 7% of the samples. To help the model generalize, some of the input training samples are flipped at random. This data augmentation alleviates the imbalanced distribution of signing hands. The probability of flipping an input is set to 0.3 after hyper-parameter tuning (a sketch of this augmentation follows this list).
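A minimal sketch of this augmentation is given below; flipping the optical-flow priors together with the frames is our assumption, and the probability value matches the 0.3 stated above.

```python
import random
import torch

def maybe_flip_clip(frames, priors, p=0.3):
    """Flip an entire clip horizontally with probability p.

    Flipping the whole clip, rather than individual frames, keeps the signing
    hand consistent across the sequence; the letter labels are unchanged.
    """
    if random.random() < p:
        frames = torch.flip(frames, dims=[-1])   # flip along the width axis
        priors = torch.flip(priors, dims=[-1])   # assumption: priors are flipped consistently
    return frames, priors
```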

The experiments are performed for different combinations of CTC loss, maximum-entropy loss, and horizontal flipping. All experiments are implemented in the PyTorch framework [18] on one NVIDIA Tesla P100 GPU. PyTorch's implementation of the CTC loss is used with the AdamW optimizer [13], with a learning rate of $10^{-4}$ and the other AdamW parameters kept at their defaults.

The experiments are run for both Whole frame and Face ROI preprocessing and are trained for 20 epochs. For each experiment, the letter accuracy is reported for greedy decoding, beam search decoding (https://github.com/parlance/ctcdecode) with a beam width of 20, and beam search with a KenLM [6] character-level language model. The language model weight ($\alpha$) is tuned to 0.2. The letter accuracy of a predicted sequence is calculated following [22, 24], based on the minimum edit distance between the predicted sequence and the ground-truth letter sequence. Mathematically,

$\textit{Letter accuracy}\ (L_{acc})=\max\left(0,\,1-\frac{S+D+I}{N}\right)$ (10)

where $S$, $D$, and $I$ denote the numbers of substitutions, deletions, and insertions in the alignment, and $N$ is the total number of letters in the ground-truth sequence. The $\max$ in the equation keeps the accuracy non-negative when $S+D+I>N$.
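This metric can be sketched as a normalized Levenshtein distance, since the minimum edit distance already counts $S+D+I$; the helper below is illustrative, not the evaluation script of [22, 24].

```python
def letter_accuracy(prediction, ground_truth):
    """Letter accuracy of Eq. (10): 1 minus the normalized edit distance, floored at 0."""
    m, n = len(prediction), len(ground_truth)
    # Single-row Levenshtein distance; substitutions, deletions, insertions all cost 1.
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = dist[j]
            cost = 0 if prediction[i - 1] == ground_truth[j - 1] else 1
            dist[j] = min(dist[j] + 1,        # deletion
                          dist[j - 1] + 1,    # insertion
                          prev + cost)        # substitution or match
            prev = cur
    return max(0.0, 1.0 - dist[n] / max(n, 1))


# Example: predicting "cot" for ground truth "cat" gives 1 - 1/3 ≈ 0.67.
```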

VI Result and Discussion

Table I compares the letter accuracy of the state-of-the-art models and the proposed model on both the development (dev) and test sets. As shown in the table, the proposed model outperforms the other two approaches on both development and test accuracy. Both the proposed model and [24] use Face ROI. The proposed model is trained in a single iteration using CTC loss, maximum entropy loss, and horizontal flipping, whereas [24] is an iterative visual-attention model trained over three iterations using CTC loss only. In both cases, beam search decoding with a language model (BS+LM) is employed for evaluation. The paper [17] exploits 3D hand pose information and reports performance only on the test set, at 47.93%.

TABLE I: Comparison of letter accuracy for continuous gestures on the ChicagoFSWild benchmark, trained on the ChicagoFSWild/train partition.

Experiment             Letter Accuracy (%)
                       dev       test
Shi et al. [24]        46.8      45.1
Parelli et al. [17]    -         47.93
Proposed Model         46.96     48.36
TABLE II: Ablation study on the ChicagoFSWild development set for different combinations of CTC loss, maximum entropy loss, and horizontal flipping (Face ROI).

Study Cases                                             Letter Accuracy (%) in dev
                                                        Greedy    BS       BS+LM
CTC loss only                                           38.20     38.76    39.50
CTC loss + Maximum Entropy loss                         39.79     42.02    42.20
CTC loss + Horizontal flipping                          42.52     43.56    44.44
CTC loss + Maximum Entropy loss + Horizontal flipping   44.82     45.77    46.96

Similar to the results of [24], we observe that the performance for Face ROI is superior to that of the Whole frame (33.68% using beam search decoding). We therefore report the ablation study only for the Face ROI case. Table II shows the performance of the proposed model trained under different settings and evaluated with three decoding techniques: greedy, beam search (BS), and beam search with language model (BS+LM). The evaluation is done on the development set only, in order to select the best model for the test set. The ablation study shows that including both maximum entropy loss and horizontal flipping yields the best model, which outperforms the other settings for all three decoding techniques. Among the three decoding methods, beam search decoding with the language model gives the best performance.

Our hypothesis is that the maximum entropy loss helps with regularization and that horizontal flipping data augmentation helps the model generalize across the data distribution. The purpose of the ablation study is to test this hypothesis. As the table shows, both methods improve the model performance on unseen data, and combining the two boosts the performance even further, outperforming the state of the art on the development set.

In terms of training time, the proposed model takes about 6 ms per frame on average using one NVIDIA Tesla P100 GPU, whereas [24] reported an average training time of 65 ms per frame on an NVIDIA Tesla K40c GPU. The difference is partially due to the parallel training of Transformers compared to the sequential LSTM.

TABLE III: Comparison of letter accuracy on the development dataset by iteration.

Model              Experiment     First Iter.         Final Iter.
Shi et al. [24]    Whole frame    23.0% (w/o LM)      43.6%
Proposed Model     Whole frame    33.68% (w/o LM)     36.38%
Shi et al. [24]    Face ROI       35.2%               46.8%
Proposed Model     Face ROI       46.96%              45.01%

Table III compares the performance on the development data for the first and final iterations. [24] reports the letter accuracy of the final iteration for both Whole frame and Face ROI using beam search decoding with the language model; however, for the first iteration on the Whole frame, only the performance with the beam search decoder is reported. The table therefore reports the performance of the proposed model in the same manner for consistency. As the table shows, the proposed model in its first iteration outperforms [24] by a significant margin in both the Whole frame and Face ROI cases. However, for the Whole frame in the final iteration, the performance does not improve over the state of the art, although it does improve from the first to the final iteration. For the Face ROI case, the final-iteration performance differs only slightly from that of the first iteration.

VII Conclusion

In this paper, we have proposed a fine-grained visual attention approach for the sequence-to-sequence prediction task of fingerspelling recognition. We exploit a Transformer-based contextual attention mechanism to capture fine-grained details. The maximum entropy loss helps regularize the model, and horizontal flipping data augmentation assists its generalization. The proposed Transformer-based visual attention model significantly improves the state-of-the-art performance on the ChicagoFSWild dataset, although the accuracy is still low compared to human performance. We will further evaluate our model on the new, larger ChicagoFSWild+ dataset, collected with crowdsourced annotators. The work could also be extended to perform hand segmentation in an end-to-end fashion.

ACKNOWLEDGMENT

The work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) under grant number RGPIN-2021-04244.

References

  • [1] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, “Sign language transformers: Joint end-to-end sign language recognition and translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 023–10 033.
  • [2] F. Cen and G. Wang, “Boosting occluded image classification via subspace decomposition-based estimation of deep features,” IEEE transactions on cybernetics, vol. 50, no. 7, pp. 3409–3422, 2019.
  • [3] A. Dubey, O. Gupta, R. Raskar, and N. Naik, “Maximum entropy fine-grained classification,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 635–645.
  • [4] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Scandinavian conference on Image analysis.   Springer, 2003, pp. 363–370.
  • [5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
  • [6] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, “Scalable modified Kneser-Ney language model estimation,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).   Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 690–696. [Online]. Available: https://www.aclweb.org/anthology/P13-2121
  • [7] M. Jay, “American sign language,” https://www.startasl.com/american-sign-language/, 2021, accessed: 2021-03-05.
  • [8] T. Kim, J. Keane, W. Wang, H. Tang, J. Riggle, G. Shakhnarovich, D. Brentari, and K. Livescu, “Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation,” Computer Speech & Language, vol. 46, pp. 209–232, 2017.
  • [9] K. Li, N. Y. Wang, Y. Yang, and G. Wang, “Sgnet: A super-class guided network for image classification and object detection,” arXiv preprint arXiv:2104.12898, 2021.
  • [10] H. Liu, S. Jin, and C. Zhang, “Connectionist temporal classification with maximum entropy regularization,” Advances in Neural Information Processing Systems, vol. 31, pp. 831–841, 2018.
  • [11] S. Liwicki and M. Everingham, “Automatic recognition of fingerspelled words in british sign language,” in 2009 IEEE computer society conference on computer vision and pattern recognition workshops.   IEEE, 2009, pp. 50–57.
  • [12] D. Looney and N. Lusin, “Enrollments in languages other than english in united states institutions of higher education, summer 2016 and fall 2016: Preliminary report.” in Modern Language Association.   ERIC, 2018.
  • [13] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [14] K. Montemurro and D. Brentari, “Emphatic fingerspelling as code-mixing in american sign language,” Proceedings of the Linguistic Society of America, vol. 3, no. 1, pp. 61–1, 2018.
  • [15] C. A. Padden and D. C. Gunsauls, “How the alphabet came to be used in a sign language,” Sign Language Studies, pp. 10–33, 2003.
  • [16] K. Papadimitriou and G. Potamianos, “End-to-end convolutional sequence learning for asl fingerspelling recognition.” in INTERSPEECH, 2019, pp. 2315–2319.
  • [17] M. Parelli, K. Papadimitriou, G. Potamianos, G. Pavlakos, and P. Maragos, “Exploiting 3d hand pose estimation in deep learning-based sign language recognition from rgb videos,” in European Conference on Computer Vision.   Springer, 2020, pp. 249–263.
  • [18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [19] K. Patel, K. Li, K. Tao, Q. Wang, A. Bansal, A. Rastogi, and G. Wang, “A comparative study on polyp classification using convolutional neural networks,” PloS one, vol. 15, no. 7, p. e0236452, 2020.
  • [20] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: A deep survey,” Expert Systems with Applications, p. 113794, 2020.
  • [21] U. Sajid, M. Chow, J. Zhang, T. Kim, and G. Wang, “Parallel scale-wise attention network for effective scene text recognition,” arXiv preprint arXiv:2104.12076, 2021.
  • [22] B. Shi, A. M. Del Rio, J. Keane, J. Michaux, D. Brentari, G. Shakhnarovich, and K. Livescu, “American sign language fingerspelling recognition in the wild,” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 145–152.
  • [23] B. Shi and K. Livescu, “Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2017, pp. 389–396.
  • [24] B. Shi, A. M. D. Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu, “Fingerspelling recognition in the wild with iterative visual attention,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5400–5409.
  • [25] StartASL, “American sign language alphabet,” https://www.startasl.com/american-sign-language-alphabet/, 2021, accessed: 2021-03-05.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
  • [27] World Federation of the Deaf, https://wfdeaf.org/our-work/, 2018, accessed: 2021-03-05.
  • [28] W. Xu, Y. Wu, W. Ma, and G. Wang, “Adaptively denoising proposal collection for weakly supervised object localization,” Neural Processing Letters, vol. 51, no. 1, pp. 993–1006, 2020.
  • [29] T. Zhang, X. Zhang, Y. Yang, Z. Wang, and G. Wang, “Efficient golf ball detection and tracking based on convolutional neural networks and kalman filter,” arXiv preprint arXiv:2012.09393, 2020.
  • [30] C. Zimmermann and T. Brox, “Learning to estimate 3d hand pose from single rgb images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4903–4911.