
Keypoint based Sign Language Translation without Glosses

Youngmin Kim
Incheon National University
Incheon, Republic of Korea
winston121497@gmail.com
   Minji Kwak
Chung-Ang University
Seoul, Republic of Korea
minz@cau.ac.kr
   Dain Lee
Ewha Womans University
Seoul, Republic of Korea
na05128@ewhain.net
   Yeongeun Kim
Konkuk University
Seoul, Republic of Korea
zena0101@naver.com
   Hyeongboo Baek
Incheon National University
Incheon, Republic of Korea
hbbaek359@gmail.com
Abstract

Sign Language Translation (SLT) has received relatively little attention compared to Sign Language Recognition (SLR). However, SLR recognizes the unique grammar of sign language, which differs from spoken language, so its output is not easily interpreted by non-disabled people. We therefore address the problem of translating sign language video directly into spoken language. To this end, we propose a new keypoint normalization method that performs translation based on the skeleton points of the signer and normalizes these points robustly; it improves performance by customizing the normalization to each body part. In addition, we propose a stochastic frame selection method that enables frame augmentation and sampling at the same time. Finally, the video is translated into spoken language through an attention-based translation model. Because it requires no glosses, our method can be applied to a wide range of datasets, and quantitative experiments demonstrate its effectiveness.

1 Introduction

As the value of an equal society has emerged, there is a growing social expectation that diverse people should be respected and given equal opportunities. However, deaf people have difficulty communicating with non-disabled people not only in their daily lives but also in situations where they need to acquire information [1].

Figure 1: An overview of our approach. S2T denotes the sign-language-to-text model.

To solve this problem, research on sign language translation using computers has long been conducted to help deaf people in their daily lives [2]. These studies aim to translate sign language videos into spoken language and thereby enable smooth communication between deaf and non-disabled people [3]. In the field of computer vision, however, the focus has been on Continuous Sign Language Recognition (CSLR), which recognizes sequences of glosses, rather than on Sign Language Translation (SLT) [4], which converts sign language video directly into spoken language. CSLR recognizes the grammar of sign language itself, and because the grammar of sign language differs from that of spoken language, CSLR alone rarely yields a meaningful interpretation.

Therefore, we study a method for translating directly from video to spoken language rather than a translation based on the grammar of sign language, which makes translation possible even in the absence of glosses. We approach SLT with Neural Machine Translation (NMT) methods: in this paper, we use a GRU [5] based Seq2Seq [6] model with the attention mechanism of Bahdanau et al. [7].

Sign language is a visual language and the main language of communication for the deaf. As a visual language, it uses many complementary channels to convey information [8]: not only hand movements but also nonverbal elements such as the signer's facial expressions, mouth, and upper-body movements play an important role in communication [9]. Therefore, this study proposes a sign language translation method based on the movement of the signer's body, and we perform keypoint-based SLT to capture that movement more precisely.

Frame processing is essential for keypoint-based SLT. Keypoint values vary greatly with the angle and position of the signer in the frame, so a normalization method must be applied to the keypoint vectors to generalize over these differences. In addition, a keypoint vector of fixed length is required to perform SLT with an NMT model, and existing keypoint-based sign language translation models [10, 11, 12, 13] fix the length of the sign language video through frame sampling. In this process, if the video is short, key frames are more likely to be dropped, which loses information about the video. Conversely, if the fixed length is set very large, many frames are included and memory usage becomes very high.

To solve these problems, we propose a simple but effective method. First, we propose a normalization method based on the distance between keypoints that applies a different normalization to each body part, which makes it more robust. We call it "Customized Normalization."

Second, in the process of adjusting the length of the sign language video, we propose a method of selecting frames based on probability. This lowers the priority of less important parts of the video and raises the priority of key frames. In addition, by combining augmentation and sampling depending on the video length, the method adapts to the dynamic number of frames per video. We call it "Stochastic Augmentation and Skip Sampling (SASS)".

In this paper, we propose a frame processing method that performs well across different video resolutions and adopt the Sign2Text setting so that it can be applied to datasets without glosses. The overall approach follows Figure 1, and our contributions are summarized as follows:

• We propose a new normalization method for keypoint processing that is well suited to sign language translation.

• We propose the SASS method for extracting key frames while accounting for videos of different lengths.

• Our method is highly versatile because it can be applied to datasets without glosses.

The rest of this paper is organized as follows. In Section 2, we survey sign language translation and video processing methods. In Section 3, we introduce our video processing method and SLT model. In Section 4, experiments demonstrate the effectiveness of our method. In Section 5, we analyze the effect of each component through ablation experiments. In Section 6, we share qualitative results. Finally, Section 7 discusses conclusions and future work.

Figure 2: The full architecture of our SLT approach. We extract keypoints from each frame and normalize the features. Then, after adjusting the number of frames, encoding is performed. The text is tokenized, embedded, and fed to the decoder. Here, GE is the bi-GRU encoder, GD is the GRU decoder, and X is the encoded vector that enters the model.

2 Related Work

2.1 Sign Language Translation

As mentioned earlier, research on sign language translation has been conducted steadily. However, there are several reasons why SLT systems have developed slowly. First, there is not enough data for sign language translation. Sign language differs by country and region, so each language pair requires its own data. Sign language datasets therefore had to be collected directly for each country, which was difficult because of the large cost. In addition, previous studies dealt only with translation between sign language and text, and even without video information, the vocabularies were very small, averaging about 3,000 words [14, 15, 16]. As algorithms developed, however, they laid the foundation for building sign language datasets. In particular, with the development of algorithms for weakly annotated data [17, 18] and for pose estimation [19, 20], researchers recognized the need for sign language data and began to build datasets.

This development led to the creation of German sign language datasets RWTH-PHOENIX-Weather 2012 [21], RWTH-PHOENIX-Weather 2014 [22], and RWTH-PHOENIX-Weather 2014-T [4] as well as the American sign language dataset ASLG-PC12 [23] and Korean sign language dataset KETI [10].

With the construction of various datasets, the SLT task has developed through deep learning. In many CSLR and SLT studies, frames are encoded into features through Convolutional Neural Networks (CNNs) [24]. Camgoz et al. [4] extracted frame features using a CNN and proposed an SLT method based on the encoder-decoder framework [7] with Recurrent Neural Networks (RNNs). Similarly, Zhou et al. [25] proposed a CSLR model that combines CNNs and RNNs.

As the Transformer [26] came into the spotlight in NMT, many SLT models using Transformers were developed. Yin et al. [27] achieved state-of-the-art results by applying a Transformer to the STMC model of Zhou et al. [25], and Camgoz et al. [28] proposed applying a Transformer to [4] instead of an RNN. In addition, one study proposed a BERT-based SLT method using a pretrained Transformer, unlike previous approaches [29].

There are also skeleton-based SLT studies that use pose estimation models instead of CNNs. Ko et al. [10] achieved keypoint-based SLT on large-scale videos using the KETI dataset. In addition, Gan et al. [13] perform translation using a Graph Convolutional Network (GCN) [30] to extract and weight important clips. Skeleton-based research continues not only in SLT but also in CSLR [31, 32, 33].

2.2 Video Processing

Video processing is mainly performed in action recognition and video recognition tasks, where it greatly influences performance [34, 35, 36]. Gowda et al. [37] proposed a score-based frame selection method called SMART for action recognition, achieving state-of-the-art results on the UCF-101 dataset [38], and Karpathy et al. [39] used a video processing method that treats all videos as fixed-length clips. Video processing is therefore a very useful way to improve performance across a variety of tasks.

Video processing mainly changes the number of frames, either by transforming the frames themselves [40, 41] or by augmenting through manipulation of the frame order [42]. Recently, video processing has also been performed with deep learning; for example, frame augmentation methods using Generative Adversarial Networks (GANs) [43] or genetic algorithms [44] have been proposed.

Frame processing has been used not only in action recognition but also in SLT. Park et al. [12] augmented data in three ways: camera angle conversion, finger length conversion, and random keypoint removal. In addition, Ko et al. [10] increased the amount of data by adding randomness to their skip sampling method.

Table 1: List of keypoints. We divide the body into four parts, excluding the hands. "Included numbers" are the keypoint indices belonging to each part, and the reference point is the keypoint used as the reference for that part.
Part | Reference point | Included numbers
Face | Nose (0) | 0, 1, 2, 3, 4, 11
Upper Body | Neck (12) | 5, 6, 12
Left Arm | Left Elbow (7) | 7, 9
Right Arm | Right Elbow (8) | 8, 10

3 Proposed Approach

In this section, we introduce the overall architecture of the proposed method. Our architecture is decomposed into four parts: keypoint extraction, keypoint normalization, SASS, and the sign language translation model. The overall flow of the proposed architecture is described in Figure 2.

Camgoz et al. [4, 28] apply an embedding step after extracting features of the sign language video frames through a CNN; the embedding reduces the dimension of the large number of feature points extracted from the training images. However, the extracted features can include the background, which becomes noise that affects learning. Therefore, instead of extracting features with a CNN, we propose extracting the keypoints of the signer and using them as feature values. Using keypoints, we can build a model that is robust to the background because it only looks at the movement of the signer.

We propose a video processing method that emphasizes the key frames of the video so that the model can better learn its important parts. First, we propose "Customized Normalization", which further emphasizes the important hand movements in sign language while taking the location of each keypoint into account. In addition, we propose a stochastic augmentation method that gives the key frames of the video a higher selection probability. To handle not only frame augmentation but also the dynamic video frame length T, we propose "SASS", which mixes stochastic augmentation and skip sampling. Finally, sign language translation is performed by applying the method of Bahdanau et al. [7] for the NMT task.

Figure 3: Location and number of the keypoints. We used 55 keypoints and excluded the keypoints of the lower body. Hand keypoints include both the left and right hands. Each normalization part is indicated by color.

3.1 Extraction Of Keypoint

We extract keypoints of the signer to obtain the feature values of the video. This extracts skeleton points from the signer and minimizes the influence of the surrounding background. We used the AlphaPose framework of Fang et al. [20]. AlphaPose can extract keypoints faster than the OpenPose model of Cao et al. [19] by first detecting the signer in a top-down manner and then extracting keypoints from the cropped image. We used a model pretrained on the Halpe dataset [20]. There are 136 keypoints in total; we removed 13 keypoints to exclude the lower body, leaving 123, and then removed the 68 detailed face keypoints, leaving 55, based on the better performance obtained when face keypoints were removed in the study of Ko et al. [10]. Therefore, the keypoint vector is V = (v_1, v_2, ..., v_55), where v_i = (v_i^x, v_i^y) is the position coordinate of each keypoint and i ∈ [1, 55]. The numbering of the keypoint locations follows Figure 3.
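As a concrete illustration, the sketch below loads per-frame AlphaPose results and keeps the 55 upper-body and hand keypoints described above. It is a minimal sketch: the JSON layout (one detection per frame with a flat "keypoints" list) and the exact index ranges for the 13 upper-body joints and the 42 hand joints in the Halpe-136 ordering are assumptions and must be matched to the actual AlphaPose output.

```python
import json
import numpy as np

# Index sets in the Halpe-136 layout (assumed: 26 body, 68 face, 42 hand keypoints).
UPPER_BODY_IDX = list(range(13))      # hypothetical: the 13 upper-body joints we keep
HAND_IDX = list(range(94, 136))       # hypothetical: 21 left-hand + 21 right-hand joints

def load_keypoints(alphapose_json_path):
    """Return an array of shape (num_frames, 55, 2) holding (x, y) per keypoint."""
    with open(alphapose_json_path) as f:
        results = json.load(f)
    frames = []
    for det in results:               # one detection per frame, single signer assumed
        kp = np.asarray(det["keypoints"], dtype=np.float32).reshape(-1, 3)  # (136, x/y/score)
        frames.append(kp[UPPER_BODY_IDX + HAND_IDX, :2])                    # drop confidence
    return np.stack(frames)           # (T, 55, 2)
```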

3.2 Customized Normalization

In this paper, we propose a customized normalization that takes the position of each keypoint into account for each body part. Because body proportions vary from person to person, a normalization method that accounts for this is needed. Although a robust keypoint normalization method was proposed by Kim et al. [11], it sets the reference point on only some parts of the body rather than all of them, so it does not consider every position. Therefore, this paper proposes a normalization method that emphasizes both the location of each keypoint and the movement of the hands.

We designate a reference point for each body part: the face, upper body, left arm, and right arm. The reference point is defined as r = (r_x, r_y). The keypoint numbers belonging to each body part are shown in Table 1, and the names corresponding to each number can be found in Figure 3. In addition, a representative value for the entire body of the signer is defined as the center point c = (c_x, c_y), which follows (1).

c_{x}=\frac{1}{55}\sum_{i=1}^{55}v_{i}^{x},\qquad c_{y}=\frac{1}{55}\sum_{i=1}^{55}v_{i}^{y} (1)

Here, v_i^x and v_i^y are the elements of the vectors V_x and V_y obtained by separating the x and y coordinates of the vector V. V_x and V_y are defined as follows:

V_{x}=(v_{1}^{x},v_{2}^{x},\cdots,v_{55}^{x}),\qquad V_{y}=(v_{1}^{y},v_{2}^{y},\cdots,v_{55}^{y}) (2)

The elements of the vectors V_x and V_y are normalized following (3).

\begin{gathered}S_{point}=\{Nose,\ Neck,\ LElbow,\ RElbow\}\\ d_{point}=\sqrt{(c_{x}-r_{x})^{2}+(c_{y}-r_{y})^{2}},\quad (r_{x},r_{y})\in S_{point}\\ (\widetilde{x}_{body},\widetilde{y}_{body})=\left(\frac{v^{x}-c_{x}}{d_{point}},\ \frac{v^{y}-c_{y}}{d_{point}}\right)\end{gathered} (3)

Equation (3) computes the Euclidean distance d_point between the center point c and the reference point of each part, and normalizes by this distance. Therefore, the keypoints of the body, excluding both hands, follow (x̃_body, ỹ_body).

Next, we apply MinMax scaling so that the keypoints of both hands have the same scale and the movements of the two hands can be compared equally. MinMax scaling adjusts the range to [0, 1] by subtracting the minimum coordinate value and dividing by the difference between the maximum and minimum values. To prevent the minimum value from becoming 0 and causing a vanishing gradient problem, we subtract 0.5 and shift the range to [-0.5, 0.5]. We define the keypoints of both hands that have undergone this process as (x̃_hand, ỹ_hand), which follows (4):

(\widetilde{x}_{hand},\widetilde{y}_{hand})=\left(\frac{v^{x}-v_{min}^{x}}{v_{max}^{x}-v_{min}^{x}}-0.5,\ \frac{v^{y}-v_{min}^{y}}{v_{max}^{y}-v_{min}^{y}}-0.5\right) (4)

where v_max^x and v_min^x are the maximum and minimum values of the hand keypoints, respectively. The x and y points to which our normalization is applied are thus the vectors V_x^* = [x̃_body; x̃_hand] and V_y^* = [ỹ_body; ỹ_hand], and the final keypoint vector becomes V^* = [V_x^*; V_y^*] ∈ R^{55×2}.
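A minimal sketch of the customized normalization of Equations (1)-(4) is given below. The part-to-index mapping follows Table 1; the assumption that the 42 hand keypoints occupy indices 13-54 is illustrative, as are the small epsilon terms that guard against division by zero.

```python
import numpy as np

# part -> (reference keypoint index, member indices), following Table 1 / Figure 3.
PART_REFERENCE = {
    "face":       (0,  [0, 1, 2, 3, 4, 11]),
    "upper_body": (12, [5, 6, 12]),
    "left_arm":   (7,  [7, 9]),
    "right_arm":  (8,  [8, 10]),
}
HAND_IDX = list(range(13, 55))   # assumption: both hands follow the 13 body keypoints

def customized_normalization(V):
    """V: (55, 2) array of (x, y) keypoints for one frame -> normalized (55, 2)."""
    V = V.astype(np.float32)
    c = V.mean(axis=0)                                   # center point, Eq. (1)
    out = np.empty_like(V)
    for ref_idx, members in PART_REFERENCE.values():
        r = V[ref_idx]
        d = np.linalg.norm(c - r) + 1e-8                 # distance to reference, Eq. (3)
        out[members] = (V[members] - c) / d
    hands = V[HAND_IDX]
    v_min, v_max = hands.min(axis=0), hands.max(axis=0)
    out[HAND_IDX] = (hands - v_min) / (v_max - v_min + 1e-8) - 0.5   # Eq. (4)
    return out
```

Applying this function frame by frame yields the V* vectors that are stacked into the model input X.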

Figure 4: SASS, our frame selection method. If the length T of the video is less than a fixed scalar N, stochastic augmentation is used; if T is greater than N, skip sampling is used to match T to N.

3.3 SASS

To train with deep learning, the input size of the model must be fixed. Previous studies reduced the dimension and fixed the size through embedding [4, 28], but this loses information about the video. Therefore, we use a method that reflects the information in the sign language video as it is, without embedding, by fixing the length of the input through frame augmentation or sampling.

We propose a general-purpose method that fixes the length of the input while remaining applicable to various datasets, and that emphasizes the hand-moving frames, which are the key frames of a sign language video, without losing them. Since the frames at the beginning and end of a video usually contain little hand movement, we propose an augmentation method that emphasizes the middle frames of the video. When the video is very long, a sampling method that does not lose the key frames is applied instead. Combining these two methods, we propose an augmentation and sampling method that accounts for the different number of frames in each video. We call it "Stochastic Augmentation and Skip Sampling (SASS)".

We define the i-th training video as X_i = {x_t}_{t=0}^{T} and L as the number of training videos, where i ∈ [1, L]. We select N frames from the frames of each training video X_i, where N is the average number of frames and follows (5).

N=\frac{1}{L}\sum_{i=1}^{L}|\textsc{X}_{i}| (5)

If the number of frames in X_i is greater than N, sampling is used; if it is less than N, augmentation is used to adjust the number of frames to N. The input value X is therefore X ∈ R^{L×N×|V*|}. This method avoids both the information loss that frame sampling can cause and the heavy memory consumption that frame augmentation can cause. The overall flow of our SASS method follows Figure 4.
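The SASS dispatch can be sketched as follows; it relies on the skip_sampling and stochastic_augmentation helpers sketched in the next two subsections, and the rounding of N and the handling of videos whose length already equals N are assumptions.

```python
import numpy as np

def fix_length(videos):
    """videos: list of arrays shaped (T_i, 55, 2) -> array of shape (L, N, 55, 2)."""
    N = int(round(np.mean([len(v) for v in videos])))    # Eq. (5): mean frame count
    fixed = []
    for v in videos:
        T = len(v)
        if T > N:
            idx = skip_sampling(T, N)                    # Sec. 3.3.2
        elif T < N:
            idx = stochastic_augmentation(T, N)          # Sec. 3.3.1
        else:
            idx = np.arange(N)                           # already the target length
        fixed.append(v[idx])
    return np.stack(fixed)                               # model input X
```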

3.3.1 Stochastic Augmentation

Rather than selecting frames with uniform probability, our method reconstructs the selection distribution so that key frames have a higher probability of being chosen. We construct the set of frame selection probabilities by varying the probability p of a binomial distribution. We define the set of frame selection probabilities with parameter p as F_p = {f(0; T, p), f(1; T, p), ..., f(T-1; T, p)}, where the function f follows (6):

f(k;T,p)=(T1k)pk(1p)Tk1wherek[0,T1]f(k;T,p)={T-1\choose k}p^{k}(1-p)^{T-k-1}\ \ where\ \ k\in[0,T-1] (6)

With this binomial form, key frames can be selected preferentially.

We do not fix p to a single value but construct a set Pr_n of several probability values, which follows (7).

Pr_{0}=\{\tfrac{1}{2}\},\qquad Pr_{n}=Pr_{n-1}\cup\{\tfrac{1}{n+2},\tfrac{n+1}{n+2}\},\qquad p\sim Pr_{n} (7)

That is, the probability p is drawn from the set Pr_n, and F_p is reconstructed accordingly. The reconstructed set of frame selection probabilities is called F*_p, which follows (8).

F^{*}_{p}=\frac{1}{l_{p}}\sum_{p\in Pr_{n}}F_{p} (8)

where l_p is the length of Pr_n. However, constructing F*_p in this way does not yet concentrate augmentation on the middle portion of the video, which is the original purpose. Therefore, we rearrange F*_p around the median so that the middle portion of the video has a higher selection probability.

Figure 5: Change of the probability distribution and kurtosis with n. (Top) Probability distribution of F*_p according to the change in l_p. (Bottom) Kurtosis of F*_p according to the change in l_p.

That is, if the frame index k is smaller than ⌊T/2⌋, the probabilities are sorted in ascending order, and if it is larger, they are sorted in descending order. This produces the final set of frame selection probabilities, from which the priorities for the frame augmentation order are obtained. Figure 5 shows the probability distribution and kurtosis of F*_p according to l_p; as l_p increases, the kurtosis increases.
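A sketch of the stochastic augmentation step under one plausible reading of the reordering rule is given below: the binomial mixture of Equations (6)-(8) is computed over the frame indices, its values are rearranged so that the largest probabilities fall on the middle frames, and N indices are then drawn with replacement and kept in temporal order.

```python
import numpy as np
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)      # Eq. (6) with n = T - 1

def frame_selection_probs(T, lp):
    """F*_p over frame indices 0..T-1, reordered to peak near the middle frame."""
    n = (lp - 1) // 2                                     # |Pr_n| = 2n + 1, Eq. (7)
    Pr = [0.5] + [q for m in range(1, n + 1) for q in (1 / (m + 2), (m + 1) / (m + 2))]
    F = np.array([np.mean([binom_pmf(k, T - 1, p) for p in Pr]) for k in range(T)])  # Eq. (8)
    srt = np.sort(F)                                      # ascending values
    F_star = np.empty_like(F)
    F_star[: (T + 1) // 2] = srt[0::2]                    # rising toward the median frame
    F_star[(T + 1) // 2:] = srt[1::2][::-1]               # falling after the median frame
    return F_star / F_star.sum()

def stochastic_augmentation(T, N, lp=17, rng=None):
    """Return N frame indices drawn (with replacement) according to F*_p, in order."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(T, size=N, replace=True, p=frame_selection_probs(T, lp))
    return np.sort(idx)
```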

Table 2: Comparison of normalization methods. Each metric is reported as dev / test; the left three metric columns are for KETI and the right three for RWTH-PHOENIX-Weather 2014 T.
Method | BLEU-4 | ROUGE-L | METEOR | BLEU-4 | ROUGE-L | METEOR
Standard | 78.99 / 76.44 | 78.84 / 76.28 | 78.84 / 76.28 | 12.08 / 7.36 | 24.74 / 23.57 | 26.93 / 25.87
Robust | 79.17 / 76.29 | 79.07 / 76.23 | 79.08 / 76.18 | 9.19 / 8.04 | 24.73 / 23.57 | 26.92 / 25.86
MinMax | 79.75 / 79.01 | 79.75 / 78.9 | 79.75 / 78.9 | 9.88 / 8.05 | 26.04 / 24.86 | 27.47 / 26.11
Kim et al. [11] method | 73.95 / 72.91 | 73.8 / 72.77 | 73.8 / 72.77 | 11.67 / 10.45 | 27.2 / 26.22 | 27.92 / 27.2
All reference | 77.98 / 76.35 | 78.09 / 76.22 | 78.06 / 76.22 | 11.16 / 9.3 | 27.08 / 25.68 | 28.22 / 27.33
Center reference | 75.65 / 75.02 | 75.82 / 74.81 | 75.79 / 74.81 | 12.1 / 8.28 | 27.66 / 25.68 | 28.8 / 26.79
Ours | 85.3 / 84.39 | 85.48 / 84.85 | 85.58 / 82.31 | 12.81 / 13.31 | 24.64 / 24.72 | 25.86 / 25.85

3.3.2 Skip Sampling

We chose a sampling method that avoids missing key frames when selecting frames. Ko et al. [10] proposed a random skip sampling method that does not lose key frames, and we follow it here. When the fixed size is N and the number of frames in the current video is T, the average gap c between sampled frames follows (9).

c=\lfloor\frac{T}{N-1}\rfloor (9)

S_n is the base frame index sequence of the video, an arithmetic sequence defined with c as shown in (10).

S_{n}=s+(n-1)c,\quad s=\frac{T-c(N-1)}{2},\quad n\in[1,N] (10)

A random sequence R with values in the range [1, c] is added to the base sequence S_n defined in this way. Note that the value of the last index is clipped to the range [1, T]. With skip sampling defined in this way, the order of frames can be considered without losing the key frames.
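A sketch of skip sampling following Equations (9) and (10) is shown below; the exact handling of the random offsets and the clipping of indices are assumptions based on the description above.

```python
import numpy as np

def skip_sampling(T, N, rng=None):
    """Return N frame indices spread over a video of length T (T >= N)."""
    rng = np.random.default_rng() if rng is None else rng
    c = T // (N - 1)                              # average gap, Eq. (9)
    s = (T - c * (N - 1)) / 2                     # center the base sequence
    base = s + c * np.arange(N)                   # Eq. (10): S_n = s + (n - 1)c
    jitter = rng.integers(1, c + 1, size=N)       # random offsets in [1, c]
    idx = np.clip(base + jitter, 1, T).astype(int) - 1   # clip, convert to 0-based
    return np.sort(idx)
```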

3.4 Sign Language Translation

We used a GRU-based sequence-to-sequence (Seq2Seq) model with an encoder and decoder in the translation step and improved translation performance by adding attention. This structure is widely used in NMT and is suitable for translation tasks with variable-length inputs and outputs. We used Bahdanau attention [7]. The encoder-decoder goes through the following process given a source sequence X = (x_1, x_2, ..., x_{T_x}) and a target sentence Y = (y_1, y_2, ..., y_{T_y}); in our case, the encoder encodes keypoint vectors rather than a sentence. First, the encoder computes the hidden states and generates a context vector from them:

h_{t}=f(x_{t},h_{t-1}),\qquad c=q(h_{1},\ldots,h_{T_{x}}) (11)

h_t is the hidden state computed at time step t, and c is the context vector generated through the encoder and a nonlinear function q. Next, the decoder computes the joint probability of the target words in the process of predicting them, which follows (12).

p(\textsc{Y})=\prod^{T_{y}}_{i=1}p(y_{i}|\{y_{1},y_{2},\ldots,y_{i-1}\},c) (12)

The conditional probability at each time step in (12) is computed through the following equation.

p(y_{i}|y_{1},y_{2},\ldots,y_{i-1},\textsc{X})=softmax(g(s_{i})) (13)

In (13), s_i is the hidden state of the decoder at time i and follows:

s_{i}=f(y_{i-1},s_{i-1},c_{i}) (14)

The context vector c_i is computed as a weighted sum of the encoder hidden states h_j and follows (15).

c_{i}=\sum^{T_{x}}_{j=1}a_{ij}h_{j} (15)

where

a_{ij}=\frac{exp(score(s_{i-1},h_{j}))}{\sum^{T_{x}}_{k=1}exp(score(s_{i-1},h_{k}))} (16)

Here, score is an alignment function that computes the match between the previous decoder hidden state and each encoder hidden state.
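As an illustration of the attention step, the following is a minimal PyTorch sketch of additive (Bahdanau) attention over the bi-GRU encoder outputs, computing the weights of Equation (16) and the context vector of Equation (15). The layer sizes and the learned scoring vector v are standard choices for additive attention, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s_{i-1}, h_j) = v^T tanh(W_s s_{i-1} + W_h h_j)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_outputs):
        # s_prev: (B, dec_dim), enc_outputs: (B, T_x, enc_dim)
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_outputs)))
        a = torch.softmax(scores, dim=1)            # attention weights, Eq. (16): (B, T_x, 1)
        context = (a * enc_outputs).sum(dim=1)      # context vector c_i, Eq. (15): (B, enc_dim)
        return context, a.squeeze(-1)
```

At each decoding step, the context vector is concatenated with the previous target embedding and fed to the GRU decoder cell to produce s_i.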

Table 3: (Top) Comparison of SLT performance by sampling method. (Bottom) Comparison of SLT performance by augmentation method. Each metric is reported as dev / test; the left three metric columns are for KETI and the right three for RWTH-PHOENIX-Weather 2014 T.
Method | BLEU-4 | ROUGE-L | METEOR | BLEU-4 | ROUGE-L | METEOR
Random Sampling | 76.09 / 75.11 | 77.17 / 16.61 | 77.08 / 76.45 | 7.77 / 5.94 | 21.74 / 21.12 | 23.36 / 22.63
Skip Sampling | 82.4 / 81.9 | 83.35 / 82.74 | 83.32 / 82.96 | 9.52 / 7.55 | 25.07 / 22.35 | 26.36 / 23.77
Stochastic Sampling | 78.9 / 78.2 | 79.92 / 79.07 | 79.78 / 79.08 | 8.19 / 6.44 | 21.45 / 21.29 | 22.82 / 22.86
Random Augmentation | 83.69 / 83.68 | 84.19 / 84.15 | 84.19 / 84.25 | 13.71 / 11.3 | 28.11 / 27.36 | 27.59 / 26.89
Stochastic Augmentation | 82.25 / 83.33 | 83.19 / 80.06 | 80.74 / 80.75 | 13.74 / 12.65 | 28.08 / 27.59 | 26.36 / 25.73
Table 4: Comparison of combinations of sampling and augmentation. Each metric is reported as dev / test; the left three metric columns are for KETI and the right three for RWTH-PHOENIX-Weather 2014 T.
Method | BLEU-4 | ROUGE-L | METEOR | BLEU-4 | ROUGE-L | METEOR
Skip + Random | 84.88 / 84.24 | 85.25 / 84.75 | 85.29 / 84.89 | 11.64 / 11.5 | 27.63 / 27.63 | 29.69 / 28.68
Stochastic + Random | 83.85 / 83.92 | 84.44 / 84.65 | 84.4 / 84.57 | 10.36 / 10.16 | 27.48 / 26.77 | 29.47 / 28.38
Stochastic + Stochastic | 84.94 / 83.5 | 85.37 / 83.93 | 85.45 / 83.95 | 9.71 / 10.83 | 27.88 / 27.04 | 30.01 / 28.7
SASS (ours) | 85.3 / 84.39 | 85.48 / 84.85 | 85.58 / 85.07 | 12.81 / 13.31 | 24.64 / 24.72 | 25.86 / 25.85

4 Experiments

To prove the effectiveness of the proposed method, we conduct experiments on two datasets: KETI [10] and RWTH-PHOENIX-Weather 2014 T [4]. The KETI dataset consists of Korean sign language videos covering 105 sentences and 419 words, recorded in full high definition (HD). There were 10 signers, and the videos were filmed from two different angles. Because the KETI dataset does not provide a test set, we randomly split it at a ratio of 8:1:1, yielding 6043, 800, and 801 videos for train, dev, and test, respectively. We extracted frames at 30 fps. The KETI dataset provides no glosses, so we translate directly into spoken language.

The RWTH-PHOENIX-Weather 2014 T dataset extends the existing RWTH-PHOENIX-Weather 2014 [22] corpus of public German sign language. The split of videos for train, dev, and test is 7096, 519, and 642, respectively, with no overlap with RWTH-PHOENIX-Weather 2014.

The average numbers of frames of the training videos in these two datasets are 153 and 116, respectively. Therefore, the inputs to our model are X_kor ∈ R^{6043×153×110} and X_ger ∈ R^{7096×116×110}, respectively.

Our experiments were conducted on an NVIDIA RTX A6000 GPU and an AMD EPYC 7302 16-core CPU. Our model was implemented in PyTorch [45] and trained with the Adam optimizer [46] and cross-entropy loss. The learning rate was set to 0.001, the number of epochs to 100, and the dropout ratio to 0.5 to prevent overfitting. Finally, the dimension of the hidden states was 512.

We tokenize differently according to the characteristics of Korean and German. For Korean, the KoNLPy library's Mecab part-of-speech (POS) tagger was used, and for German, tokenization was performed with the nltk library [47]. Finally, we evaluate our model with three standard NMT metrics: BLEU-4 [48], ROUGE-L [49], and METEOR [50].
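The tokenization and BLEU-4 evaluation can be sketched as follows; the snippet assumes the MeCab-ko dictionary for KoNLPy and the nltk 'punkt' models are installed, and the lowercasing of the German text is an assumption.

```python
from konlpy.tag import Mecab                        # requires the MeCab-ko dictionary
import nltk                                         # requires nltk 'punkt' models
from nltk.translate.bleu_score import corpus_bleu

mecab = Mecab()

def tokenize(sentence, lang):
    if lang == "ko":
        return mecab.morphs(sentence)               # morpheme-level tokens (KETI)
    return nltk.word_tokenize(sentence.lower(), language="german")   # PHOENIX

# Illustrative BLEU-4 computation on a toy reference/hypothesis pair:
refs = [[tokenize("und nun die wettervorhersage", "de")]]
hyps = [tokenize("und nun die wettervorhersage", "de")]
print(corpus_bleu(refs, hyps))                      # default weights give BLEU-4
```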

We conduct four sets of experiments. First, we compare normalization methods. Second, we experiment with video sampling and augmentation. Third, we study the effect of the length of the set Pr_n (i.e., l_p), the representative value N, and the order of the input sequence. Finally, we demonstrate the strength of our model compared to previous studies.

4.1 Effect Of Normalization

This experiment shows that our method is strong compared to various normalization methods; the results are shown in Table 2. We fixed all elements except the normalization method. We compared our method with "Standard Normalization" (X = (x - μ)/σ), which uses the standard deviation σ and mean μ of the points; "Min-Max Normalization" (X = (x - x_min)/(x_max - x_min) - 0.5), which uses the minimum x_min and maximum x_max of the points; and "Robust Normalization" (X = (x - x_{2/4})/(x_{3/4} - x_{1/4})), which uses the median x_{2/4} and the interquartile range (IQR), where IQR = x_{3/4} - x_{1/4}. Standard normalization is the method proposed by Ko et al. [10].
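For reference, the three baseline normalizations can be sketched as below, applied per coordinate axis over the 55 keypoints of a frame; the small epsilon terms are added only to avoid division by zero.

```python
import numpy as np

def standard_norm(x):            # (x - mean) / std
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def minmax_norm(x):              # scaled to [-0.5, 0.5]
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + 1e-8) - 0.5

def robust_norm(x):              # (x - median) / IQR
    q1, q2, q3 = np.percentile(x, [25, 50, 75], axis=0)
    return (x - q2) / (q3 - q1 + 1e-8)
```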

In addition, we experimented with two normalization methods that did not exist before. First, "all reference" applies (3) to the hands as well. Second, "center reference" uses only each keypoint vector v and the center vector c without a separate reference point, dividing the difference between v and c by their distance. Finally, we compared with the method of Kim et al. [11], which fixes the reference point to the right shoulder.

The experiments show that our method is the strongest. Performance improves considerably over the methods of Kim et al. and Ko et al. from previous studies. Moreover, it performs better not only on the dataset with relatively large frames (KETI) but also on PHOENIX, whose images are relatively small, demonstrating that it is a powerful new method.

4.2 Effect Of SASS

In this section, we examine the difference in performance according to the frame selection method. Keypoints were normalized with our customized normalization method, and everything else was fixed except the frame selection method. We demonstrate the strength of our proposed method with three comparative experiments: sampling only, augmentation only, and the combination of sampling and augmentation.

First, we compare the results when only sampling is performed. As sampling methods, we compare the stochastic frame sampling and skip sampling methods proposed in this paper, as well as a random sampling baseline that selects frames from a uniform distribution without any additional processing.

In this case, the number of sampled frames was set to N = 50 for the KETI dataset, matching the setting of Ko et al., and for the PHOENIX dataset it was set to N = 16, the minimum frame length of the videos. In the sampling comparison, the skip sampling technique showed the best results on both datasets. In addition, stochastic sampling clearly outperforms random sampling from a uniform distribution.

Second, we experimented with the performance change when only frame augmentation is used. We set N to the maximum frame count of the training videos in each dataset, i.e., 426 for KETI and 475 for PHOENIX. We compared a random augmentation method following a uniform distribution with our stochastic augmentation method. The results are shown in Table 3.

Table 5: BLEU-4 score by l_p (dev / test).
l_p | KETI | PHOENIX
3 | 84.38 / 83.1 | 11.25 / 11.17
5 | 84.49 / 83.27 | 12.23 / 10.48
7 | 84.4 / 83.97 | 12.42 / 11.74
9 | 83.41 / 83.85 | 14.84 / 12.28
11 | 84.35 / 83.45 | 13.49 / 12.03
13 | 84.78 / 83.31 | 12.31 / 12.01
15 | 84.76 / 81.68 | 12.64 / 11.68
17 | 85.3 / 84.39 | 12.81 / 13.31
Mean | 84.48 / 83.38 | 12.75 / 11.84

On KETI, random augmentation was better in every metric. However, on PHOENIX, the main evaluation metric, BLEU-4, was best with stochastic augmentation. Overall performance improved with frame augmentation compared to sampling alone.

Finally, we experimented with combinations of sampling and augmentation, considering four cases in total. Since random sampling performed poorly in the previous experiment, we combined only the remaining methods. The results are shown in Table 4. The SASS method, which combines our skip sampling and stochastic augmentation, performed best on KETI, and its BLEU-4 was by far the highest on PHOENIX.

4.3 Additional Experiments

In this section, we experiment with several factors that affect the model. We compare performance according to changes in the probability set Pr_n and according to N.

We also conduct experiments on inverting the order of the input values, a method for improving translation performance proposed by Sutskever et al. [6]. First, we examine the performance change according to l_p, the size of the probability set Pr_n.

We experiment with l_p ∈ {3, 5, 7, 9, 11, 13, 15, 17} and evaluate only with BLEU-4. The results are shown in Table 5. We achieve the best performance on both datasets when l_p is 17. Even when compared using the average BLEU value over all l_p, our method remains superior to the other methodologies on the PHOENIX dataset.

Table 6: BLEU-4 score by N (dev / test).
N | KETI | PHOENIX
Mean | 85.3 / 84.39 | 12.81 / 13.31
Median | 85.22 / 82.78 | 12.65 / 9.27

Table 7: BLEU-4 score by order of input values (dev / test).
Order | KETI | PHOENIX
Reverse | 85.1 / 84.3 | 12.38 / 9.13
Original | 85.3 / 84.39 | 12.81 / 13.31

Next, we compared different criteria for setting N. We set N to either the mean or the median frame length of the training videos; the median numbers of frames in the two datasets are 148 and 112, respectively. The comparison is shown in Table 6.

Finally, we examined the effect of encoding the input X in reverse order. This method exploits the observation that, in language, words near the beginning of the source sentence are more strongly associated with the beginning of the target sentence, so reversing the input order can improve alignment with the target. Therefore, we reversed the frame order of each video to (x_{T-1}, x_{T-2}, ..., x_0) before encoding. The results are shown in Table 7. They show that setting N to the mean works best and that keeping the original frame order works best. We conclude that changing the input order is not meaningful here because our inputs are keypoint values of frames, not words of a sentence.

4.4 Comparison Of Previous Works

In this section, we show the strength of our model by comparing it with existing studies. For the KETI dataset, we re-implemented the baselines and conducted the comparison ourselves because we randomly split the data into training, validation, and test sets. Ko et al. used additional data in their study; we reproduced their model without the additional data. The results are shown in Table 8.

In addition, because our method targets datasets without glosses, on the PHOENIX dataset we compare only against video-to-text translation, not gloss-based methods. The results are shown in Table 9.

Our model performs well compared to previous studies on both datasets. In particular, it achieves the best performance among keypoint-based SLT models on the KETI dataset. The BLEU score is also the highest on the PHOENIX dataset: BLEU-4 increased significantly (by about 4 points) compared to Sign2Text (Bahdanau attention), indicating that the improvement comes from our video processing method even though the same Bahdanau attention is used.

Table 8: Comparison with other models on the KETI dataset (dev / test).
Method | BLEU-4 | ROUGE-L | METEOR
Ko et al. [10] | 77.96 / 76.44 | 77.99 / 76.56 | 77.99 / 76.49
Kim et al. [11] | - / 78.00 | - / 81.00 | - / 83.00
Ours | 85.3 / 84.39 | 85.48 / 84.85 | 85.58 / 85.07

Table 9: Comparison with other models on the PHOENIX dataset (dev / test).
Method | BLEU-4 | ROUGE-L | METEOR
Sign2Text (Luong) [4] | 10.00 / 9.00 | 31.8 / 31.8 | - / -
Sign2Text (Bahdanau) [4] | 9.94 / 9.58 | 32.6 / 30.7 | - / -
Ours | 12.81 / 13.31 | 24.64 / 24.72 | 25.86 / 25.85
Table 10: Ablation Study of Video Processing Methods. ”Normalization” means our customized normalization method. ”Sampling” is our skip sampling method and ”Augmentation” is our stochastic augmentation method.
KETI | RWTH-PHOENIX-Weather-2014 T
Normalization Sampling Augmentation | BLEU-4 ROUGE-L METEOR | BLEU-4 ROUGE-L METEOR
76.44 76.56 76.49 | 7.48 21.91 23.75
77.00 77.62 77.81 | 11.4 24.62 24.93
77.78 78.42 78.61 | 7.36 24.5 26.21
81.9 82.74 82.96 | 7.55 22.35 23.77
80.06 80.74 80.75 | 12.65 27.59 25.73
84.39 84.85 85.07 | 13.31 24.72 25.85

5 Ablation Study

We conduct ablation studies to analyze the effectiveness of our proposed method, measuring the change in performance depending on the presence or absence of our normalization and frame selection techniques. The studied combinations of components are shown in Table 10.

Table 11: Translation from our networks (GT : Ground Truth)
RWTH-PHOENIX-Weather-2014 T
Reference
und nun die wettervorhersage
für morgen samstag den zweiten april .
(and now the weather forecast
for tomorrow, Saturday, the second of April .)
Ours
und nun die wettervorhersage
für morgen samstag den achtundzwanzigsten juli .
(and now the weather forecast for
tomorrow Saturday the twenty-eighth of July.)
Reference liebe zuschauer guten abend .
(dear viewers, good evening .)
Ours liebe zuschauer guten abend .
(dear viewers, good evening .)
Reference
der wind weht meist mäßig im osten und an der see frischer wind
mit starken böen in den hochlagen der osthälfte sturmböen .
(the wind blows mostly moderately in the east and fresh wind at the sea
with strong gusts; storm gusts in the higher parts of the eastern half .)
Ours
der wind meist schwach mit mäßig bis mäßig mit
schauern sonst kommt auch frisch sonst aus .
(the wind mostly weak with moderate to moderate
with showers, otherwise it also comes fresh .)
Reference
und nun die wettervorhersage für
morgen mittwoch den dreißigsten märz .
(and now the weather forecast
for tomorrow, Wednesday, March 30th .)
Ours
und nun die wettervorhersage
für morgen mittwoch den zwanzigsten oktober .
(and now the weather forecast
for tomorrow, wednesday, october 20th .)
KETI
Reference 가스가 새고 있어요 .
(Gas is leaking .)
Ours 가스가 새고 있는 것 같아요 .
(I think gas is leaking .)
Reference 남편이 높은데서 떨어져서 머리에 피나요 .
(My husband fell from a height and it bleeds from his head .)
Ours 남편이 높은데서 떨어져서 머리에 피나요 .
(My husband fell from a height and it bleeds from his head .)
Reference 술 취한 사람이 방망이로 사람들을 때리고 있어요 .
(A drunken man is hitting people with a bat .)
Ours 술 취한 사람이 방망이로 사람들을 때리고 있어요 .
(A drunken man is hitting people with a bat .)

We compare the effect of our normalization method; when it is not used, standard normalization is used instead. All combinations of frame selection methods were tested: only skip sampling was used for sampling, only stochastic augmentation was used for augmentation, and finally SASS, our method combining the two, was applied.

Through this experiment, we show that performance is better with Customized Normalization and best when our full method, SASS, is used.

6 Qualitative Results

In this section, we report qualitative results. Given a sign language video, we share the sentences generated by our SLT model. For the KETI dataset, the Korean outputs and their English translations are shown together (see Table 11).

On PHOENIX, relatively short sentences are predicted well, but for longer sentences the end of the sentence is harder to predict. On KETI, almost all sentences, including long ones, are translated almost exactly.

7 Conclusion And Future Work

In this paper, we proposed a keypoint-based SLT model together with a sign language video processing method suited to it. We introduced a new normalization method for processing sign language keypoint vectors that did not exist in previous studies. In addition, we made the method robust across multiple datasets by setting the number of input frames N appropriately for each dataset rather than using the same arbitrary fixed number for all datasets. Finally, similar to Camgoz et al. [4], we solved the SLT problem as an NMT problem.

To prove the effectiveness of this method, we experimented with both a high-resolution dataset (KETI) and a low-resolution dataset (RWTH-PHOENIX-Weather 2014 T). The experimental evaluation shows that our method is robust on sign language datasets of various sizes. Moreover, our video processing method is general enough to be used not only for sign language translation but also for many other tasks that require video processing.

In future work, we will implement Sign-to-Gloss-to-Text (S2G2T) based on our video processing method to achieve sign language translation using glosses. Based on Camgoz et al. [28] and Yin et al. [27], we expect better performance if text is translated after gloss recognition. In addition, we expect that applying various NMT models will yield further performance improvements.

References

  • [1] A. S. Santos and A. J. F. Portes, “Perceptions of deaf subjects about communication in primary health care,” Revista Latino-Americana de Enfermagem, vol. 27, 2019.
  • [2] S. Tamura and S. Kawasaki, “Recognition of sign language motion images,” Pattern recognition, vol. 21, no. 4, pp. 343–353, 1988.
  • [3] K. Cormier, N. Fox, B. Woll, A. Zisserman, N. C. Camgöz, and R. Bowden, “Extol: Automatic recognition of british sign language using the bsl corpus,” in Proceedings of 6th Workshop on Sign Language Translation and Avatar Technology (SLTAT) 2019, University of Surrey, 2019.
  • [4] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7784–7793, 2018.
  • [5] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • [7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [8] R. Sutton-Spence and B. Woll, The linguistics of British Sign Language: an introduction. Cambridge University Press, 1999.
  • [9] P. B. Braem and R. Sutton-Spence, The Hands Are The Head of The Mouth. The Mouth as Articulator in Sign Languages. Hamburg: Signum Press, 2001.
  • [10] S.-K. Ko, C. J. Kim, H. Jung, and C. Cho, “Neural sign language translation based on human keypoint estimation,” Applied Sciences, vol. 9, no. 13, p. 2683, 2019.
  • [11] S. Kim, C. J. Kim, H.-M. Park, Y. Jeong, J. Y. Jang, and H. Jung, “Robust keypoint normalization method for korean sign language translation using transformer,” in 2020 International Conference on Information and Communication Technology Convergence (ICTC), pp. 1303–1305, IEEE, 2020.
  • [12] C.-I. Park and C.-B. Sohn, “Data augmentation for human keypoint estimation deep learning based sign language translation,” Electronics, vol. 9, no. 8, p. 1257, 2020.
  • [13] S. Gan, Y. Yin, Z. Jiang, L. Xie, and S. Lu, “Skeleton-aware neural sign language translation,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 4353–4361, 2021.
  • [14] S. Morrissey, H. Somers, R. Smith, S. Gilchrist, and S. Dandapat, “Building a sign language corpus for use in machine translation,” Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, 2010.
  • [15] C. Schmidt, O. Koller, H. Ney, T. Hoyoux, and J. Piater, “Using viseme recognition to improve a sign language translation system,” in Proceedings of the 10th International Workshop on Spoken Language Translation: Papers, 2013.
  • [16] D. Stein, C. Schmidt, and H. Ney, “Analysis, preparation, and optimization of statistical sign language machine translation,” Machine Translation, vol. 26, no. 4, pp. 325–357, 2012.
  • [17] P. Buehler, A. Zisserman, and M. Everingham, “Learning sign language by watching tv (using weakly aligned subtitles),” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2961–2968, IEEE, 2009.
  • [18] H. Cooper and R. Bowden, “Learning signs from subtitles: A weakly supervised approach to sign language recognition,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2568–2574, IEEE, 2009.
  • [19] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7291–7299, 2017.
  • [20] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “Rmpe: Regional multi-person pose estimation,” in Proceedings of the IEEE international conference on computer vision, pp. 2334–2343, 2017.
  • [21] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, “Rwth-phoenix-weather: A large vocabulary sign language recognition and translation corpus,” in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 3785–3789, 2012.
  • [22] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, “Extensions of the sign language recognition and translation corpus rwth-phoenix-weather,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp. 1911–1916, 2014.
  • [23] A. Othman and M. Jemni, “English-asl gloss parallel corpus 2012: Aslg-pc12,” in 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon LREC, 2012.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [25] H. Zhou, W. Zhou, Y. Zhou, and H. Li, “Spatial-temporal multi-cue network for continuous sign language recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13009–13016, 2020.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [27] K. Yin and J. Read, “Better sign language translation with stmc-transformer,” in Proceedings of the 28th International Conference on Computational Linguistics, pp. 5975–5989, 2020.
  • [28] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, “Sign language transformers: Joint end-to-end sign language recognition and translation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10023–10033, 2020.
  • [29] M. De Coster, K. D’Oosterlinck, M. Pizurica, P. Rabaey, M. Van Herreweghe, J. Dambre, and S. Verlinden, “Frozen pretrained transformers for neural sign language translation,” in 18th Biennial Machine Translation Summit, pp. 88–97, 2021.
  • [30] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [31] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, and Y. Fu, “Skeleton based sign language recognition using whole-body keypoints,” arXiv e-prints, pp. arXiv–2103, 2021.
  • [32] Q. Xiao, M. Qin, and Y. Yin, “Skeleton-based chinese sign language recognition and generation for bidirectional communication between deaf and hearing people,” Neural networks, vol. 125, pp. 41–55, 2020.
  • [33] W. Liang and X. Xu, “Skeleton-based sign language recognition with attention-enhanced graph convolutional networks,” in CCF International Conference on Natural Language Processing and Chinese Computing, pp. 773–785, Springer, 2021.
  • [34] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in neural information processing systems, vol. 27, 2014.
  • [35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, pp. 4489–4497, 2015.
  • [36] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 6202–6211, 2019.
  • [37] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, “Smart frame selection for action recognition,” arXiv preprint arXiv:2012.10671, 2020.
  • [38] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [39] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
  • [40] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702, 2015.
  • [41] Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid, “Skeletonnet: Mining deep part features for 3-d action recognition,” IEEE signal processing letters, vol. 24, no. 6, pp. 731–735, 2017.
  • [42] M. Hoai and A. Zisserman, “Improving human action recognition using score distribution and ranking,” in Asian conference on computer vision, pp. 3–20, Springer, 2014.
  • [43] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [44] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” Advances in neural information processing systems, vol. 29, 2016.
  • [45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  • [46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [47] E. Loper and S. Bird, “Nltk: The natural language toolkit,” arXiv preprint cs/0205028, 2002.
  • [48] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
  • [49] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, pp. 74–81, 2004.
  • [50] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005.