
AFE-CNN: 3D Skeleton-based Action Recognition with Action Feature Enhancement

Shannan GUAN, Haiyan LU, Linchao ZHU, Gengfa FANG
Australian Artificial Intelligence Institute, School of Electrical and Data Engineering, University of Technology Sydney, Australia
Abstract

Existing 3D skeleton-based action recognition approaches achieve impressive performance by encoding handcrafted action features into image format and decoding them with CNNs. However, such methods are limited in two ways: a) handcrafted action features struggle to handle challenging actions, and b) they generally require complex CNN models to improve recognition accuracy, which incurs a heavy computational burden. To overcome these limitations, we introduce the novel AFE-CNN, which enhances the features of 3D skeleton-based actions to adapt to challenging actions. We propose feature enhancement modules from the key-joint, bone-vector, key-frame, and temporal perspectives, so that AFE-CNN is more robust to variations in camera view and body size and significantly improves recognition accuracy on challenging actions. Moreover, AFE-CNN adopts a light-weight CNN model to decode the feature-enhanced images, which ensures a much lower computational burden than state-of-the-art methods. We evaluate AFE-CNN on three benchmark skeleton-based action datasets: NTU RGB+D, NTU RGB+D 120, and UTKinect-Action3D; extensive experimental results demonstrate the outstanding performance of AFE-CNN.

keywords:
3D Skeleton, Action Recognition, Feature Enhance, Attention

1 Introduction

Human action recognition can be roughly divided into two steps: 1) action feature extraction and 2) action classification, wherein the quality of the action features largely impacts classification accuracy. Early works [1, 2, 3, 4, 5] extract action features from RGB videos, which however suffer from background noise and illumination changes. Recently, depth cameras (e.g., the Kinect sensor) have emerged as powerful tools for acquiring 3D skeleton data. Compared with RGB videos, 3D skeleton data are more robust to background noise and illumination changes and can effectively avoid cluttered backgrounds and irrelevant objects. As a result, significant effort has been devoted in recent years to 3D skeleton-based human action recognition.

Existing works using 3D skeleton data apply conventional classifiers, e.g., k-Nearest Neighbor [6] and Support Vector Machine [7], which can easily recognize actions from raw skeletons. However, these conventional methods [8, 9] cannot adapt to challenging actions such as human-to-human and human-to-object interactions and are infeasible on large-scale datasets [10]. To overcome these shortcomings, recent efforts [11, 12, 13, 14, 15] adopt deep learning-based methods (e.g., recurrent neural networks (RNN) [16], long short-term memory (LSTM) [14], graph convolutional networks (GCN) [17] and convolutional neural networks (CNN) [11]) to recognize actions. Among deep models, RNN-, LSTM-, and GCN-based models inherently suffer from increasing model complexity (e.g., a large number of input features) and computational burden when improving action recognition accuracy. Moreover, they are easily trapped in overfitting and cannot directly learn high-level features with spatio-temporal information [10].

In contrast, CNN-based methods encode high-level features and spatio-temporal information in an image, which can effectively reduce model overfitting and represent 3D skeleton data in a more comprehensive way. The core of developing CNN-based action recognition models is to effectively encode features from 3D skeleton data so as to improve recognition accuracy [18]. Existing CNN-based approaches adopt handcrafted encodings [19, 20, 10, 21, 22, 23, 18] to map features to images. Specifically, these handcrafted approaches focus on compressing more features (e.g., skeleton motion [22], joint angles [10]) into one image and employ a state-of-the-art CNN architecture to decode the features and recognize actions. However, such handcrafted images struggle with challenging data that contain various camera views, body sizes, and marginally different actions (e.g., writing and typing).

In this paper, we propose AFE-CNN: a learning-based action feature enhancement model that enhances the features of 3D skeleton-based actions for better encoding of actions into image format. Specifically, in contrast to handcrafted action-feature images, AFE-CNN uses learning-based modules to enhance action features from the key-joint and bone-vector perspectives, which makes the recognition model more robust to various camera views and body sizes. Then, a light-weight four-stream CNN model learns comprehensive information from the feature-enhanced images and recognizes actions. Notably, the proposed feature enhancement modules effectively improve model generalization on challenging actions (e.g., reading, writing). Furthermore, AFE-CNN emphasizes key frames in skeleton sequences via a frame attention module, and we embed temporal information into the transformed images to enhance temporal cues. Finally, we train AFE-CNN end to end and achieve 86.2% under the cross-subject metric and 92.2% under the cross-view metric on the benchmark dataset NTU RGB+D. AFE-CNN not only achieves outstanding performance but also incurs very low computational cost, e.g., 3.5 ms for one forward inference. Our main contributions are summarized as follows:

  • 1. We propose a novel learning-based model to enhance the features of 3D skeleton-based actions, which achieves state-of-the-art performance on three benchmark datasets for the action recognition task.

  • 2. We design action feature enhancement modules that adapt to various camera views and body sizes, which significantly improves the recognition accuracy of our model on challenging actions.

  • 3. Our AFE-CNN adopts a light-weight CNN architecture that alleviates the computational burden and largely reduces computing time.

2 Related Works

The works most relevant to our study can be roughly classified into two categories: the first contains deep learning-based approaches for 3D skeleton-based action recognition, and the second contains approaches that represent action features visually for 3D skeleton-based action recognition.

2.1 Deep Learning-based Action Recognition

Recognizing an action generally requires the spatio-temporal information in a 3D skeleton sequence, and RNN and LSTM networks are well suited to processing such information. Liao et al. [16] encoded the spatio-temporal information of streamed 3D skeleton data and decoded it with a novel hybrid RNN architecture to recognize actions. To reduce the effect of irrelevant joints, Liu et al. [24] designed a global context-aware attention LSTM to selectively learn spatio-temporal information from the most relevant joints. In [25], Si et al. combined a spatial reasoning network and a temporal stack learning network to learn the spatial structural and temporal information of skeleton sequences. With a similar idea of combining networks, Zhang et al. [26] combined two LSTM streams: one decodes the spatio-temporal information in the skeleton sequence, and the other transforms viewpoints to handle varying camera views.

From the perspective of 3D skeleton structure, key joints are connected to each other by bone links, which contain rich spatial information. Hence, GCN-based models offer another way to represent spatio-temporal information in a 3D skeleton sequence. Li et al. [27] adopted a "two-stream" strategy and proposed an actional-structural graph convolutional network to capture actional and structural links, learning both spatial and temporal features for action recognition. To capture the variation of spatio-temporal information in 3D skeleton data, Gao et al. [28] applied spatio-temporal modeling of 3D skeleton data and optimization over consecutive frames for efficient spatio-temporal data representation. In [29], Papadopoulos et al. adopted two novel GCN-based models to capture vertex features and short/long-term temporal features for action recognition. However, such GCN-based methods generally have low computational efficiency due to their complex network structures.

2.2 Action Features via Visual Representation

Generally, a 3D skeleton sequence can be easily encoded into one image and further decoded by a CNN-based model; thus, recent CNN-based works mostly focus on exploring features in 3D skeleton data and representing them as images. Kim et al. [19] directly transformed frame-wise 3D joint coordinates into images and decoded them with a residual temporal CNN model. In [20], Li et al. manually transformed raw 3D skeleton data into images via joint reference and projection, and employed a pre-trained VGG-19 model for end-to-end training. Besides such high-level features, representing low-level features (e.g., joint angles, motion direction and magnitude) is also considered an effective way to improve the accuracy of 3D skeleton-based action recognition. Kim et al. [10] represented features using the spatial correlations of joints and the temporal dynamics of 3D skeletons and achieved remarkable performance. To eliminate the effect of camera view variations, Liu et al. [23] presented a skeleton feature enhancement method for view-invariant action recognition and further applied skeleton motion enhancement to improve performance. In [18], Li et al. separated the skeleton into several body parts and mapped them to images via a scale-invariant transformation to eliminate the effect of different body sizes. Combining GCN and CNN is another way to explore spatio-temporal information in 3D skeleton data: Zhang et al. [30] developed a simple yet effective GCN-based model that enhances joint features by introducing joint semantics and hierarchically exploiting their relationships, and used a CNN model to explore correlations across frames.

Generally speaking, the spatio-temporal information in 3D skeleton data can be easily encoded in image format and effectively decoded by CNN-based models. However, existing approaches that use handcrafted action-feature images struggle with challenging actions and generally require a large-scale CNN model to decode the spatio-temporal information in image patterns. It is therefore highly desirable to develop a more efficient feature encoding that can be decoded by a light-weight CNN model to reduce the computational burden.

3 Methodology

Figure 1: Pipeline of our AFE-CNN. The 3D skeleton sequence is first transformed into four images by the Action Feature Enhance model; these four images are then fed into the CNN-based action recognition model.

3.1 Model Overview

As shown in Fig. 1, our AFE-CNN contains two main models: the Action Feature Enhance model and the Action Recognition model. The Action Feature Enhance model consists of four blocks: 1) a Skeleton Sequence Image Transform block; 2) a Multi-Frame Enhance block; 3) a Skeleton Motion Velocity Image Transform block; and 4) a Temporal Embedding block.

In Block 1, the input is the 3D skeleton sequence and the output contains two parts. The first part consists of three images: the Multi-Frame Attention Map (MFAM), the Key-Joints Feature Enhance Image (KJEI), and the Bone-Vectors Feature Enhance Image (BVEI). Specifically, the MFAM is computed from the skeleton frames by the Multi-Frame Attention Module, the KJEI emphasizes key joints of the skeleton sequence and is produced by the Key-Joints Feature Enhance Module, and the BVEI encodes bone vectors of the skeleton sequence via the Bone-Vectors Feature Enhance Module. The second part contains two matrices, \mathbf{M} and \mathbf{N}, where \mathbf{M} represents the enhanced key joints produced by the Key-Joints Feature Enhance Module and \mathbf{N} represents the enhanced bone vectors produced by the Bone-Vectors Feature Enhance Module. In Block 2, the inputs are the MFAM, KJEI, and BVEI from Block 1 and the outputs are two new images: the frame-enhanced KJEI (F-KJEI) and the frame-enhanced BVEI (F-BVEI). In Block 3, the inputs are \mathbf{M} and \mathbf{N}, and the outputs are two images generated by the joint velocity transform module: the Key-Joints Motion Velocity Image (KJVI) and the Bone-Vectors Motion Velocity Image (BVVI). In Block 4, the Temporal Embedding module takes the four images from Blocks 2 and 3 as inputs and generates four feature-enhanced action-transformed images that include the temporal information of the skeleton sequence: the temporal frame-enhanced KJEI (TF-KJEI), the temporal frame-enhanced BVEI (TF-BVEI), the temporal-enhanced KJVI (T-KJVI), and the temporal-enhanced BVVI (T-BVVI). The Action Recognition model consists of four light-weight convolutional neural networks and several fully connected layers; its inputs are the four enhanced action-transformed images (TF-KJEI, TF-BVEI, T-KJVI, and T-BVVI) from Block 4 and its output is the action label.

Figure 2: Detailed structures of the three proposed modules: (a) Key-Joints Feature Enhance Module, (b) Bone-Vectors Feature Enhance Module, and (c) Multi-Frame Attention Module.

3.2 Action Feature Enhance Model

In this section, we discuss the details of the five modules in the Action Feature Enhance model.

3.2.1 Key-Joints Feature Enhance Module

Generally, actions with subtle limb-motion differences cause only pixel-level differences in their action-transformed images. For instance, the transformed image of “writing” is only slightly different from that of “typing”, since the limb-motion differences are reflected only in the fingers. Such subtle motion differences may not be picked up by CNN-based models, leading to sub-optimal action recognition. To enlarge the differences among the transformed images of actions, we propose a Key-Joints Feature Enhance Module that uses the key joints of a given skeleton sequence to enhance the features of the transformed images.

Let \mathcal{X}=\{\mathbf{X}^{t}\in\mathbb{R}^{3\times J}\mid t=1,\ldots,T\} denote the skeleton frames in a given skeleton sequence, where \mathbf{X}^{t} is a 3\times J matrix that represents the 3D coordinates of the J joints in frame t, and T is the number of frames in the sequence. As shown in Fig. 2(a), \mathcal{X} first passes through a concatenation step to form a T\times J\times 3 tensor. This tensor then passes through two fully connected layers with the LeakyReLU [31] activation function and outputs a scale matrix \mathbf{W}\in\mathbb{R}^{1\times J\times 1}. In parallel, the tensor is transformed into a 3-channel image of size J\times T: \widetilde{\mathbf{X}}\in\mathbb{R}^{3\times J\times T}. Finally, the scaled joints matrix is calculated by:

\mathbf{M}=\mathbf{W}\times\widetilde{\mathbf{X}} (1)

Thereafter, \mathbf{M} is fed into an embedding layer for linear transformation, where the embedding layer is a T\times J weight matrix, to produce the key-joints feature enhanced image \widetilde{\mathbf{M}}\in\mathbb{R}^{3\times T\times T}, which is referred to as KJEI.
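
As a concrete illustration, the PyTorch sketch below shows one plausible realization of this module; the hidden width of the two fully connected layers and the parameter initialization are our assumptions, since the paper does not specify them.

    import torch
    import torch.nn as nn

    class KeyJointFeatureEnhance(nn.Module):
        """Sketch of the Key-Joints Feature Enhance Module (Fig. 2(a))."""
        def __init__(self, num_frames, num_joints, hidden=64):
            super().__init__()
            # Two fully connected layers that regress the per-joint scale W (1 x J x 1).
            self.scale_net = nn.Sequential(
                nn.Flatten(),                                   # (B, T*J*3)
                nn.Linear(num_frames * num_joints * 3, hidden),
                nn.LeakyReLU(0.1),
                nn.Linear(hidden, num_joints),
            )
            # Embedding layer: a learnable T x J weight matrix that maps the
            # scaled joints M (3 x J x T) to the KJEI image (3 x T x T).
            self.embed = nn.Parameter(torch.randn(num_frames, num_joints) * 0.01)

        def forward(self, x):
            # x: (B, T, J, 3) 3D joint coordinates of a skeleton sequence.
            B, T, J, _ = x.shape
            w = self.scale_net(x).view(B, 1, J, 1)              # per-joint scales W
            x_img = x.permute(0, 3, 2, 1)                       # (B, 3, J, T) 3-channel image
            m = w * x_img                                       # Eq. (1): M = W x X~
            kjei = torch.einsum('tj,bcjs->bcts', self.embed, m)  # (B, 3, T, T) KJEI
            return m, kjei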

3.2.2 Bone-Vectors Feature Enhance Module

Having captured key-joint features with the Key-Joints Feature Enhance Module, we further include a Bone-Vectors Feature Enhance Module to enhance the transformed images using bone-vector features. Bone vectors contain details of human actions that cannot be captured from less ideal camera views, e.g., views taken from behind. A camera standing behind a person can only film the person's back, so information about hand motion and gestures is lost. Fortunately, bone vectors provide an integral map of the human body that can fill in the missing motion information. We now introduce our Bone-Vectors Feature Enhance Module, which uses the bone vectors of the human body to enhance the transformed images of actions.

Let \boldsymbol{p}_{k} denote the 3D coordinates of the k-th joint and \boldsymbol{b}_{k} denote the bone vector formed by the r-th and q-th joints, which can also be represented as \mathbf{X}\boldsymbol{c}_{k} and calculated as follows:

\boldsymbol{b}_{k}=\boldsymbol{p}_{r}-\boldsymbol{p}_{q}=\mathbf{X}\boldsymbol{c}_{k} (2)

where \boldsymbol{c}_{k}=(0,\ldots,0,1,0,\ldots,0,-1,0,\ldots,0)^{T}, with the 1 and -1 entries indicating \boldsymbol{p}_{r} and \boldsymbol{p}_{q}, respectively. For each \mathbf{X}^{t}, the bone vectors \boldsymbol{b}_{k}^{t} are calculated and concatenated to form \mathbf{B}^{t}=(\boldsymbol{b}_{1}^{t},\boldsymbol{b}_{2}^{t},\ldots,\boldsymbol{b}_{b}^{t})\in\mathbb{R}^{3\times b} according to the formula below:

\mathbf{B}^{t}=\mathbf{X}^{t}\mathbf{C} (3)

where t\in\{1,\ldots,T\} indexes the t-th frame in the skeleton sequence and \mathbf{C}\in\mathbb{R}^{J\times b} is obtained by concatenating the corresponding vectors \{\boldsymbol{c}_{1},\cdots,\boldsymbol{c}_{b}\}. As such, the collection of bone-vector matrices over the T frames is defined as \mathcal{B}=\{\mathbf{B}^{t}\in\mathbb{R}^{3\times b}\mid t=1,\ldots,T\}.
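
For illustration, the short snippet below builds the matrix \mathbf{C} of Eq. (3) for a toy kinematic tree and computes the bone vectors of one frame; the joint indices are hypothetical and do not correspond to the NTU RGB+D joint layout.

    import torch

    def bone_incidence(num_joints, bones):
        """Build C of Eq. (3): column k has +1 at joint r and -1 at joint q."""
        C = torch.zeros(num_joints, len(bones))
        for k, (r, q) in enumerate(bones):
            C[r, k] = 1.0
            C[q, k] = -1.0
        return C

    J = 5
    bones = [(1, 0), (2, 1), (3, 1), (4, 3)]   # toy kinematic tree (hypothetical)
    C = bone_incidence(J, bones)               # (J, b)
    X_t = torch.randn(3, J)                    # 3D joints of one frame
    B_t = X_t @ C                              # Eq. (3): bone vectors, shape (3, b)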

As shown in Fig. 2(b), \mathcal{X} passes through similar steps to those in Fig. 2(a), except for the "transform to \mathcal{B}" step; here the fully connected layers output a matrix \mathbf{V}\in\mathbb{R}^{1\times b\times 1} that scales the bone-vector lengths. The bone-vector collection \mathcal{B} is converted into a 3-channel image of size b\times T: \widetilde{\mathbf{B}}\in\mathbb{R}^{3\times b\times T}, and the scaled bone matrix is calculated by:

\mathbf{N}_{v}=\mathbf{V}\times\widetilde{\mathbf{B}} (4)

After that, the scaled bone matrix is manipulated to recover the key-joint positions by:

\mathbf{N}=\mathbf{C}^{-1}\mathbf{N}_{v}+\boldsymbol{p}_{0} (5)

where \boldsymbol{p}_{0} is the root joint position. Finally, we feed \mathbf{N} to the embedding layer, a T\times J weight matrix, to produce the bone-vectors feature enhanced image \widetilde{\mathbf{N}}\in\mathbb{R}^{3\times T\times T}, which is referred to as BVEI.
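
A sketch of the full module follows. The layer widths are assumptions; in addition, since \mathbf{C} is J\times b and generally not square, we realize the \mathbf{C}^{-1} recovery step of Eq. (5) with a Moore-Penrose pseudo-inverse, which is our reading rather than a stated detail, and the root position \boldsymbol{p}_{0} is simply passed in.

    import torch
    import torch.nn as nn

    class BoneVectorFeatureEnhance(nn.Module):
        """Sketch of the Bone-Vectors Feature Enhance Module (Fig. 2(b))."""
        def __init__(self, num_frames, C, hidden=64):
            super().__init__()
            J, num_bones = C.shape
            self.register_buffer('C', C)                          # J x b, Eq. (3)
            self.register_buffer('C_pinv', torch.linalg.pinv(C))  # b x J, recovery step (assumption)
            self.scale_net = nn.Sequential(                       # regresses bone scales V (1 x b x 1)
                nn.Flatten(),
                nn.Linear(num_frames * J * 3, hidden),
                nn.LeakyReLU(0.1),
                nn.Linear(hidden, num_bones),
            )
            self.embed = nn.Parameter(torch.randn(num_frames, J) * 0.01)  # T x J embedding

        def forward(self, x, root):
            # x: (B, T, J, 3) skeleton sequence; root: (B, 3) root joint position p0.
            B, T, J, _ = x.shape
            bones = torch.einsum('btjc,jk->btkc', x, self.C)      # B^t = X^t C, (B, T, b, 3)
            v = self.scale_net(x).view(B, 1, -1, 1)               # bone scales V
            n_v = v * bones.permute(0, 3, 2, 1)                   # Eq. (4), (B, 3, b, T)
            n = torch.einsum('bckt,kj->bcjt', n_v, self.C_pinv)   # recover joints, (B, 3, J, T)
            n = n + root.view(B, 3, 1, 1)                         # Eq. (5): add p0
            bvei = torch.einsum('tj,bcjs->bcts', self.embed, n)   # (B, 3, T, T) BVEI
            return n, bvei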

3.2.3 Multi-Frame Attention Module (MFAM)

Generally, the label of an action depends on a series of key skeleton frames, which means that the frames in an action sequence should carry different weights when classifying the action. For example, given an action sequence labeled “drinking”, it is essentially the “drink” frames that decide the label rather than the “picking up the cup” frames. Inspired by the self-attention mechanism in the Transformer [32], we assign different importance scores to skeleton frames to emphasize key frames for CNN-based action classification. We now introduce our Multi-Frame Attention Module, which uses self-attention to yield different weights for the skeleton-transformed images.

As shown in Fig. 2(c), \mathcal{X} is first fed to a fully connected layer and then split into two branches: the left branch W^{Q} provides the query matrix \mathbf{Q}\in\mathbb{R}^{T\times J} and the right branch W^{K} provides the key matrix \mathbf{K}\in\mathbb{R}^{T\times J}. We compute the dot products of \mathbf{Q} and the transposed key matrix \mathbf{K}^{T}\in\mathbb{R}^{J\times T}, and divide the dot products by \sqrt{d_{k}}, where d_{k} denotes the dimension of the key matrix. Formally,

\mathbf{A}=\operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\right) (6)

where \mathbf{A}\in\mathbb{R}^{T\times T} is the output multi-frame attention map.

Thereafter, as shown in Fig. 1, we multiply the multi-frame attention map \mathbf{A} with the key-joints feature enhanced image \widetilde{\mathbf{M}}. To prevent information loss, we add \widetilde{\mathbf{M}} back after the multiplication, and the final key-joint attention feature image is computed by:

\operatorname{Attention}(\mathbf{Q},\mathbf{K},\widetilde{\mathbf{M}})=\widetilde{\mathbf{M}}\odot\mathbf{A}+\widetilde{\mathbf{M}} (7)

where \odot is the element-wise product and \operatorname{Attention} is the attention operator in our Multi-Frame Attention Module. Similarly, the bone-vector attention feature image is obtained by:

\operatorname{Attention}(\mathbf{Q},\mathbf{K},\widetilde{\mathbf{N}})=\widetilde{\mathbf{N}}\odot\mathbf{A}+\widetilde{\mathbf{N}} (8)
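
The sketch below shows how Eqs. (6)-(8) could be realized in PyTorch; the shared fully connected layer and the sizes of the query/key projections are assumptions, as only the attention formula itself is given above.

    import math
    import torch
    import torch.nn as nn

    class MultiFrameAttention(nn.Module):
        """Sketch of the Multi-Frame Attention Module (Fig. 2(c)), Eqs. (6)-(8)."""
        def __init__(self, num_joints):
            super().__init__()
            self.shared = nn.Linear(3 * num_joints, num_joints)    # shared FC applied to each frame
            self.w_q = nn.Linear(num_joints, num_joints, bias=False)  # query branch W^Q
            self.w_k = nn.Linear(num_joints, num_joints, bias=False)  # key branch W^K

        def forward(self, x, image):
            # x: (B, T, J, 3) skeleton frames; image: (B, 3, T, T) KJEI or BVEI.
            B, T, J, _ = x.shape
            h = self.shared(x.reshape(B, T, J * 3))                 # (B, T, J)
            q, k = self.w_q(h), self.w_k(h)                          # Q, K of size T x J
            attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(J), dim=-1)  # Eq. (6), (B, T, T)
            # Eqs. (7)/(8): element-wise product with the attention map plus a residual term.
            return image * attn.unsqueeze(1) + image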

3.2.4 Joint Velocity Image Transform Module

Inspired by the two-stream architecture [1], which uses optical-flow fields to complement the original stream, we propose a joint velocity transform module that computes the moving velocity of each key joint as another action-feature representation stream. Given a key-joint position \boldsymbol{p}=(x,y,z), a J-joint skeleton is denoted as \mathbf{S}=\{\boldsymbol{p}_{1},\boldsymbol{p}_{2},\ldots,\boldsymbol{p}_{J}\}. The moving velocity of each key joint between two frames is computed by:

\mathbf{E}=\frac{\mathbf{S}^{t+1}-\mathbf{S}^{t}}{\Delta T}=\left\{\frac{\boldsymbol{p}_{1}^{t+1}-\boldsymbol{p}_{1}^{t}}{\Delta T},\frac{\boldsymbol{p}_{2}^{t+1}-\boldsymbol{p}_{2}^{t}}{\Delta T},\ldots,\frac{\boldsymbol{p}_{J}^{t+1}-\boldsymbol{p}_{J}^{t}}{\Delta T}\right\} (9)

where \Delta T denotes the time difference between two consecutive frames and t is the frame index. As shown in Fig. 1, we feed the scaled joints matrix \mathbf{M} and the scaled bone matrix \mathbf{N} into the transform module, which outputs J\times T image matrices. These are then transformed into 3\times T\times T image matrices by a linear transformation, named the key-joints motion velocity image (KJVI) and the bone-vectors motion velocity image (BVVI), respectively.
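
A minimal sketch of Eq. (9) on the image-shaped tensors is shown below; how the module keeps the temporal length at T is not stated, so repeating the last velocity is our assumption.

    import torch

    def joint_velocity(m, delta_t=1.0):
        """Frame-wise velocity of Eq. (9) applied to a scaled joint (or bone) matrix.

        m: (B, 3, J, T) scaled joints M or recovered bones N; returns a tensor of the
        same shape, with the last velocity repeated so the temporal length stays T.
        """
        vel = (m[..., 1:] - m[..., :-1]) / delta_t        # differences between consecutive frames
        return torch.cat([vel, vel[..., -1:]], dim=-1)     # pad back to T frames (assumption)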

3.2.5 Temporal Embedding Module (TE)

It is well acknowledged that recognizing an action depends heavily on the timing of its key poses [33]. For example, standing up and sitting down are temporally opposite actions: one can be regarded as the time-reversed sequence of the other. A 3D skeleton-based action recognition pipeline generally transforms a skeleton sequence into an action-feature image, where the temporal information is represented by the relative positions of pixels. However, a conventional CNN is considered spatially agnostic [34], which makes it difficult to recover the temporal information of an action-transformed image. To address this problem, we propose a Temporal Embedding (TE) module to enhance the temporal information in the action-transformed images.

Inspired by ViT [35], we add a 1\times T-dimensional relative positional embedding as temporal information for the action-transformed images. Here, we regard each column of an action-transformed image as the pose of the corresponding key frame and use the relative distance between columns to encode the temporal order. As shown in Fig. 1, we add a learned 1\times T-dimensional positional embedding to all action-transformed images to enhance their temporal features. These images are then fed to our four-stream CNN model.
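
A minimal sketch of this module, assuming one learned offset per column of the T x T images:

    import torch
    import torch.nn as nn

    class TemporalEmbedding(nn.Module):
        """Learned 1 x T positional embedding added to every action-transformed image,
        one value per column (i.e., per key-frame position)."""
        def __init__(self, num_frames):
            super().__init__()
            self.pos = nn.Parameter(torch.zeros(1, 1, 1, num_frames))  # (1, 1, 1, T)

        def forward(self, image):
            # image: (B, 3, T, T); broadcasting adds the same offset to each column.
            return image + self.pos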

3.3 Action Recognition Model

In this section, we discuss the details of the CNN models used in our action recognition model and our training strategy.

3.3.1 Four-Stream CNN

Figure 3: Four-stream CNN-based action recognition model.

To obtain more discriminative features from the action-transformed images, we propose a four-stream CNN-based action recognition model. Our four-stream CNN is a modified version of the lightweight CNN proposed by Caetano et al. [36]. This lightweight CNN contains three convolution layers with a structure similar to VGG Net [37], followed by two fully connected layers. We modify their convolution layers with different padding sizes and different dimensions of the fully connected layers. The kernel size of each convolution layer is 3\times 3 with a stride of 2, and each convolution layer is followed by max pooling and a LeakyReLU [31] layer. For each action-transformed image, we employ one CNN stream to extract an action feature vector; we then concatenate the four action feature vectors and feed the concatenated vector to a two-layer fully connected network for action recognition.
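
The sketch below outlines one plausible instantiation of this four-stream architecture; channel widths, padding, the pooling head, and the classifier width are our assumptions since they are not given in the text.

    import torch
    import torch.nn as nn

    class StreamCNN(nn.Module):
        """One stream of the light-weight CNN; channel widths are illustrative."""
        def __init__(self, out_dim=256):
            super().__init__()
            def block(cin, cout):
                return nn.Sequential(
                    nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                    nn.MaxPool2d(2, ceil_mode=True),
                    nn.LeakyReLU(0.1),
                )
            self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, out_dim), nn.LeakyReLU(0.1)
            )

        def forward(self, img):                    # img: (B, 3, T, T)
            return self.head(self.features(img))   # (B, out_dim) action feature vector

    class FourStreamCNN(nn.Module):
        """Four parallel streams (TF-KJEI, TF-BVEI, T-KJVI, T-BVVI) followed by a
        two-layer fully connected classifier, as in Fig. 3."""
        def __init__(self, num_classes, feat_dim=256):
            super().__init__()
            self.streams = nn.ModuleList([StreamCNN(feat_dim) for _ in range(4)])
            self.classifier = nn.Sequential(
                nn.Linear(4 * feat_dim, 512), nn.LeakyReLU(0.1), nn.Linear(512, num_classes)
            )

        def forward(self, images):                 # images: list of four (B, 3, T, T) tensors
            feats = [s(im) for s, im in zip(self.streams, images)]
            return self.classifier(torch.cat(feats, dim=1))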

3.3.2 Model Optimization

Given a skeleton-based action recognition dataset, our CNN-based action recognition architecture is trained with the categorical cross-entropy loss function:

\mathcal{L}_{cce}=-\sum_{i=1}^{c}t_{i}\cdot\log{y}_{i} (10)

where c is the number of action categories, t_{i} denotes the ground-truth value, which is either 0 or 1, and y_{i} is the softmax probability of the i-th action category.
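
In PyTorch, this loss corresponds to nn.CrossEntropyLoss, which combines the softmax and the negative log-likelihood of Eq. (10); the batch size and class count below are only examples.

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()          # categorical cross-entropy of Eq. (10)
    logits = torch.randn(8, 60)                # e.g., a batch of 8 samples, 60 NTU RGB+D classes
    labels = torch.randint(0, 60, (8,))        # integer class labels (one-hot t_i implicitly)
    loss = criterion(logits, labels)           # applies log-softmax internally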

Compared with conventional handcrafted action-transformed-image methods, our approach does not require any manual action feature extraction. Moreover, unlike methods built on large-scale CNN models, we do not need any pre-trained model to improve action recognition performance.

4 Experiments and Discussion

We implement our experiments in the PyTorch framework on a single GeForce RTX 3090 GPU. We train our model with the Adam optimizer using an initial learning rate of 0.001, decayed by a factor of 0.1 after 20 epochs. Due to the different dataset sizes, we use a batch size of 64 for both NTU RGB+D [38] and NTU RGB+D 120 [39], and a batch size of 8 for UTKinect-Action3D [40].
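
The optimizer and schedule described above can be set up as follows; the model is a stand-in and the total number of epochs is illustrative, since it is not stated in the text.

    import torch
    import torch.nn as nn

    model = nn.Linear(4 * 256, 60)                 # stand-in for the AFE-CNN
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                        # initial lr 0.001
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # shrink by 0.1 after 20 epochs

    for epoch in range(40):                        # illustrative epoch budget
        # ... one training pass over the loader (batch size 64 for NTU, 8 for UTKinect-Action3D) ...
        scheduler.step()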

4.1 Datasets

NTU RGB+D: This dataset is recorded with Kinect v2 sensors, and each skeleton is described by the 3D locations of 25 body joints. It contains 56,880 3D skeleton sequences from 40 human subjects and covers 60 daily action categories, including single-person actions, human-object interactions, and human-human interactions. It is a challenging dataset due to its large scale, diverse action categories, and varied camera views. We evaluate our method on this dataset with the two official evaluation metrics: cross-subject (subjects with 20 specified IDs for training and the remaining for testing) and cross-view (samples from camera 1 for testing, samples from cameras 2 and 3 for training).

NTU RGB+D 120: This dataset extends NTU RGB+D with another 60 challenging action categories. Compared with NTU RGB+D, it is more challenging due to more subjects and increased viewpoint variation. In total, it contains 114,480 3D skeleton sequences performed by 106 subjects covering a wide age range. The samples are captured from 155 different camera views and recorded in 32 different collection setups. This dataset provides two evaluation metrics: cross-subject (subjects with 53 specified IDs for training and the remaining for testing) and cross-setup (samples with even setup IDs for training, samples with odd setup IDs for testing).

UTKinect-Action3D: Compared with NTU RGB+D and NTU RGB+D 120, this dataset is significantly smaller. It has 200 3D skeleton sequences covering 10 daily human-object interaction categories, and each skeleton is recorded as 20 body joints. Due to its small scale, we use the cross-subject protocol (half of the subjects for training and the remaining subjects for testing) as the evaluation metric to reduce the risk of model overfitting.

4.2 Experiment Results

Here, we compare our AFE-CNN with several state-of-the-art (SOTA) 3D skeleton-based action recognition methods on NTU RGB+D, NTU RGB+D 120, and UTKinect-Action3D, respectively.

4.3 Results on NTU RGB+D

Table 1: Performance comparisons on NTU-RGB+D
Methods Architecture Accuracy (%)
CrossSubject CrossView
PAM+PTF [8] PAM 68.2 76.3
TSRJI [11] CNN 73.3 80.3
ImageGen+VGG-19 [20] CNN 75.2 82.1
ResTCN [19] CNN 74.3 83.1
Skelemotion [22] CNN 76.5 84.7
MTLN [41] CNN 79.6 84.8
SPMF+ResNet [21] CNN 78.9 86.2
Synthesized CNN [23] CNN 80.0 87.2
MTCNN+RotClips [15] CNN 81.1 87.4
ST-GCN [17] GCN 81.5 88.3
VA-LSTM [42] LSTM 80.7 88.8
PoT2I+Inception-v3 [10] CNN 83.8 90.3
3scale ResNet152 [18] CNN 84.6 90.9
AFE-CNN (Ours) CNN 86.2 92.2

From the results in Table 1, our AFE-CNN achieves the highest accuracy under both the cross-subject (86.2%) and cross-view (92.2%) evaluation metrics. The corresponding confusion matrices of NTU RGB+D under the cross-subject and cross-view metrics are depicted in Fig. 5 and Fig. 4, respectively. It can be observed that there is a large accuracy gap between deep learning-based methods (e.g., MTLN [41], ResTCN [19]) and traditional machine learning-based methods (e.g., TSRJI [11] and PAM+PTF [8]). Since traditional machine learning-based methods struggle with large-scale action datasets, most 3D skeleton-based action recognition methods are developed with deep learning-based architectures (e.g., CNN, LSTM, GCN).

Among these deep learning-based approaches, LSTM-based and GCN-based architectures achieve commendable performance (e.g., ST-GCN [17], VA-LSTM [42]) because they excel at spatio-temporal information processing. However, compared with CNN-based methods, LSTM-based and GCN-based methods consume more computing time due to their complex network structures.

For CNN-based architectures, a few methods directly feed raw coordinate data to a CNN model (e.g., ResTCN [19]), whereas most methods first transform different levels of action features (e.g., high-level action features [20, 15], low-level action features [11, 10, 18, 41], motion features [22, 21, 23]) into image format as inputs to CNN models. To further improve action recognition accuracy, several methods [18, 10, 21] employ large-scale CNN models (e.g., ResNet, Inception-v3) to extract more spatio-temporal information from large action feature images. Although these approaches significantly improve performance thanks to the feature extraction power of large-scale CNNs in image classification, they carry more computational burden, which greatly increases computing time. Moreover, handcrafted action features struggle with challenging data containing various camera views and body sizes, and have become the bottleneck to further performance improvement. In contrast, our AFE-CNN outperforms all handcrafted action-feature-image approaches while requiring only light-weight CNN models.

Figure 4: The confusion matrix of the NTU RGB+D dataset under the cross-view evaluation metric.
Figure 5: The confusion matrix of the NTU RGB+D dataset under the cross-subject evaluation metric.

4.4 Results on NTU RGB+D 120

Table 2: Performance comparisons on NTU-RGB+D 120
Methods Architecture Accuracy (%)
CrossSubject CrossSetup
MTLN [41] CNN 58.4 57.9
MTCNN+RotClips [15] CNN 62.2 61.8
TSRJI [11] CNN 67.9 62.8
Synthesized CNN [23] CNN 60.3 63.2
GCA-LSTM [24] LSTM 61.2 63.3
Skelemotion [22] CNN 67.7 66.9
Logsig-RNN [16] RNN 68.3 67.2
Gimme Signals [43] CNN 70.8 71.6
ST-GCN [17] GCN 70.7 73.2
AS-GCN [27] GCN 77.7 78.9
GVFE+DH-TCN [29] GCN 78.3 79.8
SR-TSL [25] LSTM 74.1 79.9
SGCN [30] GCN 79.2 81.5
AFE-CNN (Ours) CNN 80.4 81.6

As shown in Table 2, our AFE-CNN outperforms the other methods on both the cross-subject (80.4%) and cross-setup (81.6%) metrics. On this dataset, some methods use RNN-based [16] and LSTM-based [24, 25] architectures and achieve promising results; the best LSTM-based method [25] reaches 79.9% accuracy under the cross-setup metric. In addition, the GCN-based approaches [27, 29, 30] achieve commendable performance because GCNs specialize in extracting spatio-temporal information from raw 3D skeleton sequences, but they still suffer from a high computational burden. Compared with GCN- and LSTM-based methods, our AFE-CNN achieves high recognition accuracy while keeping the computational burden low.

Moreover, there is a noticeable accuracy gap between CNN-based and GCN-based methods (e.g., Gimme Signals [43] and GVFE+DH-TCN [29]). This confirms that CNN-based models built on handcrafted feature-transformed images have difficulty with more challenging datasets (e.g., more varied camera views, body sizes, and marginally different actions). In contrast, our AFE-CNN effectively enhances the action features of challenging actions, enabling a light-weight CNN model to achieve outstanding performance.

4.5 Results on UTKinect-Action3D

Table 3: Performance comparisons on UTKinect-Action3D
Methods Architecture Accuracy (%)
MLSTM+Weight Fusion [12] RNN 96.0
GFT [13] GCN 96.0
ST-LSTM [14] LSTM 97.0
PAM+PTF [8] PAM 97.0
Lie Group [9] SVM 97.1
PoT2I+Inception-v3 [10] CNN 98.5
GR-GCN [28] GCN 98.5
GCA-LSTM [24] LSTM 99.0
MTCNN+RotClips [15] CNN 99.0
Multi-Stream CNN [44] CNN 99.0
AFE-CNN (Ours) CNN 99.0

From the results in Table 3, our AFE-CNN achieves an accuracy of 99.0% on this dataset, matching several other methods [24, 15, 44]. The corresponding confusion matrix is shown in Fig. 6, where some walk and wave samples are misclassified as each other. On this small-scale dataset, some traditional machine learning-based methods [8, 9] achieve performance comparable to deep learning-based methods. The CNN-based methods [10, 15, 44] generally perform well, while the GCN-based methods [13, 28] have lower accuracy. Among the CNN-based methods, only PoT2I+Inception-v3 [10] adopts a single-stream CNN architecture, which decodes fewer action features and results in an accuracy 0.5% lower than the other CNN-based methods. Although our AFE-CNN uses a light-weight CNN model, its four-stream architecture ensures that more action features are encoded and decoded; therefore, AFE-CNN also achieves promising performance on a small-scale dataset.

Figure 6: The confusion matrix of UTKinect-Action3D dataset.

4.6 Ablation Study

In this section, we carry out several ablation studies on the NTU RGB+D dataset to validate the contributions of the key modules of AFE-CNN.

4.6.1 Ablation study on multi-frame attention and temporal embedding module

To verify the contributions of our multi-frame attention module and temporal embedding module, we train AFE-CNN with and without them and compare the results under the cross-subject and cross-view metrics.

Table 4: Ablation study on the multi-frame attention and temporal embedding modules. Here AFE-CNN denotes the model without the multi-frame attention module (MFAM) and the temporal embedding module (TE).
Methods Accuracy (%)
CrossSubject CrossView
AFE-CNN 84.9 90.1
AFE-CNN+MFAM 85.7 91.3
AFE-CNN+MFAM+TE 86.2 92.2

As shown in Table 4, the temporal embedding module improves accuracy on the cross-subject and cross-view metrics by 0.5% and 0.9%, respectively. This shows that the temporal embedding module effectively enhances the temporal information in the action feature images and improves the performance of AFE-CNN. The multi-frame attention module improves accuracy on the cross-subject metric by 0.8% and on the cross-view metric by 1.2%, both larger gains than those of the temporal embedding module, indicating that the multi-frame attention module contributes more than the temporal embedding module in AFE-CNN.

4.6.2 Ablation study on Action Feature Enhance Modules

To verify the contributions of our key-joints feature enhance module and bone-vectors feature enhance module, we train the four-stream CNN-based action recognition model with and without these modules and compare the results under the cross-subject and cross-view metrics.

Table 5: Ablation study of the action feature enhance modules. AF-CNN takes raw 3D-skeleton-transformed images without any of our proposed action feature enhance modules. KJFE denotes the key-joints feature enhance module and BVFE the bone-vectors feature enhance module.
Methods Accuracy (%)
CrossSubject CrossView
AF-CNN 79.2 84.0
AF-CNN+KJFE 82.9 89.1
AF-CNN+BVFE 83.8 88.2
AF-CNN+KJFE+BVFE 84.9 90.1

From the results in Table 5, we observe that both the key-joints feature enhance module and the bone-vectors feature enhance module boost the performance of our CNN-based action recognition model. However, they yield different improvements on the two evaluation metrics: KJFE improves more on the cross-view metric (up to 89.1%), while BVFE improves more on the cross-subject metric (up to 83.8%). This is caused by two factors: the variation of camera views across the two evaluation metrics, and the fact that BVFE applies weights on bone vectors, which is more effective than weighting key joints under some rare camera views. When KJFE and BVFE are combined, we see significant performance improvements on both evaluation metrics.

Figure 7: Class-by-class action recognition accuracy of AF-CNN with and without KJFE and BVFE under the cross-subject and cross-view metrics on the NTU RGB+D dataset.

From the class-by-class comparison in Fig. 7, we find that both KJFE and BVFE improve the recognition accuracy of every action class. It is worth mentioning that on challenging action classes (e.g., class 11 reading and class 12 writing), KJFE+BVFE improves accuracy by 36.2% and 66.8%, respectively, under the cross-view metric. For the other action classes, KJFE+BVFE maintains high recognition accuracy.

4.6.3 Ablation study on joint velocity transform module

Table 6: Ablation study on the joint velocity transform module. Here AFE-CNN uses only the key-joints feature enhanced images as inputs; JVTM denotes the joint velocity transform module.
Methods Accuracy (%)
CrossSubject CrossView
AFE-CNN 78.9 85.6
AFE-CNN+JVTM 86.2 92.2

Here we design an ablation study on the joint velocity transform module to analyze its contribution. As shown in Table 6, there is a large gap between AFE-CNN with and without the joint velocity transform module: JVTM improves accuracy on the cross-subject and cross-view evaluation metrics by 7.3% and 6.6%, respectively. Thus, we believe the joint velocity transform module plays a key role in achieving outstanding results in CNN-based action recognition, as it provides motion information that is critical for recognizing actions.

4.7 Complexity measurement

In this section, we analyze the complexity of our AFE-CNN by measuring its computing time and floating-point operations (FLOPs), and compare it with several representative methods. Note that a comparison of processing time cannot be made entirely fair because the frameworks (e.g., TensorFlow, PyTorch) and computing platforms (e.g., single GPU vs. multiple GPUs) are very diverse.

Table 7: Comparison of runtime and resource consumption across methods. FLOPs denotes the number of floating-point operations per forward inference.
Methods Architecture GPUs Times (ms) Flops
Synthesized CNN [23] CNN GTX 1080 GPU \sim 390ms -
SPMF+ResNet [21] CNN GTX 1080Ti GPU \sim 128ms 13.0GFlops
ST-GCN [17] GCN Tesla K80 GPU \sim 93ms 16.3GFlops
PoT2I+Inception-v3 [10] CNN 2 × GTX 1080Ti GPU \sim 38ms 6.0GFlops
AFE-CNN CNN RTX 3090 GPU \sim 3.5ms 1.4GFlops

As shown in Table 7, our AFE-CNN costs only 3.5 ms for one forward inference and consumes only 1.4 GFLOPs, which is significantly lower than the other methods. Compared with other methods [10, 21, 23] using handcrafted action feature images, our AFE-CNN can be fully executed on the GPU. Since we adopt a light-weight CNN model as the backbone, our method minimizes computing time. Although GCN-based models perform well on the 3D skeleton-based action recognition task, they cost more computing time due to their complex computational structure.

4.8 Visualization of Action Feature Images

To illustrate our action feature enhancement mechanism, we visualize the action feature enhanced images and multi-frame attention maps of the drink water and jump up actions, and compare them with action feature images generated without any of the proposed feature enhance modules.

Figure 8: Visualization of action feature images; "origin" denotes images generated without any proposed action feature enhance module.

As shown in Fig. 8, the strips in the action feature enhanced images are much more distinguishable than those in the images without enhancement by our proposed modules. Moreover, it is evident that TF-KJEI, TF-BVEI, T-KJVI, and T-BVVI encode different information about an action, so the CNN model can achieve outstanding performance thanks to the rich features in these enhanced images. For the MFAM, we find that it successfully locates the key frames in an action sequence, where yellow indicates stronger attention.

5 Conclusion

In this paper, we have proposed a novel learning-based action feature enhancement method for 3D skeleton-based action recognition, namely AFE-CNN. First, AFE-CNN enhances action features from the key-joint and bone-vector perspectives to adapt to various camera views and body sizes. Second, the key frames of a skeleton sequence are emphasized by a multi-frame attention module, and a temporal embedding module enhances the temporal information. Thanks to these action feature enhancement modules, AFE-CNN effectively overcomes the limitations of handcrafted action features. Extensive experimental results demonstrate that AFE-CNN achieves state-of-the-art performance on three benchmark datasets and significantly improves the recognition accuracy of challenging action classes. Notably, AFE-CNN adopts light-weight CNN models, so the required computational load and computing time are extremely low, which makes it a good candidate for real-world applications.

References

  • [1] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, Vol. 27, Curran Associates, Inc., 2014.
  • [2] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [3] L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4305–4314.
  • [4] Z. Zhang, S. Liu, S. Liu, L. Han, Y. Shao, W. Zhou, Human action recognition using salient region detection in complex scenes, in: The proceedings of the third international conference on communications, signal processing, and systems, Springer, 2015, pp. 565–572.
  • [5] Y. Ye, Y. Tian, Embedding sequential information into spatiotemporal features for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016.
  • [6] V. Andersson, R. Dutra, R. Araújo, Anthropometric and human gait identification using skeleton data from kinect sensor, in: Proceedings of the 29th Annual ACM Symposium on Applied Computing, 2014, pp. 60–61.
  • [7] D. Xu, X. Xiao, X. Wang, J. Wang, Human action recognition based on kinect and pso-svm by representing 3d skeletons as points in lie group, in: 2016 International Conference on Audio, Language and Image Processing (ICALIP), IEEE, 2016, pp. 568–573.
  • [8] T. Huynh-The, C.-H. Hua, N. A. Tu, T. Hur, J. Bang, D. Kim, M. B. Amin, B. H. Kang, H. Seung, S.-Y. Shin, et al., Hierarchical topic modeling with pose-transition feature for action recognition using 3d skeleton data, Information Sciences 444 (2018) 20–35.
  • [9] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3d skeletons as points in a lie group, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 588–595.
  • [10] T. Huynh-The, C.-H. Hua, T.-T. Ngo, D.-S. Kim, Image representation of pose-transition feature for 3d skeleton-based action recognition, Information Sciences 513 (2020) 112–126.
  • [11] C. Caetano, F. Brémond, W. R. Schwartz, Skeleton image representation for 3d action recognition based on tree structure and reference joints, CoRR abs/1909.05704. arXiv:1909.05704.
    URL http://arxiv.org/abs/1909.05704
  • [12] S. Zhang, Y. Yang, J. Xiao, X. Liu, Y. Yang, D. Xie, Y. Zhuang, Fusing geometric features for skeleton-based action recognition using multilayer lstm networks, IEEE Transactions on Multimedia 20 (9) (2018) 2330–2343.
  • [13] J.-Y. Kao, A. Ortega, D. Tian, H. Mansour, A. Vetro, Graph based skeleton modeling for human activity analysis, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 2025–2029.
  • [14] J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal lstm with trust gates for 3d human action recognition, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International Publishing, Cham, 2016, pp. 816–833.
  • [15] Q. Ke, M. Bennamoun, S. An, F. Sohel, F. Boussaid, Learning clip representations for skeleton-based 3d action recognition, IEEE Transactions on Image Processing 27 (6) (2018) 2842–2855.
  • [16] S. Liao, T. J. Lyons, W. Yang, H. Ni, Learning stochastic differential equations using RNN with log signature features, CoRR abs/1908.08286. arXiv:1908.08286.
    URL http://arxiv.org/abs/1908.08286
  • [17] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: Thirty-second AAAI conference on artificial intelligence, 2018.
  • [18] B. Li, Y. Dai, X. Cheng, H. Chen, Y. Lin, M. He, Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn, in: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), IEEE, 2017, pp. 601–604.
  • [19] T. S. Kim, A. Reiter, Interpretable 3d human action analysis with temporal convolutional networks, CoRR abs/1704.04516. arXiv:1704.04516.
    URL http://arxiv.org/abs/1704.04516
  • [20] C. Li, S. Sun, X. Min, W. Lin, B. Nie, X. Zhang, End-to-end learning of deep convolutional neural network for 3d human action recognition, in: 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), IEEE, 2017, pp. 609–612.
  • [21] H.-H. Pham, L. Khoudour, A. Crouzil, P. Zegers, S. A. Velastin, Skeletal movement to color map: A novel representation for 3d action recognition with inception residual networks, in: 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 3483–3487. doi:10.1109/ICIP.2018.8451404.
  • [22] C. Caetano, J. Sena, F. Brémond, J. A. dos Santos, W. R. Schwartz, Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition, CoRR abs/1907.13025. arXiv:1907.13025.
    URL http://arxiv.org/abs/1907.13025
  • [23] M. Liu, H. Liu, C. Chen, Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognition 68 (2017) 346–362.
  • [24] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, A. C. Kot, Skeleton-based human action recognition with global context-aware attention lstm networks, IEEE Transactions on Image Processing 27 (4) (2017) 1586–1599.
  • [25] C. Si, Y. Jing, W. Wang, L. Wang, T. Tan, Skeleton-based action recognition with spatial reasoning and temporal stack learning, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [26] I. Lee, D. Kim, S. Kang, S. Lee, Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 1012–1020.
  • [27] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, Q. Tian, Actional-structural graph convolutional networks for skeleton-based action recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3595–3603.
  • [28] X. Gao, W. Hu, J. Tang, J. Liu, Z. Guo, Optimized skeleton-based action recognition via sparsified graph regression, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 601–610.
  • [29] K. Papadopoulos, E. Ghorbel, D. Aouada, B. Ottersten, Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition, arXiv preprint arXiv:1912.09745.
  • [30] P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, N. Zheng, Semantics-guided neural networks for efficient skeleton-based human action recognition, in: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1112–1121.
  • [31] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, CoRR abs/1502.01852. arXiv:1502.01852.
    URL http://arxiv.org/abs/1502.01852
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762. arXiv:1706.03762.
    URL http://arxiv.org/abs/1706.03762
  • [33] P. P. San, P. Kakar, X.-L. Li, S. Krishnaswamy, J.-B. Yang, M. N. Nguyen, Chapter 9 - deep learning for human activity recognition, in: H.-H. Hsu, C.-Y. Chang, C.-H. Hsu (Eds.), Big Data Analytics for Sensor-Network Collected Intelligence, Intelligent Data-Centric Systems, Academic Press, 2017, pp. 186–204. doi:https://doi.org/10.1016/B978-0-12-809393-1.00009-X.
    URL https://www.sciencedirect.com/science/article/pii/B978012809393100009X
  • [34] S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing between capsules, CoRR abs/1710.09829. arXiv:1710.09829.
    URL http://arxiv.org/abs/1710.09829
  • [35] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, CoRR abs/2010.11929. arXiv:2010.11929.
    URL https://arxiv.org/abs/2010.11929
  • [36] C. Caetano, J. Sena, F. Brémond, J. A. dos Santos, W. R. Schwartz, Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition, CoRR abs/1907.13025. arXiv:1907.13025.
    URL http://arxiv.org/abs/1907.13025
  • [37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556.
  • [38] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, Ntu rgb+d: A large scale dataset for 3d human activity analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [39] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, A. C. Kot, Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (10) (2020) 2684–2701. doi:10.1109/TPAMI.2019.2916873.
  • [40] L. Xia, C. Chen, J. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, 2012, pp. 20–27.
  • [41] Q. Ke, M. Bennamoun, S. An, F. Sohel, F. Boussaid, A new representation of skeleton sequences for 3d action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [42] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, N. Zheng, View adaptive recurrent neural networks for high performance human action recognition from skeleton data, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2117–2126.
  • [43] R. Memmesheimer, N. Theisen, D. Paulus, Gimme signals: Discriminative signal encoding for multimodal activity recognition, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10394–10401. doi:10.1109/IROS45743.2020.9341699.
  • [44] M. Liu, C. Chen, H. Liu, 3d action recognition using data visualization and convolutional neural networks, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2017, pp. 925–930.