Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval
Abstract
Aiming to fully discover the temporal diversity and chronological characteristics of videos for self-supervised representation learning, this work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL). In contrast to existing methods that ignore elaborate temporal dependencies, our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly regards inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning. To model multi-scale temporal dependencies, our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet temporal contrastive graphs. By randomly removing edges and masking nodes of the intra-snippet or inter-snippet graphs, our TCGL generates different correlated graph views. Specific contrastive learning modules are then designed to maximize the agreement between nodes in different views. To adaptively learn the global context representation and recalibrate the channel-wise features, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Experimental results demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.

1 Introduction
Deep convolutional neural networks (CNNs) [21] have achieved state-of-the-art performance in many visual recognition tasks. This can be primarily attributed to the rich representations learned by well-trained networks on large-scale image/video datasets (e.g., ImageNet [6], Kinetics [17], Something-Something [9]) with strong supervision [18]. However, annotating such large-scale data is laborious, expensive, and often impractical, especially for high-level tasks on complex data, such as video action understanding and video retrieval. To fully leverage the large amount of existing unlabeled data, self-supervised learning provides a reasonable way to exploit the intrinsic characteristics of unlabeled data to obtain supervisory signals, and has attracted increasing attention.
Different from image data, for which self-supervised learning can be handled by defining proxy tasks (e.g., predicting relative positions of image patches [7], solving jigsaw puzzles [36], inpainting images [38], and predicting image color channels [24]), video data additionally contains temporal information that can be leveraged to derive supervisory signals. Recently, a variety of approaches have been proposed, such as order verification [34, 8], order prediction [25, 61, 53], and speediness prediction [1, 62]. However, all of these methods consider temporal dependency only at a single scale (i.e., short-term or long-term) and ignore multi-scale temporal dependencies: they extract either snippet-level or frame-level features via 2D/3D CNNs and neglect to integrate these features to model temporal dependencies at multiple time scales.
In this work, we argue that modeling multi-scale temporal dependencies is beneficial for various video classification tasks. Firstly, recent neuroscience studies [30, 46, 14, 5, 33] show that the human visual system perceives detailed motion information by capturing both long-term and short-term temporal dependencies. This insight has inspired several well-known supervised learning methods (e.g., Non-local [56], PSANet [65], GloRe [4], and ACNet [48]). Secondly, an action usually consists of several temporal dependencies at both short-term and long-term timescales. As shown in Figure 1, the action of handshaking contains long-term temporal dependencies (video snippets) of walking forward, shaking hands, and hugging, while it also includes short-term temporal dependencies (frame-sets within a snippet) of periodic hand and foot movements. Randomly shuffling the frames or snippets cannot preserve the semantic content of the video. In fact, the short-term temporal dependencies within a video snippet are especially important for videos with strict temporal coherence, such as those in the Something-Something dataset [9]. Therefore, both short-term (i.e., intra-snippet) and long-term (i.e., inter-snippet) temporal dependencies are essential and should be jointly modeled to learn discriminative temporal representations for unlabeled videos.
Inspired by the convincing performance and high interpretability of graph convolutional networks (GCNs) [19, 47, 64], several works [56, 58, 69, 16, 63] increase the temporal diversity of videos by using GCNs in a supervised learning fashion with large amounts of labeled data. Unfortunately, due to the lack of principles or guidelines for exploring the intrinsic temporal knowledge of unlabeled video data, it is quite challenging to utilize GCNs for self-supervised video representation learning.
To address this issue, we take the joint modeling of inter-snippet and intra-snippet temporal dependencies as our guideline and present a novel self-supervised approach, named Temporal Contrastive Graph Learning (TCGL), which aims to learn multi-scale temporal dependency knowledge within videos by guiding video snippet order prediction in an adaptive manner. Specifically, a given video is sampled into several fixed-length snippets, which are then randomly shuffled. For each snippet, all of its frames are sampled into several fixed-length frame-sets. We use 3D convolutional neural networks (CNNs) as the backbone to extract features for these snippets and frame-sets. To preserve both inter-snippet and intra-snippet temporal dependencies within videos, we propose graph neural network (GNN) structures with prior knowledge about the snippet and frame-set orders. The video snippets of a video and their chronological characteristics are used to construct the inter-snippet temporal graph. Similarly, the frame-sets within a video snippet and their chronological characteristics are leveraged to construct the intra-snippet temporal graph. Furthermore, we randomly remove edges and mask node features of the intra-snippet or inter-snippet graphs to generate different correlated graph views. Specific contrastive learning modules are then designed to enhance the discriminative capability for temporal representation learning. To learn the global context representation and recalibrate the channel-wise features adaptively, we propose an adaptive video snippet order prediction module, which leverages relational knowledge among video snippets to predict the actual snippet orders. The main contributions of this paper can be summarized as follows:
• Integrated with intra-snippet and inter-snippet temporal dependencies, we propose intra-snippet and inter-snippet temporal contrastive graphs to increase the temporal diversity among video frames and snippets in a graph contrastive self-supervised learning manner.
• To learn the global context representation and recalibrate the channel-wise features adaptively for each video snippet, we propose an adaptive video snippet order prediction module, which employs the relational knowledge among video snippets to predict orders.
• Extensive experiments on three backbone networks and two downstream tasks show that the proposed method achieves state-of-the-art performance and demonstrate the great potential of the learned video representations.
The rest of the paper is organized as follows. We first review related works in Section 2; the details of the proposed method are explained in Section 3. In Section 4, the implementation and results of the experiments are provided and analyzed. Finally, we conclude our work in Section 5.
2 Related Work
In this section, we review recent works on supervised and self-supervised video representation learning.
2.1 Supervised Video Representation Learning

For video representation learning, a large number of supervised learning methods have received increasing attention, including traditional methods [23, 20, 49, 50, 35, 39, 27] and deep learning methods [41, 44, 54, 28, 45, 55, 66, 26, 29]. To model and discover temporal knowledge in videos, two-stream CNNs [41] processed RGB frames and dense optical flow in separate networks and then fused the class scores of the two networks to obtain the classification result. C3D [44] processed videos with three-dimensional convolution kernels. Temporal Segment Networks (TSN) [55] sampled each video into several segments to model the long-range temporal structure of videos. The Temporal Relation Network (TRN) [66] introduced an interpretable network to learn and reason about temporal dependencies between video frames at multiple temporal scales. The Temporal Shift Module (TSM) [26] shifted part of the channels along the temporal dimension to facilitate information exchange among neighboring frames. Although these supervised methods achieve promising performance in modeling temporal dependencies, they require large amounts of labeled videos for training an elaborate model, which is time-consuming and labor-intensive.
2.2 Self-supervised Video Representation Learning
Although a large amount of video data exists, annotating such massive data takes great effort. Self-supervised learning generates various pretext tasks to leverage abundant unlabeled data, and the model learned from pretext tasks can be directly applied to downstream tasks for feature extraction or fine-tuning. Several contrastive learning methods have been proposed, such as NCE [11], MoCo [15], BYOL [10], and SimCLR [3]. To better model topologies, contrastive learning methods on graphs [40, 68, 12] have also attracted increasing attention.
For self-supervised video representation learning, effectively exploring temporal information is important, and many existing works focus on discovering it. Shuffle&Learn [34] randomly shuffled video frames and trained a network to distinguish whether the frames are in the right order. The Odd-one-out Network [8] identified unrelated or odd video clips. The order prediction network (OPN) [25] trained networks to predict the correct order of shuffled frames. VCOP [61] used 3D convolutional networks to predict the orders of shuffled video clips. SpeedNet [1] designed a network to detect whether a video is playing at its normal rate or sped up. Video-pace [53] trained a network to identify the playback paces of different video clips. Beyond temporal dependency, Mas [52] proposed a self-supervised learning method that regresses motion and appearance statistics along spatial and temporal dimensions. ST-puzzle [18] used space-time cubic puzzles as the pretext task. IIC [43] introduced intra-negative samples by breaking temporal relations in video clips, and used these samples to build an inter-intra contrastive framework. Although the above works utilize temporal dependency or design specific pretext tasks for video self-supervised learning, the comprehensive temporal diversity and chronological characteristics are not fully explored. In our work, we build a novel inter-intra snippet graph structure to model multi-scale temporal dependencies, and produce self-supervision signals about video snippet orders in a contrastive manner.
3 Temporal Contrastive Graph Learning
In this section, we first give a brief overview of the proposed TCGL, shown in Figure 2, which mainly consists of four stages. (1) Sample and shuffle: for each video, several snippets are uniformly sampled and shuffled, and for each snippet, all of its frames are sampled into several fixed-length frame-sets. (2) Feature extraction: 3D CNNs are utilized to extract features for these snippets and frame-sets, and all 3D CNNs share the same weights. (3) Temporal contrastive graph learning: we build two kinds of temporal contrastive graph structures (intra-snippet and inter-snippet graphs) with prior knowledge about the frame-set and snippet orders. To generate different correlated graph views, we randomly remove edges and mask node features of the intra-snippet or inter-snippet graphs. We then design specific contrastive losses for both graphs to model multi-scale temporal dependencies, which increases the temporal diversity of video representations. (4) Order prediction: the snippet features learned from the temporal contrastive graphs are forwarded through an adaptive snippet order prediction module to output a probability distribution over the possible orders.

For a better presentation, we first introduce several definitions. Given a video $V$, each snippet from this video is composed of continuous frames of size $C \times T \times H \times W$, where $C$ is the number of channels, $T$ is the number of frames, and $H$ and $W$ indicate the height and width of the frames. The size of a 3D convolutional kernel is $t \times k \times k$, where $t$ is the temporal length and $k$ is the spatial size. We define an ordered snippet tuple as $(s_1, s_2, \ldots, s_n)$, and the frame-sets from snippet $s_i$ are denoted as $(f_{i,1}, f_{i,2}, \ldots, f_{i,m})$; the subscripts represent the chronological order. Let $G = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ represents the node set and $\mathcal{E}$ represents the edge set. We denote the feature matrix and the adjacency matrix as $X \in \mathbb{R}^{N \times d}$ and $A \in \{0,1\}^{N \times N}$, where $x_i$ is the feature of node $v_i$, and $A_{ij} = 1$ if $(v_i, v_j) \in \mathcal{E}$.
3.1 Sample and Shuffle
In this stage, we randomly sample consecutive frames (snippets) from the video to construct video snippet tuples. If we sample $n$ snippets from a video, there are $n!$ possible snippet orders. Since snippet order prediction is purely a proxy task of the TCGL and our focus is the learning of 3D CNNs, we restrict the number of snippets per video to a small range (2 to 4 in our experiments) to alleviate the complexity of the order prediction task, inspired by previous works [36, 61, 60]. The snippets are sampled uniformly from the video with a fixed frame interval between them. After sampling, the snippets are shuffled to form the snippet tuple $(s_1, s_2, \ldots, s_n)$. For each snippet $s_i$, all of its frames are uniformly divided into $m$ frame-sets of equal length, yielding the frame-sets $(f_{i,1}, f_{i,2}, \ldots, f_{i,m})$. The snippet tuples contain the dynamic information and strict chronological relations of a video, which constitute the long-term temporal dependency. The frame-sets within a snippet provide the frame-level temporal relations among frames, i.e., the short-term temporal dependency. By taking both long-term and short-term temporal dependencies into consideration, we can increase the temporal diversity more comprehensively and precisely.
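To make the sampling stage concrete, the following minimal sketch illustrates how snippets and frame-sets could be drawn and shuffled; the function name, the video-as-array interface, and the interval value are illustrative assumptions rather than the authors' exact implementation (the snippet length of 16 frames, 3 snippets, and 4 frame-sets follow the settings reported in Section 4.2).

```python
import numpy as np

def sample_and_shuffle(video, n_snippets=3, snippet_len=16, n_framesets=4, interval=8, rng=None):
    """Sample uniformly spaced snippets from a video, shuffle their order, and
    split every snippet into equal-length frame-sets.

    video: array of shape (T, H, W, C). The returned permutation index is the
    classification target of the order prediction proxy task.
    """
    rng = rng if rng is not None else np.random.default_rng()
    span = n_snippets * snippet_len + (n_snippets - 1) * interval
    start = rng.integers(0, max(video.shape[0] - span, 0) + 1)

    # Consecutive-frame snippets separated by a fixed interval.
    snippets = [video[start + i * (snippet_len + interval):
                      start + i * (snippet_len + interval) + snippet_len]
                for i in range(n_snippets)]

    order = rng.permutation(n_snippets)          # shuffled snippet order
    shuffled = [snippets[i] for i in order]

    # Each snippet is divided into n_framesets consecutive frame-sets.
    framesets = [np.split(s, n_framesets, axis=0) for s in shuffled]
    return shuffled, framesets, order
```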
3.2 Feature Extraction
To extract spatio-temporal features, we choose C3D [44], R3D [45], and R(2+1)D [45] as feature encoders. The same 3D CNN is used for all snippets and frame-sets, as Figure 2 (b) shows. C3D extends 2D CNNs to spatio-temporal representation learning, since it can model the temporal information and dynamics of videos. The C3D network consists of stacked convolutional layers interleaved with pooling layers, followed by two fully connected layers. The size of all convolutional kernels is $3 \times 3 \times 3$, which was validated in previous work [44]. R3D is a 3D CNN with residual connections. An R3D block consists of two 3D convolutional layers followed by batch normalization and ReLU layers, with the input and output connected by a residual unit before the final ReLU. R(2+1)D is similar to R3D, except that each 3D convolution is decomposed into two separate operations: a 2D spatial convolution followed by a 1D temporal convolution.
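As a sketch of the decomposition used by R(2+1)D, the module below factorizes a 3D convolution into a spatial and a temporal convolution; the class name, channel defaults, and the simplified intermediate-width choice are assumptions for illustration (R(2+1)D [45] chooses the intermediate width to match the parameter count of the full 3D kernel).

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorized (2+1)D convolution: a 2D spatial conv (1 x k x k) followed by a
    1D temporal conv (t x 1 x 1). The intermediate width defaults to out_ch here
    for simplicity, which is not the paper's exact parameter-matching rule."""
    def __init__(self, in_ch, out_ch, k=3, t=3, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: y = Conv2Plus1D(3, 64)(torch.randn(2, 3, 16, 112, 112))  -> (2, 64, 16, 112, 112)
```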
3.3 Temporal Contrastive Graphs
Due to the effectiveness of graph convolutional networks (GCNs) [19, 68, 59] in modeling irregular graph-structured relationships, we use them to explore node interactions within each snippet and frame-set for modeling multi-scale temporal dependencies of videos. After obtaining the feature vectors for snippets and frame-sets, we construct two kinds of temporal contrastive graph structures, the inter-snippet and intra-snippet temporal contrastive graphs, to increase the temporal diversity of videos, as shown in Figure 2 (c).
To build the intra-snippet and inter-snippet temporal contrastive graphs, we take advantage of prior knowledge about the chronological relations and the corresponding feature vectors. To fix notation, we denote the intra-snippet graph of snippet $s_j$ as $G_j^{intra}$ ($j = 1, \ldots, n$) and the inter-snippet graph as $G^{inter}$, where $n$ is the number of snippets and $m$ is the number of frame-sets in a video snippet. As shown in Figure 3, the correct order of frames within frame-sets and the correct order of snippets within snippet tuples are known a priori, because our proxy task is video snippet order prediction. Therefore, we can utilize this prior chronological relationship to determine the edges of the graphs. For example, in Figure 3, if we know that the snippets (frame-sets) are ranked chronologically, we can connect the temporally related nodes and disconnect the temporally unrelated ones. In the following, we take the inter-snippet temporal graph as an example to clarify our temporal contrastive graph learning method.
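A minimal sketch of how such adjacency matrices could be built from the known chronological order is given below; it assumes that "temporally related" means chronologically adjacent nodes and that self-loops are added, which is one plausible reading of the prior rather than the paper's exact construction.

```python
import torch

def chronological_adjacency(num_nodes: int) -> torch.Tensor:
    """Build an adjacency matrix from the known chronological order of nodes
    (snippets or frame-sets): connect each node with its temporal neighbour
    and add self-loops, as is common for GCN inputs."""
    A = torch.eye(num_nodes)
    for i in range(num_nodes - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0   # connect node i with its temporal successor
    return A

# Inter-snippet graph over n = 3 snippets; intra-snippet graph over m = 4 frame-sets.
A_inter = chronological_adjacency(3)
A_intra = chronological_adjacency(4)
```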
For $G^{inter}$, we randomly remove edges and mask node features to generate two graph views, $G_1^{inter}$ and $G_2^{inter}$, whose node embeddings are denoted as $U$ and $V$, respectively. Since different graph views provide different contexts for each node, we corrupt the original graph at both the structure and attribute levels to enable contrastive learning between node embeddings from different views. Specifically, we adopt two strategies for generating graph views: removing edges and masking node features.
The edges in the original graph are randomly removed using a random masking matrix $\widetilde{R} \in \{0,1\}^{N \times N}$, whose entries are drawn from a Bernoulli distribution, $\widetilde{R}_{ij} \sim \mathcal{B}(1 - p_r)$ if $A_{ij} = 1$ in the original graph, and $\widetilde{R}_{ij} = 0$ otherwise. Here $p_r$ denotes the probability of each edge being removed. The resulting adjacency matrix is then computed as follows, where $\odot$ is the Hadamard product:

$\widetilde{A} = A \odot \widetilde{R}$   (1)
In addition, part of the node features is masked with zeros using a random vector $\widetilde{m} \in \{0,1\}^{d}$, each dimension of which is drawn from a Bernoulli distribution $\mathcal{B}(1 - p_m)$. The generated masked features $\widetilde{X}$ are calculated as follows:

$\widetilde{X} = [\, x_1 \odot \widetilde{m} \,;\, x_2 \odot \widetilde{m} \,;\, \cdots \,;\, x_N \odot \widetilde{m} \,]$   (2)

where $[\,\cdot\,;\,\cdot\,]$ is the concatenation operator. We jointly leverage these two strategies to generate graph views.
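The following sketch illustrates the two corruption strategies of Eqs. (1)-(2); the probabilities `p_edge` and `p_feat` are placeholders for the paper's $p_r$ and $p_m$, whose exact values are hyper-parameters set empirically.

```python
import torch

def generate_view(A, X, p_edge=0.1, p_feat=0.1):
    """Generate one corrupted graph view by randomly removing edges (Eq. (1)) and
    masking feature dimensions (Eq. (2)). A: (N, N) adjacency, X: (N, d) features."""
    A = A.float()
    # Edge removal: each existing edge survives with probability 1 - p_edge.
    R = torch.bernoulli(torch.full_like(A, 1.0 - p_edge)) * (A > 0).float()
    A_view = A * R                                  # Hadamard product, Eq. (1)

    # Feature masking: one Bernoulli mask over the d feature dimensions, shared by
    # all nodes, zeroes each dimension with probability p_feat.
    m = torch.bernoulli(torch.full((X.size(1),), 1.0 - p_feat))
    X_view = X * m                                  # broadcast over nodes, Eq. (2)
    return A_view, X_view
```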
Inspired by the NCE loss [11], we propose a contrastive learning loss that distinguishes the embeddings of the same node in the two views from the embeddings of all other nodes. Given a positive pair, the negative samples come from all other nodes in the two views (inter-view or intra-view). To measure the agreement between embeddings $u_i$ and $v_i$ from the two views, we define the relation function $\theta(u, v) = s(g(u), g(v))$, where $s(\cdot, \cdot)$ is the L2-normalized dot-product (cosine) similarity and $g(\cdot)$ is a non-linear projection implemented as a two-layer multilayer perceptron. The pairwise contrastive objective for the positive pair $(u_i, v_i)$ is defined as:

$\ell(u_i, v_i) = -\log \dfrac{e^{\theta(u_i, v_i)/\tau}}{e^{\theta(u_i, v_i)/\tau} + \sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} e^{\theta(u_i, v_k)/\tau} + \sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} e^{\theta(u_i, u_k)/\tau}}$   (3)

where $\mathbb{1}_{[k \neq i]}$ is an indicator function that equals $1$ iff $k \neq i$, and $\tau$ is a temperature parameter set empirically. The first term in the denominator corresponds to the positive pair, the second term to the inter-view negative pairs, and the third term to the intra-view negative pairs. Since the two views are symmetric, the loss $\ell(v_i, u_i)$ for the other view is defined analogously. The overall contrastive loss for $G^{inter}$ is defined as follows:
$\mathcal{L}_{inter} = \dfrac{1}{2N} \sum_{i=1}^{N} \big[ \ell(u_i, v_i) + \ell(v_i, u_i) \big]$   (4)
The contrastive loss for the intra-snippet graphs is computed similarly and averaged over the $n$ intra-snippet graphs of a video:

$\mathcal{L}_{intra} = \dfrac{1}{n} \sum_{j=1}^{n} \dfrac{1}{2m} \sum_{i=1}^{m} \big[ \ell(u_i^{(j)}, v_i^{(j)}) + \ell(v_i^{(j)}, u_i^{(j)}) \big]$   (5)
The overall temporal contrastive graph loss is then defined as follows, where $\alpha$ and $\beta$ are the weights for the inter-snippet and intra-snippet graphs, respectively:

$\mathcal{L}_{graph} = \alpha \, \mathcal{L}_{inter} + \beta \, \mathcal{L}_{intra}$   (6)
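For concreteness, a sketch of the node-level contrastive objective in Eqs. (3)-(4) is given below; the temperature default of 0.5 and the projection-head interface are assumptions, since the paper only specifies that $\tau$ is set empirically.

```python
import torch
import torch.nn.functional as F

def pairwise_nce(u, v, proj, tau=0.5):
    """Contrastive loss between node embeddings of two views, Eq. (3) averaged over nodes.
    u, v: (N, d) embeddings of the same nodes in view 1 / view 2; proj is the two-layer MLP g(.)."""
    hu, hv = F.normalize(proj(u), dim=1), F.normalize(proj(v), dim=1)
    sim_uv = hu @ hv.t() / tau          # theta(u_i, v_k) for all pairs
    sim_uu = hu @ hu.t() / tau          # theta(u_i, u_k), intra-view pairs
    N = u.size(0)
    pos = torch.diag(sim_uv)            # theta(u_i, v_i)
    neg_mask = ~torch.eye(N, dtype=torch.bool, device=u.device)
    # Denominator: positive pair + inter-view negatives + intra-view negatives.
    denom = pos.exp() + (sim_uv.exp() * neg_mask).sum(1) + (sim_uu.exp() * neg_mask).sum(1)
    return -(pos - denom.log()).mean()

def graph_contrastive_loss(u, v, proj, tau=0.5):
    """Symmetric loss over both view directions, cf. Eq. (4)."""
    return 0.5 * (pairwise_nce(u, v, proj, tau) + pairwise_nce(v, u, proj, tau))
```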
3.4 Adaptive Order Prediction
We formulate order prediction as a classification task that takes the video snippet features learned from the temporal contrastive graph as input and outputs a probability distribution over the possible orders. Since the features from different video snippets are correlated, we build an adaptive order prediction module that receives the features of the different video snippets and learns a global context embedding, which is then used to recalibrate the input features from different snippets, as shown in Figure 4.

To fix notation, we assume that a video is shuffled into $n$ snippets, and the snippet features learned from the inter-snippet temporal contrastive graph are $\{h_1, h_2, \ldots, h_n\}$, where $h_i \in \mathbb{R}^{d}$. To exploit the correlation among these snippets, we concatenate the feature vectors and obtain a joint representation through a fully-connected layer:

$z = W_1 [\, h_1 \,;\, h_2 \,;\, \cdots \,;\, h_n \,] + b_1$   (7)
where $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation operation, $z$ denotes the joint representation, and $W_1$ and $b_1$ are the weights and bias of the fully-connected layer. The dimension of $z$ is chosen to restrict the model capacity and increase its generalization ability. To make use of the global context information aggregated in the joint representation $z$, we predict an excitation signal from it with another fully-connected layer:

$e = W_2 z + b_2$   (8)
where $W_2$ and $b_2$ are the weights and bias of this fully-connected layer. After obtaining the excitation signal $e$, we use it to recalibrate the input features adaptively through a simple gating mechanism:

$\tilde{h}_i = \delta(e) \odot h_i, \quad i = 1, \ldots, n$   (9)
where $\odot$ is the channel-wise product applied to each element along the channel dimension, and $\delta$ is the ReLU function. In this way, the features of one snippet can recalibrate the features of another snippet while the correlation among different snippets is preserved.
Finally, the refined features are concatenated and fed into a two-layer perceptron with softmax to output the snippet order prediction. The cross-entropy loss is used to measure the correctness of the prediction:

$\mathcal{L}_{order} = - \sum_{i=1}^{K} y_i \log p_i$   (10)

where $y_i$ and $p_i$ represent the probability that the sample belongs to the $i$-th order class in the ground truth and the prediction, respectively, and $K$ denotes the number of all possible orders ($K = n!$ for $n$ snippets).
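The sketch below assembles Eqs. (7)-(10) into a single module; the hidden sizes, the use of one shared excitation signal for all snippets, and the classifier width are assumptions consistent with the description above rather than the exact architecture.

```python
import math
import torch
import torch.nn as nn

class AdaptiveOrderPrediction(nn.Module):
    """Sketch of the adaptive snippet order prediction head: concatenate snippet
    features, squeeze them into a global context embedding, predict an excitation
    signal that recalibrates each snippet feature, then classify the permutation."""
    def __init__(self, n_snippets=3, feat_dim=512, hidden=256):
        super().__init__()
        n_orders = math.factorial(n_snippets)                     # n! possible orders
        self.squeeze = nn.Linear(n_snippets * feat_dim, hidden)   # Eq. (7)
        self.excite = nn.Linear(hidden, feat_dim)                 # Eq. (8)
        self.relu = nn.ReLU(inplace=True)
        self.classifier = nn.Sequential(                          # two-layer MLP head
            nn.Linear(n_snippets * feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_orders))

    def forward(self, feats):            # feats: list of n tensors of shape (B, feat_dim)
        z = self.squeeze(torch.cat(feats, dim=1))
        e = self.relu(self.excite(z))                             # gating signal, delta = ReLU
        refined = [e * f for f in feats]                          # Eq. (9), channel-wise product
        return self.classifier(torch.cat(refined, dim=1))         # logits over the n! orders

# Trained with nn.CrossEntropyLoss() on the true permutation index, cf. Eq. (10).
```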
4 Experiments
In this section, we first elaborate the experimental settings and then conduct ablation studies to analyze the contribution of key components. Finally, the learned 3D CNNs are evaluated on video action recognition and video retrieval tasks and compared with state-of-the-art methods.
4.1 Experimental Setting
Datasets. We evaluate our method on three action recognition datasets: UCF101 [42], HMDB51 [22], and Kinetics-400 [17]. UCF101 is collected from websites and contains 101 action classes, with about 9.5k videos for training and 3.5k videos for testing. HMDB51 is collected from various sources and contains 51 action classes, with about 3.5k videos for training and 1.5k videos for testing. Kinetics-400 is a large-scale action recognition dataset, which contains 400 action classes and around 306k videos. In this work, we use its training split (around 240k videos) as the pre-training dataset.
Network Architecture. We use PyTorch [37] to implement the whole framework. For the video encoder, C3D, R3D, and R(2+1)D are used as backbones, where the kernel size of the 3D convolutional layers is set to $3 \times 3 \times 3$. The R3D network is implemented with no repetitions in conv{2-5}_x, which results in 9 convolutional layers in total. The C3D network is modified by replacing the two fully connected layers with global spatio-temporal pooling layers. The R(2+1)D network has the same architecture as the R3D network, with only the 3D kernels decomposed. Dropout is applied between the fully-connected layers. Our GCNs for both the inter-snippet and intra-snippet graphs consist of one graph convolutional layer.
Parameters. Following the settings in [61, 62], we set the snippet length of the input video to 16 frames, sample snippets with a fixed frame interval, and use 3 snippets per tuple and 4 frame-sets within each snippet. During training, we randomly split a portion of videos from the training set as the validation set. Video frames are resized and then randomly cropped before being fed to the network. We set balancing parameters to weigh the contributions of the temporal contrastive graph module and the adaptive order prediction module; the weights $\alpha$ and $\beta$ are both set to 1 according to the ablation study results. To optimize the framework, we use mini-batch stochastic gradient descent with momentum and weight decay, and the initial learning rate is decayed during training. To make the temporal contrastive graphs sensitive to subtle variance between different graph views, the edge-removal and feature-masking probabilities $p_r$ and $p_m$ are set empirically and take different values for graph view 1 and graph view 2; the same values are used for both the inter-snippet and intra-snippet graphs. The model with the lowest validation loss is saved as the best model.
4.2 Ablation Study
In this subsection, we conduct ablation studies on the first split of UCF101 with R3D as the backbone, to analyze the contribution of each component of our TCGL.
The number of snippets. The results of R3D on the snippet order prediction task with different numbers of snippets are shown in Table 1. The prediction accuracy decreases when the number of snippets increases from 3 to 4, because the difficulty of the prediction task grows with the number of snippets, which makes the model harder to learn. Therefore, we use 3 snippets per video as a compromise between task complexity and prediction accuracy.
The number of frame-sets. Since the snippet length is 16, the number of frame-sets within each snippet can be 1, 2, 4, 8, or 16. When the number is 16, each frame-set contains only a single frame and thus only static information without temporal information. When the number is 1 or 2, it is hard to model the intra-snippet temporal relationship with so few frame-sets. From Table 2, we observe that too many frame-sets within a snippet make the intra-snippet temporal modeling more difficult, which degrades the order prediction performance. Therefore, we choose 4 frame-sets per snippet in the experiments for short-term temporal modeling.
The intra-snippet and inter-snippet graphs. To analyze the contributions of the intra-snippet and inter-snippet temporal contrastive graphs, we set different values for $\beta$ (intra) and $\alpha$ (inter), as shown in Table 3. Notably, removing the intra-snippet graph degrades the performance significantly even when the inter-snippet graph is kept, which verifies the importance of intra-snippet graphs for modeling short-term temporal dependency. When the weights of the intra-snippet and inter-snippet graphs are set to 1 and 0.1, the prediction accuracy is 80.2%, whereas exchanging the two weights drops the accuracy to 54.9%, which again highlights the role of the intra-snippet graphs. In addition, the performance of TCGL drops significantly when either or both graphs are removed. The best prediction accuracy (83.0%) is obtained when $\alpha = \beta = 1$. These results validate that both intra-snippet and inter-snippet temporal contrastive graphs are essential for increasing the temporal diversity of features.
Table 1: Snippet order prediction accuracy (%) of R3D with different numbers of snippets.

Model | Snippet Length | Snippet Number | Accuracy (%)
---|---|---|---
R3D | 16 | 2 | 53.4 |
R3D | 16 | 3 | 80.2 |
R3D | 16 | 4 | 61.1 |
Table 2: Snippet order prediction accuracy (%) of R3D with different numbers of frame-sets per snippet.

Model | Snippet Length | Frame-set Number | Accuracy (%)
---|---|---|---
R3D | 16 | 1 | 54.1 |
R3D | 16 | 2 | 54.1 |
R3D | 16 | 4 | 80.2 |
R3D | 16 | 8 | 63.1 |
Table 3: Ablation on the intra-snippet and inter-snippet graph weights; order prediction and action recognition accuracies (%) on UCF101.

Model | Intra ($\beta$) | Inter ($\alpha$) | Prediction | Recognition
---|---|---|---|---
R3D | 0 | 0 | 55.6 | 57.3 |
R3D | 0 | 1 | 55.6 | 56.0 |
R3D | 1 | 0 | 76.6 | 60.9 |
R3D | 0.1 | 1 | 54.9 | 56.7 |
R3D | 1 | 0.1 | 80.2 | 66.8 |
R3D | 1 | 1 | 83.0 | 67.6 |
Table 4: Ablation on the adaptive snippet order prediction (ASOR) module; accuracies (%).

Method | Model | ASOR | Prediction | Recognition
---|---|---|---|---
TCGL | R3D | ✗ | 78.4 | 62.8 |
TCGL | R3D | ✓ | 80.2 | 67.6 |
The adaptive snippet order prediction. To analyze the contribution of the proposed adaptive snippet order prediction (ASOR) module, we remove it and merely feed the concatenated features into a multi-layer perceptron with softmax to output the final snippet order prediction. As shown in Table 4, our TCGL performs better than TCGL without the ASOR module on both the order prediction and action recognition tasks. This verifies that the ASOR module can better utilize the relational knowledge among video snippets than simple concatenation.
The methods of introducing noise. To justify our graph corruption strategy, which operates at both the feature and topology levels, we compare it with three simple alternatives: (1) adding random Gaussian noise; (2) randomly removing edges only; (3) randomly removing nodes only. All three alternatives yield lower order prediction and action recognition accuracies than our full strategy, confirming the benefit of jointly corrupting features and topology.
Table 5: Comparison of action recognition accuracy (%) with state-of-the-art self-supervised methods on UCF101 and HMDB51.

Method | Backbone | Pretrain | UCF101 | HMDB51
---|---|---|---|---
Object Patch [57] | AlexNet | UCF101 | 42.7 | 15.6 |
Shuffle [34] | CaffeNet | UCF101 | 50.9 | 19.8 |
OPN [25] | VGG | UCF101 | 56.3 | 22.1 |
Deep RL [2] | CaffeNet | UCF101 | 58.6 | 25.0 |
Random (Baseline) | C3D | UCF101 | 61.8 | 24.7 |
Mas [52] | C3D | UCF101 | 58.8 | 32.6 |
VCOP[61] | C3D | UCF101 | 65.6 | 28.4 |
COP[60] | C3D | UCF101 | 66.9 | 31.8 |
PRP [62] | C3D | UCF101 | 69.1 | 34.5 |
ST-puzzle [18] | C3D | K400 | 60.6 | 28.3 |
Mas [52] | C3D | K400 | 61.2 | 33.4 |
STS [51] | C3D | K400 | 71.8 | 37.8 |
TCGL (Ours) | C3D | UCF101 | 69.5 | 35.1 |
TCGL (Ours) | C3D | K400 | 75.2 | 38.9 |
Random (Baseline) | R3D | UCF101 | 54.5 | 23.4 |
VCOP [61] | R3D | UCF101 | 64.9 | 29.5 |
COP [60] | R3D | UCF101 | 66.0 | 28.0 |
TCP [31] | R3D | UCF101 | 64.8 | 34.7 |
PRP [62] | R3D | UCF101 | 66.5 | 29.7 |
ST-puzzle [18] | R3D | K400 | 65.8 | 33.7 |
DPC [13] | R3D | K400 | 68.2 | 34.5 |
TCP [31] | R3D | K400 | 70.5 | 41.1 |
TCGL (Ours) | R3D | UCF101 | 67.6 | 30.8 |
TCGL (Ours) | R3D | K400 | 76.8 | 41.5 |
Random (Baseline) | R(2+1)D | UCF101 | 55.8 | 22.0 |
VCP [32] | R(2+1)D | UCF101 | 66.3 | 32.2 |
VCOP [61] | R(2+1)D | UCF101 | 72.4 | 30.9 |
STS [51] | R(2+1)D | UCF101 | 73.6 | 34.1 |
COP [60] | R(2+1)D | UCF101 | 74.5 | 34.8 |
PRP [62] | R(2+1)D | UCF101 | 72.1 | 35.0 |
V-pace [53] | R(2+1)D | UCF101 | 75.9 | 35.9 |
V-pace [53] | R(2+1)D | K400 | 77.1 | 36.6 |
TCGL (Ours) | R(2+1)D | UCF101 | 74.9 | 36.2 |
TCGL (Ours) | R(2+1)D | K400 | 77.6 | 39.7 |
4.3 Action Recognition
To verify the effectiveness of our TCGL on action recognition, we initialize the backbones with the model pre-trained on the first split of UCF101 or the whole K400 training set, and fine-tune on UCF101 and HMDB51 for a fixed number of epochs. The features extracted by the backbones are fed into fully-connected layers to obtain the predictions. For testing, we sample multiple clips from each video, generate clip-level predictions, and average them to obtain the final prediction. The average classification accuracy over three splits is reported and compared with other self-supervised methods in Table 5, where "Random" means the model is randomly initialized without pre-training. When pre-trained on UCF101, we outperform the current best-performing method PRP [62]; when pre-trained on K400, we outperform the current best-performing method V-pace [53]. In addition, we consistently outperform the other state-of-the-art methods and random initialization across all evaluation settings. Furthermore, when pre-trained on UCF101, we achieve better accuracies than some K400 pre-trained methods (Mas [52] and ST-puzzle [18]). These results validate that our TCGL can effectively increase the temporal diversity of videos and learn discriminative spatio-temporal representations.
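As a sketch of the test-time protocol described above (clip-level predictions averaged into a video-level prediction), the helper below is illustrative; the number of clips per video is a protocol choice left unspecified here.

```python
import torch

@torch.no_grad()
def video_prediction(model, clips):
    """Average clip-level predictions to obtain a video-level prediction.
    `clips` is a tensor of shape (num_clips, C, T, H, W) sampled from one video;
    `model` returns per-clip class logits."""
    logits = model(clips)                       # (num_clips, num_classes)
    probs = torch.softmax(logits, dim=1)
    return probs.mean(dim=0).argmax().item()    # averaged prediction -> class index
```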
Table 6: Video retrieval accuracy (%) on UCF101 (Top-1/5/10/20/50).

Method | Backbone | Pretrain | Top-1 | Top-5 | Top-10 | Top-20 | Top-50
---|---|---|---|---|---|---|---
Jigsaw [36] | AlexNet | UCF | 19.7 | 28.5 | 33.5 | 40.0 | 49.4 |
OPN [25] | VGG | UCF | 19.9 | 28.7 | 34.0 | 40.6 | 51.6 |
Deep RL [2] | CaffeNet | UCF | 25.7 | 36.2 | 42.2 | 49.2 | 59.5 |
SpeedNet [1] | S3D-G | K400 | 13.0 | 28.1 | 37.5 | 49.5 | 65.0 |
Random | C3D | UCF | 16.7 | 27.5 | 33.7 | 41.4 | 53.0 |
VCOP [61] | C3D | UCF | 12.5 | 29.0 | 39.0 | 50.6 | 66.9 |
PRP [62] | C3D | UCF | 23.2 | 38.1 | 46.0 | 55.7 | 68.4 |
V-pace [53] | C3D | UCF | 20.0 | 37.4 | 46.9 | 58.5 | 73.1 |
TCGL (Ours) | C3D | UCF | 22.5 | 40.7 | 49.8 | 59.9 | 73.3 |
TCGL (Ours) | C3D | K400 | 23.6 | 41.2 | 50.1 | 60.4 | 74.2 |
Random | R3D | UCF | 9.9 | 18.9 | 26.0 | 35.5 | 51.9 |
VCOP [61] | R3D | UCF | 14.1 | 30.3 | 40.4 | 51.1 | 66.5 |
PRP [62] | R3D | UCF | 22.8 | 38.5 | 46.7 | 55.2 | 69.1 |
V-pace [53] | R3D | UCF | 19.9 | 36.2 | 46.1 | 55.6 | 69.2 |
TCGL (Ours) | R3D | UCF | 23.3 | 39.6 | 48.4 | 58.8 | 72.4 |
TCGL (Ours) | R3D | K400 | 23.9 | 43.0 | 53.0 | 62.9 | 75.7 |
Random | R(2+1)D | UCF | 10.6 | 20.7 | 27.4 | 37.4 | 53.1 |
VCOP [61] | R(2+1)D | UCF | 10.7 | 25.9 | 35.4 | 47.3 | 63.9 |
PRP [62] | R(2+1)D | UCF | 20.3 | 34.0 | 41.9 | 51.7 | 64.2 |
V-pace [53] | R(2+1)D | UCF | 17.9 | 34.3 | 44.6 | 55.5 | 72.0 |
TCGL (Ours) | R(2+1)D | UCF | 20.6 | 36.2 | 45.5 | 56.2 | 72.1 |
TCGL (Ours) | R(2+1)D | K400 | 21.9 | 40.2 | 49.6 | 59.7 | 73.1 |
Table 7: Video retrieval accuracy (%) on HMDB51 (Top-1/5/10/20/50).

Method | Backbone | Pretrain | Top-1 | Top-5 | Top-10 | Top-20 | Top-50
---|---|---|---|---|---|---|---
Random | C3D | UCF | 7.4 | 20.5 | 31.9 | 44.5 | 66.3 |
VCOP [61] | C3D | UCF | 7.4 | 22.6 | 34.4 | 48.5 | 70.1 |
PRP [62] | C3D | UCF | 10.5 | 27.2 | 40.4 | 56.2 | 75.9 |
V-pace [53] | C3D | UCF | 8.0 | 25.2 | 37.8 | 54.4 | 77.5 |
TCGL (Ours) | C3D | UCF | 10.7 | 28.6 | 41.1 | 57.9 | 77.7 |
TCGL (Ours) | C3D | K400 | 12.3 | 30.4 | 42.9 | 59.1 | 79.2 |
Random | R3D | UCF | 6.7 | 18.3 | 28.3 | 43.1 | 67.9 |
VCOP [61] | R3D | UCF | 7.6 | 22.9 | 34.4 | 48.8 | 68.9 |
PRP [62] | R3D | UCF | 8.2 | 25.8 | 38.5 | 53.3 | 75.9 |
V-pace [53] | R3D | UCF | 8.2 | 24.2 | 37.3 | 53.3 | 74.5 |
TCGL (Ours) | R3D | UCF | 10.9 | 29.5 | 42.9 | 56.9 | 76.8 |
TCGL (Ours) | R3D | K400 | 11.2 | 30.6 | 43.8 | 58.1 | 78.0 |
Random | R(2+1)D | UCF | 4.5 | 14.8 | 23.4 | 38.9 | 63.0 |
VCOP [61] | R(2+1)D | UCF | 5.7 | 19.5 | 30.7 | 45.8 | 67.0 |
PRP [62] | R(2+1)D | UCF | 8.2 | 25.3 | 36.2 | 51.0 | 73.0 |
V-pace [53] | R(2+1)D | UCF | 10.1 | 24.6 | 37.6 | 54.4 | 77.1 |
TCGL (Ours) | R(2+1)D | UCF | 11.1 | 30.4 | 43.0 | 56.5 | 77.4 |
TCGL (Ours) | R(2+1)D | K400 | 13.2 | 33.5 | 46.4 | 59.3 | 80.2 |
4.4 Video Retrieval
To further verify the effectiveness of our TCGL on video retrieval, we test it on nearest-neighbor video retrieval. Since this task is conducted with features extracted by the backbone network without fine-tuning, its performance largely relies upon the representative capacity of the self-supervised model. The experiment is conducted with pre-training on the first split of UCF101 or the whole training set of K400, following the protocol in [61, 62]. For video retrieval, we extract video features from the backbone pre-trained by TCGL. Each video in the testing set is used to query the nearest videos from the training set using cosine distance. If the class of a test video appears among the classes of its k nearest training videos, the retrieval is considered correct. We report top-1, top-5, top-10, top-20, and top-50 retrieval accuracies on the UCF101 and HMDB51 datasets and compare our method with other self-supervised methods, as shown in Tables 6 and 7. For all backbones, our TCGL outperforms the state-of-the-art methods on nearly all evaluation metrics by substantial margins. Figure 5 visualizes a query video snippet and its top-3 nearest neighbors from the UCF101 training set using the TCGL embedding; the representation learned by TCGL retrieves videos with the same semantic meaning. To better understand what TCGL learns, we follow the Class Activation Map [67] approach to visualize the spatio-temporal regions, as shown in Figure 6. These examples exhibit a strong correlation between highly activated regions and the dominant movement in the scene, which validates that our TCGL can learn discriminative temporal representations for videos.
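A minimal sketch of this retrieval protocol (cosine-similarity nearest neighbours with top-k hit counting) is given below; the function name and the tensor-based interface are illustrative.

```python
import torch
import torch.nn.functional as F

def topk_retrieval(test_feats, train_feats, train_labels, test_labels, ks=(1, 5, 10, 20, 50)):
    """Nearest-neighbour retrieval with cosine similarity: a query test video is
    counted as correct at top-k if its class appears among the classes of its k
    nearest training videos. Feature tensors are (N, d); labels are (N,)."""
    q = F.normalize(test_feats, dim=1)
    g = F.normalize(train_feats, dim=1)
    sims = q @ g.t()                                   # cosine similarity matrix
    ranks = sims.argsort(dim=1, descending=True)       # nearest training videos first
    accs = {}
    for k in ks:
        retrieved = train_labels[ranks[:, :k]]         # (N_test, k) retrieved classes
        hit = (retrieved == test_labels.unsqueeze(1)).any(dim=1)
        accs[f"top-{k}"] = hit.float().mean().item()
    return accs
```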


5 Conclusion
In this paper, we propose a novel Temporal Contrastive Graph Learning (TCGL) approach for self-supervised video representation learning. With the inter-intra snippet graph contrastive learning strategy and the adaptive video snippet order prediction task, the temporal diversity and multi-scale temporal dependencies of videos can be well discovered. The proposed TCGL is applied to video action recognition and video retrieval tasks with three kinds of 3D CNNs. Extensive experiments demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale benchmarks.
Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grant 62002395, the China Postdoctoral Science Foundation funded project under Grant 2020M672966, and the Fundamental Research Funds for the Central Universities under Grant 20lgpy131.
References
- [1] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9922–9931, 2020.
- [2] Uta Buchler, Biagio Brattoli, and Bjorn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In Proceedings of the European conference on computer vision (ECCV), pages 770–786, 2018.
- [3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning,, pages 1597–1607, 2020.
- [4] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
- [5] Tom Cornsweet. Visual perception. Academic press, 2012.
- [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- [7] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422–1430, 2015.
- [8] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3636–3645, 2017.
- [9] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, page 5, 2017.
- [10] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
- [11] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
- [12] Hakim Hafidi, Mounir Ghogho, Philippe Ciblat, and Ananthram Swami. Graphcl: Contrastive self-supervised learning of graph representations. arXiv preprint arXiv:2007.08025, 2020.
- [13] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
- [14] J Hans and Johannes RM Cruysberg. The visual system. In Clinical Neuroanatomy, pages 409–453. Springer, 2020.
- [15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- [16] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
- [17] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [18] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8545–8552, 2019.
- [19] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- [20] Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1, 2008.
- [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [22] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
- [23] Ivan Laptev. On space-time interest points. International journal of computer vision, 64(2-3):107–123, 2005.
- [24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6874–6883, 2017.
- [25] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
- [26] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 7083–7093, 2019.
- [27] Yang Liu, Zhaoyang Lu, Jing Li, and Tao Yang. Hierarchically learned view-invariant representations for cross-view action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(8):2416–2430, 2018.
- [28] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Global temporal representation based cnns for infrared action recognition. IEEE Signal Processing Letters, 25(6):848–852, 2018.
- [29] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Deep image-to-video adaptation and fusion networks for action recognition. IEEE Transactions on Image Processing, 29:3168–3182, 2019.
- [30] Margaret Livingstone and David Hubel. Segregation of form, color, movement, and depth: anatomy, physiology, and perception. Science, 240(4853):740–749, 1988.
- [31] Guillaume Lorre, Jaonary Rabarisoa, Astrid Orcesi, Samia Ainouz, and Stephane Canu. Temporal contrastive pretraining for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 662–670, 2020.
- [32] Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, and Weiping Wang. Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11701–11708, 2020.
- [33] David Milner and Mel Goodale. The visual brain in action, volume 27. OUP Oxford, 2006.
- [34] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544, 2016.
- [35] Tam V Nguyen, Zheng Song, and Shuicheng Yan. Stap: Spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 25(1):77–86, 2014.
- [36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84, 2016.
- [37] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- [38] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
- [39] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 150:109–125, 2016.
- [40] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1150–1160, 2020.
- [41] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
- [42] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [43] Li Tao, Xueting Wang, and Toshihiko Yamasaki. Self-supervised video representation learning using inter-intra contrastive framework. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2193–2201, 2020.
- [44] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
- [45] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- [46] David C Van Essen and Jack L Gallant. Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13(1):1–10, 1994.
- [47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
- [48] Guangrun Wang, Keze Wang, and Liang Lin. Adaptively connected neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1781–1790, 2019.
- [49] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013.
- [50] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013.
- [51] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, and Yun-Hui Liu. Self-supervised video representation learning by uncovering spatio-temporal statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [52] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4006–4015, 2019.
- [53] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, pages 504–521. Springer, 2020.
- [54] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4305–4314, 2015.
- [55] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018.
- [56] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
- [57] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2015.
- [58] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European conference on computer vision (ECCV), pages 399–417, 2018.
- [59] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
- [60] Jun Xiao, Lin Li, Dejing Xu, Chengjiang Long, Jian Shao, Shifeng Zhang, Shiliang Pu, and Yueting Zhuang. Explore video clip order with self-supervised and curriculum learning for video applications. IEEE Transactions on Multimedia, 2020.
- [61] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.
- [62] Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6548–6557, 2020.
- [63] Jingran Zhang, Fumin Shen, Xing Xu, and Heng Tao Shen. Temporal reasoning graph for activity recognition. IEEE Transactions on Image Processing, 29:5491–5506, 2020.
- [64] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
- [65] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.
- [66] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
- [67] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
- [68] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
- [69] Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, and Mohan Kankanhalli. Explainable video action reasoning via prior knowledge and state transitions. In Proceedings of the 27th ACM International Conference on Multimedia, pages 521–529, 2019.