
Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval

Yang Liu
Sun Yat-sen University
liuy856@mail.sysu.edu.cn
   Keze Wang
DarkMatter AI
kezewang@gmail.com
   Haoyuan Lan
Sun Yat-sen University
lanhy5@mail2.sysu.edu.cn
   Liang Lin
Sun Yat-sen University
linliang@ieee.org
Abstract

Attempting to fully discover the temporal diversity and chronological characteristics of videos for self-supervised representation learning, this work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL). In contrast to existing methods that ignore modeling elaborate temporal dependencies, our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly regards inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning. To model multi-scale temporal dependencies, our TCGL integrates prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet temporal contrastive graphs. By randomly removing edges and masking nodes of the intra-snippet or inter-snippet graphs, our TCGL can generate different correlated graph views. Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different views. To adaptively learn the global context representation and recalibrate the channel-wise features, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Experimental results demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.

Figure 1: Illustration of the multi-scale temporal dependencies in videos. The action of handshaking contains the long-term (inter-snippet) temporal dependencies of walking forward, shaking hands, and hugging, while it also includes the short-term (intra-snippet) temporal dependencies of periodic hands and feet movement.

1 Introduction

Deep convolutional neural networks (CNNs) [21] have achieved state-of-the-art performance in many visual recognition tasks. This can be primarily attributed to the rich representations learned by well-trained networks on large-scale image/video datasets (e.g., ImageNet [6], Kinetics [17], SomethingSomething [9]) with strong supervision information [18]. However, annotating such large-scale data is laborious, expensive, and impractical, especially for high-level tasks built on complex data, such as video action understanding and video retrieval. To fully leverage the existing large amount of unlabeled data, self-supervised learning offers a reasonable way to exploit the intrinsic characteristics of unlabeled data to obtain supervisory signals, and has attracted increasing attention.

Different from image data, which can be handled by defining proxy tasks (e.g., predicting relative positions of image patches [7], solving jigsaw puzzles [36], inpainting images [38], and predicting the image color channel [24]) for self-supervised learning, video data additionally contains temporal information that can be leveraged to learn supervisory signals. Recently, a variety of approaches have been proposed, such as order verification [34, 8], order prediction [25, 61, 53], and speediness prediction [1, 62]. However, all of these methods consider temporal dependency only at a single scale (i.e., short-term or long-term) and ignore multi-scale temporal dependencies, i.e., they extract either snippet-level or frame-level features via 2D/3D CNNs and neglect to integrate these features to model temporal dependencies at multiple time scales.

In this work, we argue that modeling multi-scale temporal dependencies is beneficial for various video classification tasks. First, recent neuroscience studies [30, 46, 14, 5, 33] show that the human visual system can perceive detailed motion information by capturing both long-term and short-term temporal dependencies. This has inspired several well-known supervised learning methods (e.g., Nonlocal [56], PSANet [65], GloRe [4], and ACNet [48]). Second, an action usually consists of several temporal dependencies at both short-term and long-term timescales. As shown in Figure 1, the action of handshaking contains the long-term temporal dependencies (video snippets) of walking forward, shaking hands, and hugging, while it also includes the short-term temporal dependencies (frame-sets within a snippet) of periodic hand and foot movement. Randomly shuffling the frames or snippets cannot preserve the semantic content of the video. In fact, the short-term temporal dependencies within a video snippet are especially important for videos with strict temporal coherence, such as those in the SomethingSomething dataset [9]. Therefore, both short-term (e.g., intra-snippet) and long-term (e.g., inter-snippet) temporal dependencies are essential and should be jointly modeled to learn discriminative temporal representations for unlabeled videos.

Inspired by the convincing performance and high interpretability of graph convolutional networks (GCNs) [19, 47, 64], several works [56, 58, 69, 16, 63] use GCNs to increase the temporal diversity of videos in a supervised learning fashion with large amounts of labeled data. Unfortunately, due to the lack of principles or guidelines for exploring the intrinsic temporal knowledge of unlabeled video data, it remains quite challenging to utilize GCNs for self-supervised video representation learning.

To address this issue, this work regards the joint modeling of inter-snippet and intra-snippet temporal dependencies as its guideline and presents a novel self-supervised approach, named Temporal Contrastive Graph Learning (TCGL), which learns multi-scale temporal dependency knowledge within videos by guiding video snippet order prediction in an adaptive manner. Specifically, a given video is sampled into several fixed-length snippets, which are then randomly shuffled. For each snippet, all the frames from this snippet are sampled into several fixed-length frame-sets. We utilize 3D convolutional neural networks (CNNs) as the backbone network to extract features for these snippets and frame-sets. To preserve both inter-snippet and intra-snippet temporal dependencies within videos, we propose graph neural network (GNN) structures with prior knowledge about the snippet orders and frame-set orders. The video snippets of a video and their chronological characteristics are used to construct the inter-snippet temporal graph. Similarly, the frame-sets within a video snippet and their chronological characteristics are leveraged to construct the intra-snippet temporal graph. Furthermore, we randomly remove edges and mask node features of the intra-snippet or inter-snippet graphs to generate different correlated graph views. Then, specific contrastive learning modules are designed to enhance the discriminative capability of the learned temporal representations. To learn the global context representation and recalibrate the channel-wise features adaptively, we propose an adaptive video snippet order prediction module, which leverages relational knowledge among video snippets to predict the actual snippet orders. The main contributions of the paper can be summarized as follows:

  • Integrated with intra-snippet and inter-snippet temporal dependencies, we propose intra-snippet and inter-snippet temporal contrastive graphs to increase the temporal diversity among video frames and snippets in a graph contrastive self-supervised learning manner.

  • To learn the global context representation and recalibrate the channel-wise features adaptively for each video snippet, we propose an adaptive video snippet order prediction module, which employs the relational knowledge among video snippets to predict orders.

  • Extensive experiments on three networks and two downstream tasks show that the proposed method achieves state-of-the-art performance and demonstrate the great potential of the learned video representations.

The rest of the paper is organized as follows. We first review related works in Section 2, and then explain the details of the proposed method in Section 3. In Section 4, the implementation and results of the experiments are provided and analyzed. Finally, we conclude our work in Section 5.

2 Related Work

In this section, we review recent works on supervised and self-supervised video representation learning.

2.1 Supervised Video Representation Learning

Figure 2: Overview of the Temporal Contrastive Graph Learning (TCGL) framework. (a) Sample and Shuffle: sample non-overlapping snippets for each video and randomly shuffle their orders. For each snippet, all the frames from this snippet are sampled into several fixed-length frame-sets. (b) Feature Extraction: use 3D CNNs to extract the features for snippets and frame-sets. (c) Temporal Contrastive Graph Learning: intra-snippet and inter-snippet temporal contrastive graphs are constructed with the prior knowledge about the frame-set and snippet orders, see Figure 3 for more details. (d) Order Prediction: the learned snippet features from the temporal contrastive graph are forwarded through an adaptive snippet order prediction module to output the probability distribution over the possible orders.

For video representation learning, a large number of supervised learning methods have received increasing attention, including traditional methods [23, 20, 49, 50, 35, 39, 27] and deep learning methods [41, 44, 54, 28, 45, 55, 66, 26, 29]. To model and discover temporal knowledge in videos, two-stream CNNs [41] processed RGB frames and dense optical flow in separate streams and then fused the class scores of the two networks to obtain the classification result. C3D [44] processed videos with three-dimensional convolution kernels. Temporal Segment Networks (TSN) [55] sampled each video into several segments to model the long-range temporal structure of videos. Temporal Relation Network (TRN) [66] introduced an interpretable network to learn and reason about temporal dependencies between video frames at multiple temporal scales. Temporal Shift Module (TSM) [26] shifted part of the channels along the temporal dimension to facilitate information exchange among neighboring frames. Although these supervised methods achieve promising performance in modeling temporal dependencies, they require large amounts of labeled videos to train an elaborate model, which is time-consuming and labor-intensive.

2.2 Self-supervised Video Representation Learning

Although a large amount of video data exists, annotating such massive data takes great effort. Self-supervised learning generates various pretext tasks to leverage abundant unlabeled data. The model learned from pretext tasks can be directly applied to downstream tasks for feature extraction or fine-tuning. Specific contrastive learning methods have been proposed, such as NCE [11], MoCo [15], BYOL [10], and SimCLR [3]. To better model topologies, contrastive learning methods on graphs [40, 68, 12] have also attracted increasing attention.

For self-supervised video representation learning, how to effectively explore temporal information is important, and many existing works focus on discovering temporal information. Shuffle&Learn [34] randomly shuffled video frames and trained a network to distinguish whether these video frames are in the right order or not. Odd-one-out Network [8] proposed to identify unrelated or odd video clips. Order prediction network (OPN) [25] trained networks to predict the correct order of shuffled frames. VCOP [61] used 3D convolutional networks to predict the orders of shuffled video clips. SpeedNet [1] designed a network to detect whether a video is playing at a normal or sped-up rate. Video-pace [53] utilized a network to identify the right paces of different video clips. In addition to focusing on temporal dependency, Mas [52] proposed a self-supervised learning method by regressing both motion and appearance statistics along spatial and temporal dimensions. ST-puzzle [18] used space-time cubic puzzles to design the pretext task. IIC [43] introduced intra-negative samples by breaking temporal relations in video clips and used these samples to build an inter-intra contrastive framework. Although the above works utilize temporal dependency or design specific pretext tasks for video self-supervised learning, the comprehensive temporal diversity and chronological characteristics are not fully explored. In our work, we build a novel inter-intra snippet graph structure to model multi-scale temporal dependencies, and produce self-supervision signals about video snippet orders contrastively.

3 Temporal Contrastive Graph Learning

In this section, we first give a brief overview of the proposed TCGL, shown in Figure 2, which mainly consists of four stages. (1) Sample and shuffle: for each video, several snippets are uniformly sampled and shuffled; for each snippet, all of its frames are sampled into several fixed-length frame-sets. (2) Feature extraction: 3D CNNs are utilized to extract features for these snippets and frame-sets, and all 3D CNNs share the same weights. (3) Temporal contrastive graph learning: we build two kinds of temporal contrastive graph structures (intra-snippet and inter-snippet graphs) with prior knowledge about the frame-set orders and snippet orders. To generate different correlated graph views for a specific graph, we randomly remove edges and mask node features of the intra-snippet or inter-snippet graphs. Then, we design specific contrastive losses for both the intra-snippet and inter-snippet graphs to model multi-scale temporal dependencies, which increases the temporal diversity of video representations. (4) Order prediction: the learned snippet features from the temporal contrastive graphs are forwarded through an adaptive snippet order prediction module to output the probability distribution over the possible orders.

Figure 3: Details of temporal contrastive graph learning module.

For a better presentation, we first introduce several definitions. Given a video $V$, the snippets from this video are composed of continuous frames with size $c\times l\times h\times w$, where $c$ is the number of channels, $l$ is the number of frames, and $h$ and $w$ indicate the height and width of frames. The size of the 3D convolutional kernel is $t\times d\times d$, where $t$ is the temporal length and $d$ is the spatial size. We define an ordered snippet tuple as $\mathbf{S}=\langle s_{1},s_{2},\cdots,s_{n}\rangle$, and the frame-sets from snippet $s_{i}$ are denoted as $\mathbf{F}_{i}=\langle f_{1},f_{2},\cdots,f_{m}\rangle$. The subscripts here represent the chronological order. Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote a graph, where $\mathcal{V}=\{v_{1},v_{2},\cdots,v_{N}\}$ represents the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ represents the edge set. We denote the feature matrix and the adjacency matrix as $\mathbf{X}\in\mathbb{R}^{N\times F}$ and $\mathbf{A}\in\{0,1\}^{N\times N}$, where $\mathbf{x}_{i}\in\mathbb{R}^{F}$ is the feature of $v_{i}$, and $\mathbf{A}_{ij}=1$ if $(v_{i},v_{j})\in\mathcal{E}$.

3.1 Sample and Shuffle

In this stage, we randomly sample consecutive frames (snippets) from the video to construct video snippet tuples. If we sample $N$ snippets from a video, there are $N!$ possible snippet orders. Since snippet order prediction is purely a proxy task of the TCGL and our focus is the learning of 3D CNNs, we restrict the number of snippets per video to $3$ or $4$ to alleviate the complexity of the order prediction task, inspired by previous works [36, 61, 60]. The snippets are sampled uniformly from the video with an interval of $p$ frames. After sampling, the snippets are shuffled to form the snippet tuple $\mathbf{S}=\langle s_{1},s_{2},\cdots,s_{n}\rangle$. For each snippet $s_{i}$, all the frames within it are uniformly divided into $m$ frame-sets of equal length, yielding the frame-sets $\mathbf{F}_{i}=\langle f_{1},f_{2},\cdots,f_{m}\rangle$ for snippet $s_{i}$. The snippet tuples contain the dynamic information and strict chronological relations of a video, which essentially constitute its long-term temporal dependency, while the frame-level temporal relations among frame-sets within a snippet provide its short-term temporal dependency. By taking both long-term and short-term temporal dependencies into consideration, we can increase the temporal diversity more comprehensively and precisely.
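The following minimal PyTorch sketch illustrates this sampling-and-shuffling step under the settings described later (3 snippets of 16 frames, interval 8, 4 frame-sets per snippet); the helper name and the toy tensor are illustrative rather than taken from our released code.

```python
# A minimal sketch of the sample-and-shuffle stage, assuming a video is given as a
# decoded frame tensor of shape (C, T, H, W).
import random
import torch

def sample_and_shuffle(video, n_snippets=3, snippet_len=16, interval=8, n_framesets=4):
    """Uniformly sample non-overlapping snippets, shuffle them, and split each
    snippet into equal-length frame-sets."""
    c, t, h, w = video.shape
    span = n_snippets * snippet_len + (n_snippets - 1) * interval
    start = random.randint(0, max(t - span, 0))          # random temporal offset
    snippets = [video[:, start + i * (snippet_len + interval):
                         start + i * (snippet_len + interval) + snippet_len]
                for i in range(n_snippets)]              # chronological snippets

    order = list(range(n_snippets))
    random.shuffle(order)                                # shuffled order = label
    shuffled = [snippets[i] for i in order]

    fs_len = snippet_len // n_framesets
    framesets = [[s[:, j * fs_len:(j + 1) * fs_len] for j in range(n_framesets)]
                 for s in shuffled]                      # intra-snippet frame-sets
    return shuffled, framesets, order

video = torch.randn(3, 128, 112, 112)                    # toy video tensor
snips, fsets, order = sample_and_shuffle(video)
print(len(snips), len(fsets[0]), order)
```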

3.2 Feature Extraction

To extract spatio-temporal features, we choose C3D [44], R3D [45], and R(2+1)D [45] as feature encoders. The same 3D CNNs are used for all snippets and frame-sets, as Figure 2 (b) shows. C3D extends 2D CNNs to spatio-temporal representation learning, since it can model the temporal information and dynamics of videos. The C3D network consists of $8$ convolutional layers stacked one by one, interleaved with $5$ pooling layers and followed by two fully connected layers. The size of all convolutional kernels is $3\times 3\times 3$, which was validated in previous work [44]. R3D is a 3D CNN with residual connections. An R3D block consists of two 3D convolutional layers followed by batch normalization and ReLU layers, and the input and output are connected with a residual unit before the ReLU layer. R(2+1)D is similar to R3D, except that the 3D convolution is decomposed into two separate operations: a 2D spatial convolution and a 1D temporal convolution.
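A hedged sketch of the shared-weight feature extraction stage is given below; torchvision's r2plus1d_18 is used here only as a convenient stand-in for the C3D/R3D/R(2+1)D encoders, which are described in the text rather than reproduced.

```python
# Assumes torchvision >= 0.13; the classification head is replaced by identity so the
# encoder returns the 512-d pooled clip feature shared by all snippets and frame-sets.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

encoder = r2plus1d_18(weights=None)
encoder.fc = nn.Identity()                     # keep the 512-d pooled feature

def encode_clips(clips, encoder):
    """Encode a list of clips, each of shape (C, T, H, W), with one shared encoder."""
    batch = torch.stack(clips, dim=0)          # (N, C, T, H, W)
    with torch.no_grad():
        return encoder(batch)                  # (N, 512)

clips = [torch.randn(3, 16, 112, 112) for _ in range(3)]   # three shuffled snippets
feats = encode_clips(clips, encoder)
print(feats.shape)                             # torch.Size([3, 512])
```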

3.3 Temporal Contrastive Graphs

Due to the effectiveness of graph convolutional networks (GCNs) [19, 68, 59] in modeling irregular graph-structured relationships, we use them to explore node interactions within each snippet and frame-set for modeling the multi-scale temporal dependencies of videos. After obtaining the feature vectors for snippets and frame-sets, we construct two kinds of temporal contrastive graph structures, inter-snippet and intra-snippet temporal contrastive graphs, to increase the temporal diversity of videos, as shown in Figure 2 (c).

To build intra-snippet and inter-snippet temporal contrastive graphs, we take advantage of prior knowledge about the chronological relations and the corresponding feature vectors. To fix notation, we denote the intra-snippet and inter-snippet graphs as $\mathbf{G}_{intra}^{k}=f(\mathbf{X}_{intra}^{k},\mathbf{A}_{intra}^{k})$ and $\mathbf{G}_{inter}=f(\mathbf{X}_{inter},\mathbf{A}_{inter})$, respectively, where $k=1,\cdots,m$ and $m$ is the number of frame-sets in a video snippet. As shown in Figure 3, the correct order of frames within frame-sets and the correct order of snippets within snippet tuples are known as prior knowledge, because our proxy task is video snippet order prediction. Therefore, we can utilize this prior chronological relationship to determine the edges of the graphs. For example, in Figure 3, if we know that the snippets (frame-sets) are chronologically ordered as $1\rightarrow 2\rightarrow 3\rightarrow 4$, we connect the temporally related nodes and disconnect the temporally unrelated nodes. Here, we take the inter-snippet temporal graph $\mathbf{G}_{inter}$ as an example to clarify our temporal contrastive graph learning method.
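As a concrete illustration, the sketch below builds a {0,1} adjacency matrix from the known chronological order, assuming that "temporally related" means chronologically adjacent; the helper name is hypothetical.

```python
# node_order[i] gives the true temporal rank of node i (e.g., of a shuffled snippet),
# so edges link nodes whose ranks differ by one.
import torch

def temporal_adjacency(node_order):
    """Return a symmetric {0,1} adjacency matrix linking chronologically adjacent nodes."""
    n = len(node_order)
    A = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            if abs(node_order[i] - node_order[j]) == 1:
                A[i, j] = 1.0
    return A

# Example: three snippets shuffled with true ranks [2, 0, 1].
A_inter = temporal_adjacency([2, 0, 1])
print(A_inter)
# Intra-snippet graphs use the same construction over the (unshuffled) frame-sets.
A_intra = temporal_adjacency([0, 1, 2, 3])
```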

For $\mathbf{G}_{inter}$, we randomly remove edges and mask node features to generate two graph views $\mathbf{\widetilde{G}}_{inter}^{1}$ and $\mathbf{\widetilde{G}}_{inter}^{2}$, whose node embeddings are denoted as $\mathbf{U}=f(\mathbf{\widetilde{X}}_{inter}^{1},\mathbf{\widetilde{A}}_{inter}^{1})$ and $\mathbf{V}=f(\mathbf{\widetilde{X}}_{inter}^{2},\mathbf{\widetilde{A}}_{inter}^{2})$. Since different graph views provide different contexts for each node, we corrupt the original graph at both the structure and attribute levels to achieve contrastive learning between node embeddings from different views. To this end, we adopt two strategies for generating graph views: removing edges and masking node features.

The edges in the original graph are randomly removed using a random masking matrix $\mathbf{\widetilde{R}}\in\{0,1\}^{N\times N}$, whose entries are drawn from a Bernoulli distribution, $\mathbf{\widetilde{R}}_{ij}\sim\mathcal{B}(1-p_{r})$ if $\mathbf{A}_{ij}=1$ in the original graph and $\mathbf{\widetilde{R}}_{ij}=0$ otherwise. Here $p_{r}$ denotes the probability of each edge being removed. The resulting adjacency matrix is then computed as follows, where $\circ$ is the Hadamard product:

$\mathbf{\widetilde{A}}=\mathbf{A}\circ\mathbf{\widetilde{R}}$ (1)

In addition, part of the node features are masked with zeros using a random vector $\mathbf{\widetilde{m}}\in\{0,1\}^{F}$, each dimension of which is drawn from a Bernoulli distribution, $\widetilde{m}_{i}\sim\mathcal{B}(1-p_{m}),\ \forall i$. The masked feature matrix $\mathbf{\widetilde{X}}$ is then calculated as follows:

$\mathbf{\widetilde{X}}=[\mathbf{x}_{1}\circ\mathbf{\widetilde{m}};\ \mathbf{x}_{2}\circ\mathbf{\widetilde{m}};\ \cdots;\ \mathbf{x}_{N}\circ\mathbf{\widetilde{m}}]^{\top}$ (2)

where $[\cdot\,;\cdot]$ is the concatenation operator. We jointly apply these two strategies to generate graph views.
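The two corruption strategies of Eqs. (1) and (2) can be sketched as follows; the function name and toy tensors are illustrative.

```python
# Edge removal with probability p_r and feature masking with probability p_m are
# applied jointly to produce one graph view; p_r = p_m = 0 reproduces the original graph.
import torch

def generate_view(X, A, p_r=0.2, p_m=0.1):
    """Corrupt a graph (X: N x F features, A: N x N adjacency) at topology and attribute levels."""
    # Eq. (1): keep each existing edge with probability 1 - p_r.
    R = torch.bernoulli((1.0 - p_r) * torch.ones_like(A)) * (A > 0).float()
    A_tilde = A * R
    # Eq. (2): mask the same feature dimensions for every node.
    m = torch.bernoulli((1.0 - p_m) * torch.ones(X.size(1)))
    X_tilde = X * m                                  # broadcasts over nodes
    return X_tilde, A_tilde

X = torch.randn(3, 512)                              # three snippet features
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X1, A1 = generate_view(X, A, p_r=0.2, p_m=0.1)       # view 1 (corrupted)
X2, A2 = generate_view(X, A, p_r=0.0, p_m=0.0)       # view 2 (uncorrupted)
```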

Inspired by the NCE loss [11], we propose a contrastive learning loss that distinguishes the embeddings of the same node in the two distinct views from all other node embeddings. Given a positive pair, the negative samples come from all other nodes in the two views (inter-view or intra-view). To compute the relationship between embeddings $\mathbf{u}$ and $\mathbf{v}$ from the two views, we define the relation function $\phi(\mathbf{u},\mathbf{v})=f(g(\mathbf{u}),g(\mathbf{v}))$, where $f$ is the L2-normalized dot-product similarity and $g$ is a non-linear projection implemented as a two-layer multi-layer perceptron. The pairwise contrastive objective for a positive pair $(\mathbf{u}_{i},\mathbf{v}_{i})$ is defined as:

$\ell(\mathbf{u}_{i},\mathbf{v}_{i})=-\log\frac{e^{\phi(\mathbf{u}_{i},\mathbf{v}_{i})/\tau}}{e^{\phi(\mathbf{u}_{i},\mathbf{v}_{i})/\tau}+\sum\limits_{k=1}^{N}\mathbb{I}_{[k\neq i]}e^{\phi(\mathbf{u}_{i},\mathbf{v}_{k})/\tau}+\sum\limits_{k=1}^{N}\mathbb{I}_{[k\neq i]}e^{\phi(\mathbf{u}_{i},\mathbf{u}_{k})/\tau}}$ (3)

where $\mathbb{I}_{[k\neq i]}\in\{0,1\}$ is an indicator function that equals $1$ if $k\neq i$, and $\tau$ is a temperature parameter, which is empirically set to $0.5$. The first term in the denominator corresponds to the positive pair, the second term to the inter-view negative pairs, and the third term to the intra-view negative pairs. Since the two views are symmetric, the loss $\ell(\mathbf{v}_{i},\mathbf{u}_{i})$ for the other view is defined analogously. The overall contrastive loss for $\mathbf{G}_{inter}$ is defined as follows:

$\mathcal{J}_{inter}=\frac{1}{2N}\sum\limits_{i=1}^{N}\left[\ell(\mathbf{u}_{i},\mathbf{v}_{i})+\ell(\mathbf{v}_{i},\mathbf{u}_{i})\right]$ (4)

The contrastive loss for the intra-snippet graphs $\mathbf{G}_{intra}^{k}\ (k=1,\cdots,m)$ is computed in the same way as for $\mathbf{G}_{inter}$:

$\mathcal{J}_{intra}^{k}=\frac{1}{2N}\sum\limits_{i=1}^{N}\left[\ell^{k}(\mathbf{u}_{i},\mathbf{v}_{i})+\ell^{k}(\mathbf{v}_{i},\mathbf{u}_{i})\right]$ (5)

The overall temporal contrastive graph loss is then defined as follows, where $\alpha$ and $\beta$ are the weights for the intra-snippet and inter-snippet graphs, respectively:

$\mathcal{J}_{g}=\alpha\sum\limits_{k=1}^{m}\mathcal{J}_{intra}^{k}+\beta\,\mathcal{J}_{inter}$ (6)
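A possible implementation of Eqs. (3)-(6) is sketched below, assuming a two-layer MLP projection head and cosine similarity; the class and function names are illustrative, not our released code.

```python
# U and V are the node embeddings of the two graph views produced by the GCN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Two-layer non-linear projection g."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
    def forward(self, x):
        return self.net(x)

def graph_contrastive_loss(U, V, g, tau=0.5):
    """Symmetric InfoNCE-style loss over corresponding nodes of two graph views (Eqs. (3)-(4))."""
    zu = F.normalize(g(U), dim=1)
    zv = F.normalize(g(V), dim=1)
    n = zu.size(0)
    sim_uv = zu @ zv.t() / tau                      # inter-view similarities
    sim_uu = zu @ zu.t() / tau                      # intra-view similarities
    sim_vv = zv @ zv.t() / tau
    mask = torch.eye(n, dtype=torch.bool)

    def one_direction(s_cross, s_self):
        pos = torch.diag(s_cross)                   # phi(u_i, v_i) / tau
        neg_cross = torch.logsumexp(s_cross.masked_fill(mask, float('-inf')), dim=1)
        neg_self = torch.logsumexp(s_self.masked_fill(mask, float('-inf')), dim=1)
        denom = torch.logsumexp(torch.stack([pos, neg_cross, neg_self], dim=1), dim=1)
        return -(pos - denom)                       # Eq. (3) per node

    return 0.5 * (one_direction(sim_uv, sim_uu) + one_direction(sim_uv.t(), sim_vv)).mean()

U, V = torch.randn(3, 512), torch.randn(3, 512)     # inter-snippet node embeddings
g = Projector(512)
j_inter = graph_contrastive_loss(U, V, g)
# Eq. (6): J_g = alpha * sum_k J_intra^k + beta * J_inter, with alpha = beta = 1.
```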

3.4 Adaptive Order Prediction

We formulate the order prediction task as a classification task that takes the video snippet features learned from the temporal contrastive graph as input and outputs a probability distribution over the possible orders. Since the features from different video snippets are correlated, we build an adaptive order prediction module that receives features from different video snippets and learns a global context embedding; this embedding is then used to recalibrate the input features from different snippets, as shown in Figure 4.

Figure 4: Adaptive snippet order prediction module.

To fix notation, we assume that a video is shuffled into $n$ snippets and that the snippet node features learned by inter-snippet temporal contrastive graph learning are $\{\mathbf{f}_{1},\cdots,\mathbf{f}_{n}\}$, where $\mathbf{f}_{k}\in\mathbb{R}^{c_{k}}\ (k=1,\cdots,n)$. To utilize the correlation among these snippets, we concatenate these feature vectors and obtain a joint representation through a fully-connected layer:

$\mathbf{Z}=\mathbf{W}_{s}[\mathbf{f}_{1},\cdots,\mathbf{f}_{n}]+\mathbf{b}_{s}$ (7)

where $[\cdot,\cdots,\cdot]$ denotes the concatenation operation, $\mathbf{Z}\in\mathbb{R}^{c_{con}}$ denotes the joint representation, and $\mathbf{W}_{s}$ and $\mathbf{b}_{s}$ are the weights and bias of the fully-connected layer. We choose $c_{con}=\frac{\sum_{k=1}^{n}c_{k}}{2n}$ to restrict the model capacity and increase its generalization ability. To make use of the global context information aggregated in the joint representation $\mathbf{Z}$, we predict an excitation signal from it with another fully-connected layer:

$\mathbf{E}=\mathbf{W}_{e}\mathbf{Z}+\mathbf{b}_{e}$ (8)

where $\mathbf{W}_{e}$ and $\mathbf{b}_{e}$ are the weights and bias of the fully-connected layer. After obtaining the excitation signal $\mathbf{E}\in\mathbb{R}^{c}$, we use it to recalibrate the input feature $\mathbf{f}_{k}$ adaptively through a simple gating mechanism:

$\mathbf{\widetilde{f}}_{k}=\delta(\mathbf{E})\odot\mathbf{f}_{k}$ (9)

where \odot is channel-wise product operation for each element in the channel dimension, and δ()\delta(\cdot) is the ReLU function. In this way, we can allow the features of one snippet to recalibrate the features of another snippet while concurrently preserving the correlation among different snippets.

Finally, these refined features $\{\mathbf{\widetilde{f}}_{1},\cdots,\mathbf{\widetilde{f}}_{n}\}$ are concatenated and fed into a two-layer perceptron with softmax to output the snippet order prediction. The cross-entropy loss is used to measure the correctness of the prediction:

$\mathcal{J}_{o}=-\sum\limits_{i=1}^{C}\mathbf{y}_{i}\log(\mathbf{p}_{i})$ (10)

where $y_{i}$ and $p_{i}$ denote the ground-truth and predicted probability that the sample belongs to order class $i$, respectively, and $C$ denotes the number of all possible orders.
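The adaptive order prediction module of Eqs. (7)-(10) can be sketched as follows for $n=3$ snippets with 512-d features (hence $3!=6$ order classes); apart from the stated $c_{con}$ rule, the module and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveOrderPrediction(nn.Module):
    def __init__(self, n_snippets=3, feat_dim=512, hidden=512):
        super().__init__()
        c_con = (n_snippets * feat_dim) // (2 * n_snippets)        # Eq. (7) bottleneck
        self.squeeze = nn.Linear(n_snippets * feat_dim, c_con)     # W_s, b_s
        self.excite = nn.Linear(c_con, feat_dim)                   # W_e, b_e
        n_orders = 1
        for i in range(2, n_snippets + 1):
            n_orders *= i                                          # n! possible orders
        self.classifier = nn.Sequential(                           # two-layer prediction head
            nn.Linear(n_snippets * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_orders))

    def forward(self, feats):                                      # feats: (B, n, F)
        b, n, f = feats.shape
        z = self.squeeze(feats.reshape(b, n * f))                  # Eq. (7)
        e = torch.relu(self.excite(z))                             # Eqs. (8)-(9): gate delta(E)
        refined = feats * e.unsqueeze(1)                           # recalibrate each snippet
        return self.classifier(refined.reshape(b, n * f))          # order logits

model = AdaptiveOrderPrediction()
feats = torch.randn(4, 3, 512)                                     # batch of snippet features
logits = model(feats)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 6, (4,)))    # Eq. (10)
```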

The overall self-supervised learning loss for TCGL is obtained by combining Eq. (6) and Eq. (10), where $\lambda_{g}$ and $\lambda_{o}$ control the contributions of $\mathcal{J}_{g}$ and $\mathcal{J}_{o}$, respectively:

$\mathcal{J}=\lambda_{g}\mathcal{J}_{g}+\lambda_{o}\mathcal{J}_{o}$ (11)

4 Experiments

In this section, we first describe the experimental settings and then conduct ablation studies to analyze the contribution of key components. Finally, the learned 3D CNNs are evaluated on video action recognition and video retrieval tasks and compared with state-of-the-art methods.

4.1 Experimental Setting

Datasets. We evaluate our method on three action recognition datasets: UCF101 [42], HMDB51 [22], and Kinetics-400 [17]. UCF101 is collected from websites and contains $101$ action classes with $9.5$k videos for training and $3.5$k videos for testing. HMDB51 is collected from various sources with $51$ action classes, $3.4$k videos for training, and $1.4$k videos for testing. Kinetics-400 (K400) is a large-scale action recognition dataset that contains $400$ action classes and around 306k videos. In this work, we use its training split (around 240k videos) as the pre-training dataset.

Network Architecture. We use PyTorch [37] to implement the whole framework. For the video encoder, C3D, R3D, and R(2+1)D are used as backbones, where the kernel size of the 3D convolutional layers is set to $3\times 3\times 3$. The R3D network is implemented with no repetitions in conv{2-5}_x, resulting in $9$ convolutional layers in total. The C3D network is modified by replacing the two fully connected layers with global spatio-temporal pooling layers. The R(2+1)D network has the same architecture as the R3D network, with only the 3D kernels decomposed. Dropout layers with $p=0.1$ are applied between fully-connected layers. Our GCNs for both the inter-snippet and intra-snippet graphs consist of one graph convolutional layer with $512$ output channels.

Parameters. Following the settings in [61, 62], we set the snippet length of the input video to $16$ frames, the interval length to $8$ frames, the number of snippets per tuple to $3$, and the number of frame-sets within each snippet to $4$. During training, we randomly split $800$ videos from the training set as the validation set. Video frames are resized to $128\times 171$ and then randomly cropped to $112\times 112$. We set $\lambda_{g}=\lambda_{o}=1$ to balance the contributions of the temporal contrastive graph module and the adaptive order prediction module. The weights $\alpha$ and $\beta$ are both set to $1$ according to the ablation study. To optimize the framework, we use mini-batch stochastic gradient descent with a batch size of $16$, an initial learning rate of $0.001$, momentum of $0.9$, and weight decay of $0.0005$. The training process lasts for $300$ epochs, and the learning rate is decreased to $0.0001$ after $150$ epochs. To make the temporal contrastive graphs sensitive to subtle variance between different graph views, the parameters $p_{r}$ and $p_{m}$ for generating graph view 1 are empirically set to $0.2$ and $0.1$, while $p_{r}=p_{m}=0$ for generating graph view 2; the values of $p_{r}$ and $p_{m}$ are the same for the inter-snippet and intra-snippet graphs. The model with the lowest validation loss is saved as the best model.
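For reference, the optimization recipe above corresponds to the following PyTorch sketch; the placeholder model and the commented training loop are illustrative, assuming a total_loss that implements Eq. (11).

```python
import torch

model = torch.nn.Linear(512, 6)   # placeholder for the full TCGL model (backbone + GCNs + ASOR head)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.1)

for epoch in range(300):
    # for snippets, framesets, order_label in train_loader:       # batch size 16
    #     loss = total_loss(snippets, framesets, order_label)     # Eq. (11), lambda_g = lambda_o = 1
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                                              # lr 0.001 -> 0.0001 after 150 epochs
```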

4.2 Ablation Study

In this subsection, we conduct ablation studies on the first split of UCF101 with R3D as the backbone, to analyze the contribution of each component of our TCGL.

The number of snippets. The results of R3D on the snippet order prediction task with different numbers of snippets are shown in Table 1. The prediction accuracy decreases when the number of snippets increases from $3$ to $4$, because the difficulty of the prediction task grows with the number of snippets, making the model harder to learn. Therefore, we use $3$ snippets per video as a compromise between task complexity and prediction accuracy.

The number of frame-sets. Since the snippet length is $16$, the number of frame-sets within each snippet can be $1$, $2$, $4$, $8$, or $16$. When the number is $16$, each frame-set contains only static information without temporal information. When the number is $1$ or $2$, it is hard to model the intra-snippet temporal relationship with so few frame-sets. From Table 2, we observe that more frame-sets within a snippet make intra-snippet temporal modeling more difficult, which degrades the order prediction performance. Therefore, we choose $4$ frame-sets per snippet for short-term temporal modeling.

The intra-snippet and inter-snippet graphs. To analyze the contributions of the intra-snippet and inter-snippet temporal contrastive graphs, we set different values for $\alpha$ and $\beta$, as shown in Table 3. Notably, removing the intra-snippet graph degrades the performance significantly even when the inter-snippet graph is kept, which verifies the importance of intra-snippet graphs for modeling short-term temporal dependency. Additionally, when the weights of the intra-snippet and inter-snippet graphs are set to $1$ and $0.1$, the prediction accuracy is $80.2\%$, whereas exchanging their weights drops the accuracy to $54.9\%$, again validating the importance of intra-snippet graphs in modeling short-term temporal dependency. Moreover, the performance of TCGL drops significantly when either or both of the graphs are removed. The prediction accuracy is best ($83.0\%$) when $\alpha=\beta=1$. These results validate that both intra-snippet and inter-snippet temporal contrastive graphs are essential for increasing the temporal diversity of features.

Model Snippet Length Snippets Number Accuracy
R3D 16 2 53.4
R3D 16 3 80.2
R3D 16 4 61.1
Table 1: Snippet order prediction accuracy (%) with different numbers of snippets within each video.
Model Snippet Length Frame-set Number Accuracy
R3D 16 1 54.1
R3D 16 2 54.1
R3D 16 4 80.2
R3D 16 8 63.1
Table 2: Snippet order prediction accuracy (%) with various numbers of frame-sets within each snippet.
Model Intra ($\alpha$) Inter ($\beta$) Prediction Recognition
R3D 0 0 55.6 57.3
R3D 0 1 55.6 56.0
R3D 1 0 76.6 60.9
R3D 0.1 1 54.9 56.7
R3D 1 0.1 80.2 66.8
R3D 1 1 83.0 67.6
Table 3: Snippet order prediction and action recognition accuracy (%) with different weights ($\alpha$, $\beta$) of intra-/inter-snippet graphs.
Method Model ASOR Prediction Recognition
TCGL R3D ✗ 78.4 62.8
TCGL R3D ✓ 80.2 67.6
Table 4: Snippet order prediction and action recognition accuracy (%) with/without adaptive snippet order prediction (ASOR).

The adaptive snippet order prediction. To analyze the contribution of the proposed adaptive snippet order prediction (ASOR) module, we remove this module and merely feed the concatenated features into a multi-layer perceptron with softmax to output the final snippet order prediction. As shown in Table 4, our TCGL performs better than the variant without the ASOR module on both the order prediction and action recognition tasks. This verifies that the ASOR module utilizes the relational knowledge among video snippets better than simple concatenation.

The methods of introducing noises. To justify our approach of corrupting graphs at both the feature and topology levels, we compare it with three simple graph corruption methods: (1) adding random Gaussian noise; (2) randomly removing edges; (3) randomly removing nodes. The (order prediction, action recognition) accuracies (%) are $(56.1, 51.8)$, $(55.2, 53.4)$, and $(54.9, 52.7)$, respectively, while our method achieves $(80.2, 67.6)$, justifying the superiority of modeling at both the feature and topology levels.

Method Backbone Pretrain UCF101 HMDB51
Object Patch [57] AlexNet UCF101 42.7 15.6
Shuffle [34] CaffeNet UCF101 50.9 19.8
OPN [25] VGG UCF101 56.3 22.1
Deep RL [2] CaffeNet UCF101 58.6 25.0
Random (Baseline) C3D UCF101 61.8 24.7
Mas [52] C3D UCF101 58.8 32.6
VCOP[61] C3D UCF101 65.6 28.4
COP[60] C3D UCF101 66.9 31.8
PRP [62] C3D UCF101 69.1 34.5
ST-puzzle [18] C3D K400 60.6 28.3
Mas [52] C3D K400 61.2 33.4
STS [51] C3D K400 71.8 37.8
TCGL (Ours) C3D UCF101 69.5 35.1
TCGL (Ours) C3D K400 75.2 38.9
Random (Baseline) R3D UCF101 54.5 23.4
VCOP [61] R3D UCF101 64.9 29.5
COP [60] R3D UCF101 66.0 28.0
TCP [31] R3D UCF101 64.8 34.7
PRP [62] R3D UCF101 66.5 29.7
ST-puzzle [18] R3D K400 65.8 33.7
DPC [13] R3D K400 68.2 34.5
TCP [31] R3D K400 70.5 41.1
TCGL (Ours) R3D UCF101 67.6 30.8
TCGL (Ours) R3D K400 76.8 41.5
Random (Baseline) R(2+1)D UCF101 55.8 22.0
VCP [32] R(2+1)D UCF101 66.3 32.2
VCOP [61] R(2+1)D UCF101 72.4 30.9
STS [51] R(2+1)D UCF101 73.6 34.1
COP [60] R(2+1)D UCF101 74.5 34.8
PRP [62] R(2+1)D UCF101 72.1 35.0
V-pace [53] R(2+1)D UCF101 75.9 35.9
V-pace [53] R(2+1)D K400 77.1 36.6
TCGL (Ours) R(2+1)D UCF101 74.9 36.2
TCGL (Ours) R(2+1)D K400 77.6 39.7
Table 5: Comparison with the state-of-the-art self-supervised learning methods on UCF101 and HMDB51 datasets.

4.3 Action Recognition

To verify the effectiveness of our TCGL in action recognition, we initialize the backbones with the model pre-trained on the first split of UCF101 or the whole K400 training set, and fine-tune on UCF101 and HMDB51; fine-tuning stops after $150$ epochs. The features extracted by the backbones are fed into fully-connected layers to obtain the prediction. For testing, we sample $10$ clips for each video to generate clip-level predictions and then average these predictions to obtain the final result. The average classification accuracy over three splits is reported and compared with other self-supervised methods in Table 5. "Random" means the model is randomly initialized without pre-training. When pre-trained on UCF101, we outperform the current best-performing method PRP [62]. When pre-trained on K400, we outperform the current best-performing method V-pace [53]. In addition, we consistently outperform the other state-of-the-art methods and random initialization for all evaluation metrics. Furthermore, when pre-trained on UCF101, we achieve better accuracies than some K400 pre-trained methods (Mas [52] and ST-puzzle [18]). These results validate that our TCGL can effectively increase the temporal diversity of videos and learn discriminative spatio-temporal representations.
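The clip-level evaluation protocol can be sketched as follows; the uniform-sampling helper and the dummy classifier are illustrative stand-ins for the fine-tuned 3D CNN.

```python
# Sample 10 clips per video, run the classifier on each, and average the softmax
# predictions into a video-level prediction.
import torch

def video_prediction(video, model, n_clips=10, clip_len=16):
    """Average softmax predictions over uniformly sampled clips of one video (C, T, H, W)."""
    c, t, h, w = video.shape
    starts = torch.linspace(0, max(t - clip_len, 0), n_clips).long()
    clips = torch.stack([video[:, int(s):int(s) + clip_len] for s in starts])  # (n_clips, C, T', H, W)
    with torch.no_grad():
        probs = torch.softmax(model(clips), dim=1)                             # per-clip class scores
    return probs.mean(dim=0)                                                   # video-level prediction

class DummyClassifier(torch.nn.Module):                 # stand-in for the fine-tuned 3D CNN
    def forward(self, x):
        return torch.randn(x.size(0), 101)              # 101 UCF101 classes

pred = video_prediction(torch.randn(3, 160, 112, 112), DummyClassifier())
print(pred.argmax().item())
```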

Method Backbone Pretrain T1 T5 T10 T20 T50
Jigsaw [36] AlexNet UCF 19.7 28.5 33.5 40.0 49.4
OPN [25] VGG UCF 19.9 28.7 34.0 40.6 51.6
Deep RL [2] CaffeNet UCF 25.7 36.2 42.2 49.2 59.5
SpeedNet [1] S3D-G K400 13.0 28.1 37.5 49.5 65.0
Random C3D UCF 16.7 27.5 33.7 41.4 53.0
VCOP [61] C3D UCF 12.5 29.0 39.0 50.6 66.9
PRP [62] C3D UCF 23.2 38.1 46.0 55.7 68.4
V-pace [53] C3D UCF 20.0 37.4 46.9 58.5 73.1
TCGL (Ours) C3D UCF 22.5 40.7 49.8 59.9 73.3
TCGL (Ours) C3D K400 23.6 41.2 50.1 60.4 74.2
Random R3D UCF 9.9 18.9 26.0 35.5 51.9
VCOP [61] R3D UCF 14.1 30.3 40.4 51.1 66.5
PRP [62] R3D UCF 22.8 38.5 46.7 55.2 69.1
V-pace [53] R3D UCF 19.9 36.2 46.1 55.6 69.2
TCGL (Ours) R3D UCF 23.3 39.6 48.4 58.8 72.4
TCGL (Ours) R3D K400 23.9 43.0 53.0 62.9 75.7
Random R(2+1)D UCF 10.6 20.7 27.4 37.4 53.1
VCOP [61] R(2+1)D UCF 10.7 25.9 35.4 47.3 63.9
PRP [62] R(2+1)D UCF 20.3 34.0 41.9 51.7 64.2
V-pace [53] R(2+1)D UCF 17.9 34.3 44.6 55.5 72.0
TCGL (Ours) R(2+1)D UCF 20.6 36.2 45.5 56.2 72.1
TCGL (Ours) R(2+1)D K400 21.9 40.2 49.6 59.7 73.1
Table 6: Video retrieval result (%) on UCF101. ‘T’ denotes ‘Top’.
Method Backbone Pretrain T1 T5 T10 T20 T50
Random C3D UCF 7.4 20.5 31.9 44.5 66.3
VCOP [61] C3D UCF 7.4 22.6 34.4 48.5 70.1
PRP [62] C3D UCF 10.5 27.2 40.4 56.2 75.9
V-pace [53] C3D UCF 8.0 25.2 37.8 54.4 77.5
TCGL (Ours) C3D UCF 10.7 28.6 41.1 57.9 77.7
TCGL (Ours) C3D K400 12.3 30.4 42.9 59.1 79.2
Random R3D UCF 6.7 18.3 28.3 43.1 67.9
VCOP [61] R3D UCF 7.6 22.9 34.4 48.8 68.9
PRP [62] R3D UCF 8.2 25.8 38.5 53.3 75.9
V-pace [53] R3D UCF 8.2 24.2 37.3 53.3 74.5
TCGL (Ours) R3D UCF 10.9 29.5 42.9 56.9 76.8
TCGL (Ours) R3D K400 11.2 30.6 43.8 58.1 78.0
Random R(2+1)D UCF 4.5 14.8 23.4 38.9 63.0
VCOP [61] R(2+1)D UCF 5.7 19.5 30.7 45.8 67.0
PRP [62] R(2+1)D UCF 8.2 25.3 36.2 51.0 73.0
V-pace [53] R(2+1)D UCF 10.1 24.6 37.6 54.4 77.1
TCGL (Ours) R(2+1)D UCF 11.1 30.4 43.0 56.5 77.4
TCGL (Ours) R(2+1)D K400 13.2 33.5 46.4 59.3 80.2
Table 7: Video retrieval result (%) on HMDB51. 'T' denotes 'Top'.

4.4 Video Retrieval

To further verify the effectiveness of our TCGL in video retrieval, we test it on nearest-neighbor video retrieval. Since the video retrieval task is conducted with features extracted by the backbone network without fine-tuning, its performance relies largely on the representative capacity of the self-supervised model. The experiment is conducted on the first split of UCF101 or the whole training set of K400, following the protocol in [61, 62]. In video retrieval, we extract video features with the backbone pre-trained by TCGL. Each video in the testing set is used to query the $k$ nearest videos from the training set using cosine distance. When the class of a test video appears among the classes of its $k$ nearest training videos, it is counted as a correct retrieval. We report top-1, top-5, top-10, top-20, and top-50 retrieval accuracies on the UCF101 and HMDB51 datasets and compare our method with other self-supervised methods in Tables 6 and 7. For all backbones, our TCGL outperforms the state-of-the-art methods on nearly all evaluation metrics by substantial margins. Figure 5 visualizes a query video snippet and its top-3 nearest neighbors from the UCF101 training set using the TCGL embedding; the representation learned by TCGL is able to retrieve videos with the same semantic meaning. To better understand what TCGL learns, we follow the Class Activation Map [67] approach to visualize the attended spatio-temporal regions, as shown in Figure 6. These examples exhibit a strong correlation between highly activated regions and the dominant movement in the scene, validating that our TCGL learns discriminative temporal representations for videos.
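The retrieval protocol corresponds to the following sketch, assuming features have already been extracted with the frozen TCGL backbone; the helper name and toy tensors are illustrative.

```python
# Query each test feature against all training features with cosine similarity and
# count a hit if the query's class appears among the top-k retrieved classes.
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(test_feats, test_labels, train_feats, train_labels, k=5):
    sim = F.normalize(test_feats, dim=1) @ F.normalize(train_feats, dim=1).t()
    topk = sim.topk(k, dim=1).indices                     # indices of k nearest training videos
    retrieved = train_labels[topk]                        # their class labels, (n_test, k)
    hits = (retrieved == test_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy example with random 512-d features and 101 classes.
train_f, train_y = torch.randn(200, 512), torch.randint(0, 101, (200,))
test_f, test_y = torch.randn(50, 512), torch.randint(0, 101, (50,))
for k in (1, 5, 10, 20, 50):
    print(k, topk_retrieval_accuracy(test_f, test_y, train_f, train_y, k))
```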

Figure 5: Video retrieval results with TCGL representations.
Figure 6: Activation maps on UCF101 and HMDB51 datasets.

5 Conclusion

In this paper, we propose a novel Temporal Contrastive Graph Learning (TCGL) approach for self-supervised video representation learning. With the inter-intra snippet graph contrastive learning strategy and the adaptive video snippet order prediction task, the temporal diversity and multi-scale temporal dependencies of videos can be well discovered. The proposed TCGL is applied to video action recognition and video retrieval tasks with three kinds of 3D CNNs. Extensive experiments demonstrate the superiority of our TCGL over state-of-the-art methods on large-scale benchmarks.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grant 62002395, the China Postdoctoral Science Foundation funded project under Grant 2020M672966, and the Fundamental Research Funds for the Central Universities under Grant 20lgpy131.

References

  • [1] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9922–9931, 2020.
  • [2] Uta Buchler, Biagio Brattoli, and Bjorn Ommer. Improving spatiotemporal self-supervision by deep reinforcement learning. In Proceedings of the European conference on computer vision (ECCV), pages 770–786, 2018.
  • [3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning,, pages 1597–1607, 2020.
  • [4] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
  • [5] Tom Cornsweet. Visual perception. Academic press, 2012.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [7] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422–1430, 2015.
  • [8] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3636–3645, 2017.
  • [9] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, page 5, 2017.
  • [10] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • [11] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
  • [12] Hakim Hafidi, Mounir Ghogho, Philippe Ciblat, and Ananthram Swami. Graphcl: Contrastive self-supervised learning of graph representations. arXiv preprint arXiv:2007.08025, 2020.
  • [13] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [14] J Hans and Johannes RM Cruysberg. The visual system. In Clinical Neuroanatomy, pages 409–453. Springer, 2020.
  • [15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • [16] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
  • [17] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [18] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8545–8552, 2019.
  • [19] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [20] Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1, 2008.
  • [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [22] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
  • [23] Ivan Laptev. On space-time interest points. International journal of computer vision, 64(2-3):107–123, 2005.
  • [24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6874–6883, 2017.
  • [25] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 667–676, 2017.
  • [26] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 7083–7093, 2019.
  • [27] Yang Liu, Zhaoyang Lu, Jing Li, and Tao Yang. Hierarchically learned view-invariant representations for cross-view action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(8):2416–2430, 2018.
  • [28] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Global temporal representation based cnns for infrared action recognition. IEEE Signal Processing Letters, 25(6):848–852, 2018.
  • [29] Yang Liu, Zhaoyang Lu, Jing Li, Tao Yang, and Chao Yao. Deep image-to-video adaptation and fusion networks for action recognition. IEEE Transactions on Image Processing, 29:3168–3182, 2019.
  • [30] Margaret Livingstone and David Hubel. Segregation of form, color, movement, and depth: anatomy, physiology, and perception. Science, 240(4853):740–749, 1988.
  • [31] Guillaume Lorre, Jaonary Rabarisoa, Astrid Orcesi, Samia Ainouz, and Stephane Canu. Temporal contrastive pretraining for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 662–670, 2020.
  • [32] Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, and Weiping Wang. Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11701–11708, 2020.
  • [33] David Milner and Mel Goodale. The visual brain in action, volume 27. OUP Oxford, 2006.
  • [34] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544, 2016.
  • [35] Tam V Nguyen, Zheng Song, and Shuicheng Yan. Stap: Spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 25(1):77–86, 2014.
  • [36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84, 2016.
  • [37] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [38] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [39] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding, 150:109–125, 2016.
  • [40] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1150–1160, 2020.
  • [41] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [42] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [43] Li Tao, Xueting Wang, and Toshihiko Yamasaki. Self-supervised video representation learning using inter-intra contrastive framework. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2193–2201, 2020.
  • [44] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [45] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  • [46] David C Van Essen and Jack L Gallant. Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13(1):1–10, 1994.
  • [47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • [48] Guangrun Wang, Keze Wang, and Liang Lin. Adaptively connected neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1781–1790, 2019.
  • [49] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013.
  • [50] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pages 3551–3558, 2013.
  • [51] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, and Yun-Hui Liu. Self-supervised video representation learning by uncovering spatio-temporal statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [52] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4006–4015, 2019.
  • [53] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, pages 504–521. Springer, 2020.
  • [54] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4305–4314, 2015.
  • [55] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence, 41(11):2740–2755, 2018.
  • [56] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [57] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2015.
  • [58] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European conference on computer vision (ECCV), pages 399–417, 2018.
  • [59] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • [60] Jun Xiao, Lin Li, Dejing Xu, Chengjiang Long, Jian Shao, Shifeng Zhang, Shiliang Pu, and Yueting Zhuang. Explore video clip order with self-supervised and curriculum learning for video applications. IEEE Transactions on Multimedia, 2020.
  • [61] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.
  • [62] Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6548–6557, 2020.
  • [63] Jingran Zhang, Fumin Shen, Xing Xu, and Heng Tao Shen. Temporal reasoning graph for activity recognition. IEEE Transactions on Image Processing, 29:5491–5506, 2020.
  • [64] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • [65] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.
  • [66] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
  • [67] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
  • [68] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.
  • [69] Tao Zhuo, Zhiyong Cheng, Peng Zhang, Yongkang Wong, and Mohan Kankanhalli. Explainable video action reasoning via prior knowledge and state transitions. In Proceedings of the 27th ACM International Conference on Multimedia, pages 521–529, 2019.