1 The Hong Kong Polytechnic University
  email: {cswxiang,cslzhang}@comp.polyu.edu.hk
2 DAMO Academy, Alibaba, Hangzhou, China
  email: {lllcho.lc,wb.wangbiao,xihan.wxh,xiansheng.hxs}@alibaba-inc.com
Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
Abstract
Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of self-attention computation. How to efficiently and effectively model the 3D self-attention of video data has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of patches with a specific mosaic pattern in the temporal dimension, thus converting a vanilla spatial self-attention operation to a spatiotemporal one with little additional cost. As a result, we can compute 3D self-attention using nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module and can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves performance competitive with state-of-the-art methods on Something-something V1 & V2, Diving-48, and Kinetics400 while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/MartinXM/TPS.
Keywords:
action recognition, transformer, temporal patch shift
1 Introduction
Significant progress has been achieved in video-based action recognition in recent years [4, 31, 13, 40, 22], largely driven by the development of 3D Convolutional Neural Networks (3D-CNNs) and their factorized versions, including I3D [4], Slowfast [13], P3D [31] and TSM [22]. With the recent success of transformer-based methods on image-based tasks such as image classification, segmentation and detection [8, 23, 44, 16, 36, 5], researchers have been trying to extend the success of transformers from image-based tasks to video-based tasks [2, 1, 24]. Specifically, videos are tokenized as 3D patches, and then multi-head Self-Attention (SA) and Feed-Forward Networks (FFN) are utilized for spatiotemporal feature learning. However, the extra temporal dimension of video data largely increases the number of patches, which leads to a dramatic increase in computation and memory cost since the calculation of multi-head SA has quadratic complexity in the number of patches.


Previous efforts to reduce the computational burden of spatiotemporal multi-head SA mainly focus on factorizing it into the spatial and temporal domains and computing them separately [2, 1]. For example, Timesformer [2] first applies spatial-only SA and then temporal-only SA in a transformer encoder. ViViT [1] adds a few temporal-only transformer encoders after the spatial-only encoders. However, these factorization methods introduce additional parameters and computation for the temporal SA calculation compared with a spatial-only transformer network.
Given the above discussions, an interesting question is: can we endow 2D transformers with the capability of temporal SA modeling without additional parameters and computational cost? To answer this question, we propose a Temporal Patch Shift (TPS) method for efficient spatiotemporal SA feature learning. In TPS, specific mosaic patterns are designed for patch shifting along the temporal dimension. The TPS operation is placed before the SA layer. For each frame, part of its patches are replaced by patches from neighboring frames, so that the current frame contains information from the temporal domain and the vanilla spatial SA module is extended to a spatiotemporal one. It is worth noting that, although the spatiotemporal self-attention computed by TPS is sparse, the spatiotemporal receptive field naturally expands as TPS layers are stacked. A special case of TPS, where the patches from neighboring frames are shifted using a “Bayer filter” pattern, is shown in the first row of Fig. 1, where we highlight the patches that are shifted. It can be seen that, by replacing half of the patches in the current frame with patches from the previous and next frames, the vanilla spatial SA is upgraded to spatiotemporal SA with a temporal receptive field of 3. In the second row of Fig. 1, we show two examples of visualization for consecutive frames after patch shift, which indicate that the motion of actions can be well captured within a single frame. The contributions of this work are summarized as follows:
- We propose a Temporal Patch Shift (TPS) operator for efficient spatiotemporal SA modeling. TPS is a plug-and-play module and can be easily embedded into many existing 2D transformers without additional parameters and computation costs.
- We present a Patch Shift Transformer (PST) for action recognition by placing TPS before the multi-head SA layer of transformers. The resulting PST is highly cost-effective in both computation and memory.
Extensive experiments on action recognition datasets show that TPS achieves 58.3%, 69.8%, 82.5% and 86.0% top-1 accuracy on Something-something V1 & V2, Kinetics400 and Diving-48, respectively, which is comparable to or better than the best transformer models but with less computation and memory cost.
2 Related works
Action recognition is a challenging yet fundamental problem in computer vision. Many deep learning based methods have been proposed in recent years [37, 14, 34, 32, 37, 43, 30, 6, 28, 47, 46, 20, 13, 40, 39, 4, 31, 22]. Based on the employed network architecture, they can be categorized into CNN-based and transformer-based methods.
CNN-based methods. CNN-based methods typically use 3D convolution [37, 4, 13] or 2D-CNNs with temporal modeling [39, 31, 22] to construct effective backbones for action recognition. For example, C3D [37] trains a VGG-style model with 3D-CNN to learn spatiotemporal features from a video sequence. I3D [4] inflates all the 2D convolution filters of an Inception V1 model [35] into 3D convolutions so that ImageNet pre-trained weights can be exploited for initialization. Slowfast [13] employs a two-stream 3D-CNN model to process frames at different sampling rates and resolutions. Due to the heavy computational burden of 3D-CNNs, many works attempt to enhance 2D-CNNs with temporal modules [39, 31, 22, 26, 20, 38]. P3D [31] factorizes 3D convolution into 1D temporal convolution and 2D spatial convolution. TSM [22] presents an efficient shift module, which utilizes left and right shifts of sub-channels to substitute a group-wise, weight-fixed 1D temporal convolution. TEA [20] employs motion excitation and multiple temporal aggregation to capture motion information and increase the temporal receptive field. TEINet [26] uses a motion enhancement module and depth-wise 1D convolution for efficient temporal modeling. However, CNN-based methods cannot effectively model long-range dependencies within or across frames, which limits their performance.
Transformer-based methods. Recently, with the advancement of transformers in 2D vision tasks [8, 23, 44, 16, 36, 5], many attempts have been made to apply transformers to video action recognition [2, 1, 24, 45]. Different from CNNs, whose temporal modeling is mainly implemented by 3D convolution or its factorized versions, transformers naturally introduce spatiotemporal SA to explore spatiotemporal correlations in videos. Intuitively, one can use all the spatiotemporal patches to directly compute the SA. However, this operation is very expensive in computation and memory. Many works have thus been proposed to reduce the computation burden of joint spatiotemporal SA modeling. Timesformer [2] adopts divided space-time SA, which adds temporal SA after each spatial SA. ViViT [1] increases the temporal modeling capability by adding several temporal transformer encoders on top of the spatial encoders. Video swin transformer [24] reduces both the spatial and temporal dimensions by using spatiotemporal local windows. Inspired by the temporal modeling methods in CNNs, TokenShift [45] enhances ViT for temporal modeling by applying partial channel shifting on class tokens.
The success of temporal modules in 2D-CNNs [31, 22, 26, 20, 38] motivates us to develop TPS to endow a spatial transformer with spatiotemporal feature learning capability. Our work shares the spirit of TokenShift [45] in enhancing the temporal modeling ability of transformers without extra parameters and computation cost. However, our TPS is essentially different from TokenShift. TPS models spatiotemporal SA, while TokenShift is a direct application of TSM to the transformer framework, which is in nature spatial SA with “temporally mixed tokens”. In addition, TPS does not rely on the class token (which does not exist in many recent transformer models [23]) and operates directly on patches, making it applicable to most recent transformer models.
3 Methodology
In this section, we present in detail the proposed Temporal Patch Shift (TPS) method, which aims to turn a spatial-only transformer model into one with spatiotemporal modeling capability. TPS is a plug-and-play operation that can be inserted into transformers with no extra parameters and little extra computation cost. In the following, we first describe how to build a visual transformer for videos, and then introduce the design of TPS for action recognition.
3.1 Video-based Vision Transformer
The video-based transformer can be built by extending the image-based ViT [8]. A video clip is divided into non-overlapping 3D patches, which are flattened into vectors $x_{t,p}$, with $t$ denoting the temporal index ($1 \le t \le T$) and $p$ denoting the spatial index ($1 \le p \le N$). The video patches are then mapped to visual tokens with a linear embedding layer

$$z^{0}_{t,p} = E\,x_{t,p} + e^{pos}_{t,p}, \qquad (1)$$

where $E$ is the weight of the linear layer, $e^{pos}_{t,p}$ is a learnable spatiotemporal positional embedding, and $z^{0}_{t,p}$ represents the input spatiotemporal token at location $(t,p)$ for the transformer. We denote the whole input sequence by $z^{0}$.
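For concreteness, a minimal sketch of this tokenization step, assuming a Conv3d-based tubelet embedding (the patch size, frame count and embedding dimension below are illustrative, not the exact configuration used in this paper):

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Sketch of Eq. (1): map non-overlapping 3D patches to visual tokens."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=(2, 4, 4),
                 num_frames=32, img_size=224):
        super().__init__()
        # A strided Conv3d is equivalent to flattening each 3D patch and
        # applying the shared linear layer E.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        t = num_frames // patch_size[0]
        n = (img_size // patch_size[1]) * (img_size // patch_size[2])
        # Learnable spatiotemporal positional embedding e^{pos}_{t,p}.
        self.pos_embed = nn.Parameter(torch.zeros(1, t, n, embed_dim))

    def forward(self, x):                        # x: (B, C, T0, H0, W0)
        z = self.proj(x)                         # (B, D, T, H', W')
        z = z.flatten(3).permute(0, 2, 3, 1)     # (B, T, N, D)
        return z + self.pos_embed                # z^0

tokens = TubeletEmbed()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)                              # torch.Size([1, 16, 3136, 96])
```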
Suppose that a visual transformer contains $L$ encoder blocks, each consisting of multi-head SA, Layer Normalization (LN) and an FFN. The transformer encoder can be represented as follows:
$$y^{l} = \mathrm{MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \qquad z^{l} = \mathrm{FFN}(\mathrm{LN}(y^{l})) + y^{l}, \qquad (2)$$
where $y^{l}$ and $z^{l}$ denote the output features of the SA module and the FFN module for block $l$, respectively. The multi-head SA is computed as follows (LN is omitted for convenience):
$$\mathrm{SA}(z^{l}) = \mathrm{Softmax}\!\left(\frac{Q^{l}(K^{l})^{\top}}{\sqrt{d}}\right)V^{l}, \qquad Q^{l} = z^{l}W_{Q},\; K^{l} = z^{l}W_{K},\; V^{l} = z^{l}W_{V}, \qquad (3)$$
where $Q^{l}$, $K^{l}$ and $V^{l}$ represent the query, key and value matrices for block $l$, and $W_{Q}$, $W_{K}$ and $W_{V}$ are the weights for the linear mappings, respectively. $d$ is the scaling factor, which equals the query/key dimension.
Following the transformer encoders, temporal and spatial averaging (or temporal averaging only if class tokens are used) is performed to obtain a single feature, which is then fed into a linear classifier. The major computation burden of transformers comes from the SA computation. Note that when full spatiotemporal SA is applied, the complexity of the attention operation is $O(T^{2}N^{2})$, while spatial-only attention costs $O(TN^{2})$ in total. Next, we show how to turn a spatial-only SA operator into a spatiotemporal one with TPS.
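A quick back-of-the-envelope check of these two complexities (the frame and patch counts are illustrative):

```python
# Size of the attention matrix (token pairs), ignoring channels and heads:
# joint spatiotemporal SA attends over all T*N tokens at once, while
# spatial-only SA attends over N tokens per frame, for each of the T frames.
T, N = 16, 56 * 56            # e.g. 16 temporal slices of 56x56 patches

joint = (T * N) ** 2          # O(T^2 N^2)
spatial_only = T * N ** 2     # O(T N^2)

print(f"joint / spatial-only = {joint / spatial_only:.0f}x")   # 16x, i.e. a factor of T
```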
3.2 Temporal Patch Shift
Generic shift operation. We first define a generic temporal shift operation in transformers as follows:
$$\hat{z}_{t} = M \odot z_{t'} + (\mathbf{1} - M) \odot z_{t}, \qquad M = [m_{1}; m_{2}; \ldots; m_{N}], \qquad (4)$$
where $z_{t}$ and $z_{t'}$ represent the patch features of the current frame $t$ and another frame $t'$, respectively, $N$ is the number of patches, and $M$ represents the matrix of shift channels, with $m_{p}$ the vector of channel shifts for patch $p$, each element of which is equal to 0 or 1. $\hat{z}_{t}$ denotes the output patches after the shift operation.
TSM [22] uses a space-invariant channel shift for temporal modeling, which is a special case of our patch shift operation obtained by shifting $\gamma$ percent of the channels ($\gamma$ percent of the elements in every $m_{p}$ are equal to 1), where $\gamma$ is a constant for all patches. In our case, we mainly explore temporal–spatial mixing and shift patches in a space-variant manner, where $m_{p} = \mathbf{0}$ or $m_{p} = \mathbf{1}$, i.e., a patch is either kept or shifted as a whole. To reduce the mixing space, a shift pattern is introduced, which is applied repeatedly in a sliding-window manner to cover all patches. For example, a pattern that shifts one patch for every two patches corresponds to setting $m_{p} = \mathbf{1}$ for every other patch in Eq. (4). In practice, 2D shift patterns are designed for video data, which will be discussed in detail in the section below.
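A minimal sketch of Eq. (4) as a masked mix between two frames; the two masks below illustrate the space-variant patch shift and the space-invariant (TSM-like) channel shift as special cases, and the specific patterns are for illustration only:

```python
import torch

def generic_temporal_shift(z, m, offset=1):
    """Eq. (4): replace the masked entries of frame t with those of frame t' = t + offset.

    z: (B, T, N, C) patch features; m: (N, C) binary mask, row p is m_p.
    The temporal boundary is handled cyclically here.
    """
    z_other = torch.roll(z, shifts=-offset, dims=1)   # z_{t'}
    return m * z_other + (1.0 - m) * z                # \hat{z}_t

B, T, N, C = 2, 8, 6, 4
z = torch.randn(B, T, N, C)

# Space-variant patch shift: every other patch is shifted as a whole (m_p = 1).
m_patch = torch.zeros(N, C)
m_patch[::2] = 1.0

# Space-invariant channel shift (TSM-like): the same fraction of channels
# is shifted for every patch (a constant ratio gamma).
m_channel = torch.zeros(N, C)
m_channel[:, : C // 4] = 1.0

print(generic_temporal_shift(z, m_patch).shape)       # torch.Size([2, 8, 6, 4])
```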
Patch shift SA. By using the proposed patch shift operation, we can turn spatial-only SA into spatiotemporal SA. Given video patches $z^{l}$, the PatchShift function shifts the patches of each frame along the temporal dimension with a pattern. As only part of the patches are shifted in each frame, patches from different frames are present in the current frame, and therefore spatial-only SA naturally turns into a sparse spatiotemporal SA. After SA, the patches from different frames are shifted back to their original locations. We follow [23] to add a relative position bias, with an extension to 3D positions. To keep track of the shifted patches, the 3D positions are shifted alongside. With PatchShift, the multi-head SA is computed as:
$$(\hat{z}^{l}, \hat{B}) = \mathrm{PatchShift}(z^{l}, B), \qquad \mathrm{SA}(\hat{z}^{l}) = \mathrm{Softmax}\!\left(\frac{\hat{Q}^{l}(\hat{K}^{l})^{\top}}{\sqrt{d}} + \hat{B}\right)\hat{V}^{l}, \qquad (5)$$
where $B$, $\hat{B}$ and $z^{l}$, $\hat{z}^{l}$ represent the relative position bias indices and the patches before and after PatchShift, respectively, and $\hat{B}$ is the resulting bias matrix; $\hat{Q}^{l}$, $\hat{K}^{l}$ and $\hat{V}^{l}$ are computed from $\hat{z}^{l}$ as in Eq. (3).
Table 1: Complexity comparison of different SA mechanisms.
Attention | SA-Complexity
---|---
Joint | $O(T^{2}N^{2})$
Divide | $O(TN^{2} + T^{2}N)$
Sparse/Local | between $O(TN^{2})$ and $O(T^{2}N^{2})$, depending on the subsampling/window size
PatchShift | $O(TN^{2})$
Patch shift patterns. As mentioned before, in order to reduce the design space, our strategy is to employ repeated shift patterns, as they can scale up to different input sizes and are easy to implement. We adopt the following pattern design principles: a) Even spatial distribution. For each pattern, we uniformly sample the patches from the same frame to ensure they are evenly distributed. b) Reasonably large temporal receptive field. The temporal receptive field is set large enough to aggregate more temporal information. c) Adequate shift percentage. A higher percentage of shifted patches encourages more inter-frame information communication. Various spatiotemporal SA models can be implemented with different patch shift patterns. We show several instantiations of shift patterns in Fig. 5. The numbers represent the indices of the frames that the patches are from, where “0”, “-” and “+” indicate the current, previous and next frames, respectively. Pattern (a) shifts a single patch from the next frame to the center of the current frame. Pattern (b) shifts patches with a “Bayer filter” like pattern from the previous and next frames. Pattern (c) shifts patches with a temporal field of 9, with the patch from the current frame in the center and patches from the previous and next 4 frames around it. For window sizes larger than the pattern size, we spatially repeat the pattern to cover all patches. We use the cyclic padding in [23] for patches that exceed the temporal boundary. In the experiment section, we will discuss the design of shift patterns with extensive ablation studies.
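To make the pattern tiling concrete, the snippet below builds temporal-offset grids in the spirit of patterns (b) and (c) and repeats them over an attention window; the exact placement of the offsets within each cell is one plausible reading of Fig. 5, not a verified layout:

```python
import numpy as np

# Temporal offsets: 0 = current frame, -k / +k = k frames before / after.
pattern_b = np.array([[0, -1],
                      [1,  0]])              # "Bayer filter" like, temporal field 3

pattern_c = np.array([[-4, -3, -2],
                      [-1,  0,  1],
                      [ 2,  3,  4]])         # temporal field 9

def tile_pattern(pattern, window_hw):
    """Repeat a small shift pattern to cover an (H, W) window of patches."""
    H, W = window_hw
    ph, pw = pattern.shape
    reps = (-(-H // ph), -(-W // pw))        # ceil division
    return np.tile(pattern, reps)[:H, :W]

print(tile_pattern(pattern_c, (7, 7)))       # offsets for a 7x7 attention window
```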
Patch shift is an efficient spatiotemporal SA modeling method for transformers as it only costs $O(TN^{2})$ complexity in both computation and memory, which is much less than the $O(T^{2}N^{2})$ of “Joint” spatiotemporal SA, where $N$ and $T$ are the spatial and temporal dimensions of the patches. Patch shift is also more efficient than other factorized attention methods such as “Divide” [2, 1] (apply spatial-only and then temporal-only SA) and “Sparse/Local” [2, 24] (subsample in the spatial or temporal dimension). The complexity comparison of different SA models is given in Table 1.
[Fig. 3: Patch shift and channel shift applied to three consecutive frames.]
Discussions on patch and channel shifts. Patch shift and channel shift are two zero-parameter and low-cost temporal modeling methods. Patch shift is space-wise sparse and channel-wise dense, while channel shift is the opposite. We show an example of applying patch shift and channel shift to three consecutive frames in Fig. 3. Here the shift operations are applied directly on RGB images for visualization, while in practice we apply them on feature maps. In the output image of patch shifting, the patches marked “-”, “0” and “+” are from frames $t-1$, $t$ and $t+1$, respectively. For channel shift, the red, green and blue channels represent frames $t-1$, $t$ and $t+1$, respectively. The output of the shift operation for frame $t$ is shown in the last column. As we can see from the figure, both patch shift and channel shift can capture the motion of actions. Patch shift is spatially sparse while keeping the global channel information of each patch. In contrast, channel shift uses partial channel information in exchange for temporal information from other frames. Previous studies on vision transformers [5] have shown that feature channels encode activations of different patterns or objects. Therefore, replacing partial feature channels of the current frame with those of other frames could lose important information about patterns/objects of interest. In comparison, patch shift retains the full channel information of the shifted patches. When patch shift is employed for SA modeling, it builds a sparse spatiotemporal SA with 3D relations among patches. In comparison, channel shift can be viewed as a “mix-patch” operation, by which temporal information is fused in each patch with shared 2D SA weights. Patch shift and channel shift perform shifting operations in orthogonal directions, and they are complementary in nature.
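For reference, a TSM-style channel shift can be sketched as follows; the fold_div value and the zero-padded boundary handling follow the common practice of [22] and are assumptions rather than the exact TCS implementation used in PST:

```python
import torch

def channel_shift(z, fold_div=8):
    """For each frame, replace 1/fold_div of the channels with those of the
    next frame and another 1/fold_div with those of the previous frame
    (zero-padded at the clip boundary).

    z: (B, T, N, C) patch features.
    """
    B, T, N, C = z.shape
    fold = C // fold_div
    out = z.clone()
    out[:, :-1, :, :fold]        = z[:, 1:, :, :fold]             # from next frame
    out[:, 1:, :, fold:2 * fold] = z[:, :-1, :, fold:2 * fold]    # from previous frame
    return out

print(channel_shift(torch.randn(2, 8, 49, 96)).shape)   # torch.Size([2, 8, 49, 96])
```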
3.3 Patch Shift Transformer
Based on the proposed TPS, we can build a Patch Shift Transformer (PST) for efficient and effective spatiotemporal feature learning. A PST can be built by inserting TPS into the SA modules of off-the-shelf 2D transformer blocks. Therefore, our model can directly benefit from models pre-trained on 2D recognition tasks. The details of the Temporal Patch Shift block (TPS block for short) are shown in Fig. 4. TPS turns spatial-only SA into spatiotemporal SA by aggregating information of patches from other temporal frames. However, it gathers information in a sparse manner and sacrifices SA within frames. To alleviate this problem, we insert one TPS block for every two SA modules (“alternative shift” for short) so that spatial-only SA and spatiotemporal SA work in turns to approximate full spatiotemporal SA.
We further improve the temporal modeling ability of the spatial-only SA blocks with channel shift. Specifically, part of the channels of each patch are replaced with those from the previous or next frame. We call this block a temporal channel shift (TCS) block. The final PST consists of both TPS and TCS blocks, as sketched below. We also implement a channel-only PST for comparison to unveil the benefits of patch shift.
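A minimal sketch of this block alternation, with placeholder factories standing in for the actual transformer blocks equipped with patch shift (TPS) and channel shift (TCS); nn.Identity is used only to make the sketch runnable:

```python
import torch
import torch.nn as nn

class PSTStage(nn.Module):
    """One TPS block for every two SA blocks; the remaining blocks use TCS."""
    def __init__(self, make_tps_block, make_tcs_block, depth):
        super().__init__()
        self.blocks = nn.ModuleList([
            make_tps_block() if i % 2 == 1 else make_tcs_block()
            for i in range(depth)
        ])

    def forward(self, z):
        for blk in self.blocks:
            z = blk(z)
        return z

stage = PSTStage(lambda: nn.Identity(), lambda: nn.Identity(), depth=4)
print(stage(torch.randn(2, 8, 49, 96)).shape)     # torch.Size([2, 8, 49, 96])
```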
[Fig. 4: Illustration of the TPS block.]
4 Experiments
4.1 Dataset
Something-Something V1 & V2 [15] are large collections of video clips containing daily actions interacting with common objects. They focus on object motion without differentiating the manipulated objects. V1 includes 108,499 video clips, while V2 includes 220,847 video clips; both have 174 classes. Kinetics400 [17] is a large-scale action recognition dataset, which contains 400 human action classes with at least 400 video clips per class. Each clip is collected from a YouTube video and trimmed to around 10s. Diving-48 V2 [21] is a fine-grained video dataset of competitive diving, consisting of about 18k trimmed video clips of 48 unambiguous dive sequences. This dataset is challenging for modern action recognition systems as it reduces background bias and requires modeling of long-term temporal dynamics. We use the manually cleaned V2 annotations, which contain 15,027 videos for training and 1,970 videos for testing.
4.2 Experiment setup
Models. We choose the Swin transformer [23] as our backbone network and develop PST-T and PST-B based on the Swin-Tiny and Swin-Base backbones, respectively, which keep nearly the same model size and FLOPs as their 2D counterparts. In the ablation study, we use Swin-Tiny considering its good trade-off between performance and efficiency. We adopt 32 frames as input and the tubelet embedding strategy in ViViT [1] by default. As PST-T and PST-B are efficient models, when comparing with state-of-the-art methods we also introduce variants of PST-T and PST-B that double the temporal attention window to 2, with slightly increased computation.
Training. For all the datasets, we first resize the short side of the raw frames and then apply center cropping. During training, we follow [24] and use random flip and AutoAugment [7] for augmentation. We utilize AdamW [27] with a cosine learning rate schedule for network training. For PST-T, the warmup epochs, total epochs, stochastic depth rate, weight decay and batch size are set to 2.5, 30, 0.1, 0.02 and 64, respectively. For the larger model PST-B, the drop path rate and weight decay are increased to 0.2 and 0.05, respectively.
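As a rough illustration of this optimization setup, a warmup-plus-cosine schedule with AdamW can be sketched as below; the base learning rate and the iteration counts are placeholders, since they are not specified here:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(96, 174)                      # stand-in for the real network
optimizer = AdamW(model.parameters(), lr=1e-3,        # 1e-3 is a placeholder value
                  weight_decay=0.02)

iters_per_epoch, epochs, warmup_epochs = 100, 30, 2.5  # iters_per_epoch is illustrative
warmup_iters = int(warmup_epochs * iters_per_epoch)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_iters),
        CosineAnnealingLR(optimizer, T_max=epochs * iters_per_epoch - warmup_iters),
    ],
    milestones=[warmup_iters],
)
# scheduler.step() would then be called once per training iteration.
```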
Testing. For a fair comparison, we follow the testing strategies of previous state-of-the-art methods and report results under two different sampling strategies. On Something-something V1 & V2 and Diving-48 V2, uniform sampling and center-crop (or three-crop) testing are adopted. On Kinetics400, we adopt the dense sampling strategy as in [1] with 4 clips and three-crop testing.
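The crops × clips protocol amounts to averaging the prediction over all views of a video; a minimal sketch (averaging softmax scores is a common convention, and the dummy model below is only to make the snippet runnable):

```python
import torch

def multi_view_inference(model, views):
    """views: (V, C, T, H, W) tensor holding all crops/clips of one video."""
    with torch.no_grad():
        logits = model(views)                       # (V, num_classes)
        return logits.softmax(dim=-1).mean(dim=0)   # (num_classes,)

dummy_model = lambda x: torch.randn(x.shape[0], 174)            # placeholder network
scores = multi_view_inference(dummy_model,
                              torch.randn(12, 3, 8, 112, 112))  # 4 clips x 3 crops
print(scores.shape)                                             # torch.Size([174])
```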
4.3 Ablation study
To investigate the design of patch shift patterns and the use of TPS blocks, we conduct a series of experiments in this section. All the experiments are conducted on Something-something V1 with Swin-Tiny as the backbone (IN-1K pretrained). For the experiments on the design of patch shift, we use the patch-only PST for clarity.
The number and distribution of shifted patches. We start with a simple experiment by shifting only one center patch along the temporal dimension within each window. It can be seen from Table 2(a) that this simple shift pattern (center-one) brings significant improvements (+4.7% on top-1) over the model without the shifting operation (none). We then increase the number of shifted patches to 1/2 of the total patches, however with an uneven distribution (shifting only the left half of the patches). This uneven shift pattern does not improve over the simple “center-one” pattern. However, when the shifted patches are distributed evenly within the window (even-2), the performance increases by 0.9% on top-1. This indicates that the shifted and non-shifted patches should be distributed evenly. It is also found that a large temporal field is helpful. When we increase the temporal field to 3 by shifting 1/4 of the patches to the previous frame and 1/4 to the next frame (even-3), the performance is improved by 2.4% on top-1 over shifting patches in only one temporal direction.
Table 2(a): Distribution of shifted patches.
Distribution | Top-1 | Top-5
---|---|---
None | 40.6 | 71.4
Center-one | 45.3 | 75.1
Uneven | 45.3 | 75.5
Even-2 | 46.2 | 76.1
Even-3 | 48.6 | 77.8
Table 2(b): Patch shift patterns with different temporal fields.
Pattern | Top-1 | Top-5
---|---|---
A-3 | 48.6 | 77.8
B-4 | 50.7 | 79.3
C-9 | 51.8 | 80.3
D-16 | 50.0 | 79.5
Table 2(c): Number of stages equipped with TPS blocks.
Stages with TPS | Top-1 | Top-5
---|---|---
1 stage | 47.3 | 77.0
2 stages | 48.4 | 77.6
3 stages | 50.4 | 79.1
All 4 stages | 51.8 | 80.3
Table 2(d): Effects of shift back, alternative shift and shift RPE (✓ = used, ✗ = removed).
Shift back | Alternative | Shift RPE | Top-1 | Top-5
---|---|---|---|---
✗ | ✓ | ✓ | 47.3 | 77.0
✓ | ✗ | ✓ | 46.4 | 76.6
✓ | ✓ | ✗ | 46.1 | 76.0
✓ | ✓ | ✓ | 51.8 | 80.3
Table 2(e): Comparison with other temporal modeling methods.
Method | FLOPs | Memory | Top-1 | Top-5
---|---|---|---|---
Avgpool | 72G | 3.7G | 40.6 | 71.4
Joint | 106G | 20.2G | 51.5 | 80.0
Local | 88G | 11G | 49.9 | 79.2
Sparse | 72G | 4.0G | 42.7 | 74.0
Channel-only | 72G | 3.7G | 51.2 | 79.7
Patch-only | 72G | 3.7G | 51.8 | 80.3
PST | 72G | 3.7G | 52.2 | 80.3
Patch shift patterns. Based on the experiments in Table 2(a), we design several patch shift patterns with various temporal fields. The patches from different frames are distributed evenly within the window. Pattern A, with temporal field 3, is shown in Fig. 5(b) and is “Bayer filter” like. Pattern B, with temporal field 4, is implemented by replacing 1/4 of the index-“0” frame patches in pattern A with index-“2” frame patches. Pattern C is shown in Fig. 5(c). Pattern D is designed by placing patches from 16 consecutive frames in a 4×4 pattern grid. More details can be found in the supplementary materials. We can see from Table 2(b) that the performance of TPS gradually increases as the temporal field grows. The best performance is reached at a temporal field of 9, which achieves a good balance between the spatial and temporal dimensions. We use this pattern for the rest of our experiments.
The number of stages using TPS blocks. As can be seen in Table 2(c), by using TPS blocks in more stages of the network, the performance gradually increases. The model achieves the best performance when all the stages are equipped with TPS blocks.
Shift back, alternative shift and shift RPE. The shift back operation recovers the patches’ locations and keeps the frame structure complete. Alternative shift is described in Section 3.3 for building connections between patches. Shift RPE indicates whether the relative positions are shifted alongside the patches. As shown in Table 2(d), removing each of them decreases the top-1 accuracy by 4.5%, 5.4% and 5.7%, respectively. Therefore, we use all of these operations in our model.
Comparison with other temporal modeling methods. In Table 2(e), we compare PST with other designs of spatiotemporal attention in terms of FLOPs, peak training memory consumption and accuracy. “Avgpool” achieves only 40.6% top-1 accuracy since it cannot distinguish temporal ordering. Joint spatiotemporal SA achieves good performance at the price of high computation and memory cost. Local spatiotemporal attention [24] applies SA within a local 3D window. It reduces the computation and memory cost, but at the price of a performance drop compared with “Joint”. We also implement a sparse spatiotemporal SA by subsampling the spatial patches by half while keeping the full temporal dimension. It performs poorly as only part of the patches participate in the SA computation in each layer. We implement a channel-only PST with the same channel shift ratio as in [22]. This channel-only PST also achieves strong performance.
Patch-only PST outperforms all the other temporal modeling methods without additional parameters or FLOPs. The best performance comes from combining channel-only and patch-only PST, which indicates that they are complementary in nature. Specifically, PST exceeds joint spatiotemporal SA with much less computation and only about 1/5 of the memory usage. It also outperforms other efficient spatiotemporal SA models with less computation and memory cost.
Table 3: Comparison with state-of-the-art methods on Something-something V1 & V2.
Model | Pretrain | Crops × Clips | FLOPs | Params | Sthv1 Top-1 | Sthv1 Top-5 | Sthv2 Top-1 | Sthv2 Top-5
---|---|---|---|---|---|---|---|---
TSM [22] | K400 | 65G | 24.3M | - | - | 63.4 | 88.5 | |
TEINet [26] | IN-1K | 66G | 30.4M | 49.9 | - | 62.1 | - | |
TEA [20] | IN-1K | 70G | 24.3M | 51.9 | 80.3 | - | - | |
TDN [38] | IN-1K | 24.8M | 53.9 | 82.1 | 65.3 | 89.5 | ||
ACTION-Net [42] | IN-1K | 70G | 28.1M | - | - | 64.0 | 89.3 | |
SlowFast R101, 8x8 [13] | K400 | 106G | 53.3M | - | - | 63.1 | 87.6 | |
MSNet [18] | IN-1K | 101G | 24.6M | 52.1 | 82.3 | 64.7 | 89.4 | |
blVNet [11] | IN-1K | 40.2M | - | - | 65.2 | 90.3 | ||
Timesformer-HR [2] | IN-21K | 1703G | 121.4M | - | - | 62.5 | - | |
ViViT-L/16x2 [1] | IN-21K | 903G | 352.1M | - | - | 65.9 | 89.9 | |
MViT-B, 64×3 [9] | K400 | 455G | 36.6M | - | - | 67.7 | 90.9 | |
Mformer-L [29] | K400 | 1185G | 86M | - | - | 68.1 | 91.2 | |
X-ViT [3] | IN-21K | 283G | 92M | - | - | 66.2 | 90.6 | |
SIFAR-L [10] | K400 | 576G | 196M | - | - | 64.2 | 88.4 | |
Video-Swin [25] | K400 | 321G | 88.1M | - | - | 69.6 | 92.7 | |
PST-T | IN-1K | 72G | 28.5M | 52.2 | 80.3 | 65.7 | 90.2 | |
IN-1K | 52.8 | 80.5 | 66.4 | 90.2 | ||||
K400 | 53.2 | 82.2 | 66.7 | 90.6 | ||||
K400 | 53.6 | 82.2 | 67.3 | 90.5 | ||||
PST-T | K400 | 74G | 54.0 | 82.3 | 67.9 | 90.8 | ||
PST-B | IN-21K | 247G | 88.8M | 55.3 | 81.9 | 66.7 | 90.7 | |
IN-21K | 55.6 | 82.2 | 67.4 | 90.9 | ||||
K400 | 57.4 | 83.2 | 68.7 | 91.3 | ||||
K400 | 57.7 | 83.4 | 69.2 | 91.9 | ||||
PST-B | K400 | 252G | 58.3 | 83.9 | 69.8 | 93.0 |
4.4 Comparison with SOTA
Something-something V1 & V2. The performance statistics on Something-something V1 & V2, including the pretraining dataset, classification results, inference protocols, and the corresponding FLOPs and parameter numbers, are shown in Table 3.
The first compartment contains methods based on 3D CNNs or factorized (2+1)D CNNs. Using the efficient inference protocol (16 frames, 1 clip and center crop) and ImageNet-1K pretraining, PST-T obtains 52.2% top-1 accuracy on V1 and 65.7% on V2, outperforming all the CNN-based methods with similar FLOPs. Note that TDN [38] uses a different sampling strategy from the other methods, which requires 5 times more frames. Even so, PST-T still outperforms TDN on the larger Something-something V2.
The second compartment contains transformer-based methods. Our small model PST-T pretrained on Kinetics400 offers competitive performance on Something-something V2 compared with these methods. PST-B is a larger model pretrained on ImageNet-21K/Kinetics400. PST-B achieves 57.4% and 68.7% top-1 accuracy on V1 & V2, outperforming Timesformer-HR [2], ViViT-L [1] and MViT-B [9] at a lower computation or parameter cost. PST-B also outperforms Mformer-L [29] and X-ViT [3], two recently proposed efficient temporal modeling methods that apply trajectory self-attention and local spatiotemporal self-attention, respectively. With the doubled temporal window, PST-B further reaches 58.3% and 69.8% on V1 & V2, outperforming the other transformer-based methods. Note that SIFAR-L [10] uses the larger backbone Swin-L, yet its performance is less satisfactory. PST-B also outperforms Video-Swin [24], which uses the full temporal window on this dataset. The performance of the PST family on Something-something V1 & V2 confirms its remarkable ability for spatiotemporal modeling.
Kinetics400. We report our results on the scene-focused Kinetics400 in Table 4 and compare them with previous state-of-the-art methods. As shown in Table 4, PST-T achieves 78.2% top-1 accuracy and outperforms the majority of (2+1)D CNN-based methods such as TSM [22], TEINet [26], TEA [20] and TDN [38] with fewer total FLOPs. Our larger model PST-B achieves 81.8% top-1 accuracy, which outperforms strong 3D-CNN counterparts such as SlowFast [13] and X3D [12].
Compared with transformer-based methods such as Timesformer [2] and ViViT-L [1], our PST-B achieves higher accuracy with much less computation overhead. With the doubled temporal window, PST-B also outperforms SIFAR-L [10], which adopts the larger Swin-L as its backbone network, and achieves on-par performance with the recently developed Video-Swin [25] at a lower computation cost.
Table 4: Comparison with state-of-the-art methods on Kinetics400.
Model | Pretrain | Crops × Clips | FLOPs | Params | Top-1 | Top-5
---|---|---|---|---|---|---
I3D [4] | IN-1K | 108G | 28.0M | 72.1 | 90.3 | |
NL-I3D [41] | IN-1K | 32G | 35.3M | 77.7 | 93.3 | |
CoST [19] | IN-1K | 33G | 35.3M | 77.5 | 93.2 | |
SlowFast-R50 [13] | IN-1K | 36G | 32.4M | 75.6 | 92.1 | |
X3D-XL [12] | - | 48G | 11.0M | 79.1 | 93.9 | |
TSM [22] | IN-1K | 65G | 24.3M | 74.7 | 91.4 | |
TEINet [26] | IN-1K | 66G | 30.4M | 76.2 | 92.5 | |
TEA [20] | IN-1K | 70G | 24.3M | 76.1 | 92.5 | |
TDN [38] | IN-1K | 24.8M | 77.5 | 93.2 | ||
Timesformer-L [2] | IN-21K | 2380G | 121.4M | 80.7 | 94.7 | |
ViViT-L/16x2 [1] | IN-21K | 3980G | 310.8M | 81.7 | 93.8 | |
X-ViT [3] | IN-21K | 283G | 92M | 80.2 | 94.7 | |
MViT-B, 32×3 [9] | IN-21K | 170G | 36.6M | 80.2 | 94.4 | |
MViT-B, 64×3 [9] | IN-21K | 455G | 36.6M | 81.2 | 95.1 | |
Mformer-HR [29] | K400 | 959G | 86M | 81.1 | 95.2 | |
TokenShift-HR [45] | IN-21K | 2096G | 303.4M | 80.4 | 94.5 | |
SIFAR-L [10] | IN-21K | 576G | 196M | 82.2 | 95.1 | |
Video-Swin [24] | IN-21K | 282G | 88.1M | 82.7 | 95.5 | |
PST-T | IN-1K | 72G | 28.5M | 78.2 | 92.2 | |
PST-T | IN-1K | 74G | 28.5M | 78.6 | 93.5 | |
PST-B | IN-21K | 247G | 88.8M | 81.8 | 95.4 | |
PST-B | IN-21K | 252G | 88.8M | 82.5 | 95.6 |
Diving-48 V2. In Table 5, we further evaluate our method on the fine-grained action dataset Diving-48. As the older version of Diving-48 has labeling errors, we compare our method with the SlowFast [13] and Timesformer results reproduced in [2] on V2. PST-T achieves 79.2% top-1 accuracy with ImageNet-1K pretraining while costing much fewer computation resources than SlowFast, Timesformer and Timesformer-HR. When pretrained on Kinetics400, PST-T also outperforms Timesformer-L. Our larger model PST-B outperforms Timesformer-L by a large margin of 2.6% when both are pretrained on ImageNet-21K, with about 9.7× fewer FLOPs. When pretrained on Kinetics400, PST-B achieves 85.0% top-1 accuracy. The doubled-window PST-T and PST-B variants further boost the performance to 82.1% and 86.0%, respectively.
Table 5: Comparison with state-of-the-art methods on Diving-48 V2.
Model | Pretrain | Crops × Clips | FLOPs | Params | Top-1 | Top-5
---|---|---|---|---|---|---
SlowFast R101, 8x8 [13] | K400 | 106G | 53.3M | 77.6 | - | |
Timesformer [2] | IN-21K | 196G | 121.4M | 74.9 | - | |
Timesformer-HR [2] | IN-21K | 1703G | 121.4M | 78.0 | - | |
Timesformer-L [2] | IN-21K | 2380G | 121.4M | 81.0 | - | |
PST-T | IN-1K | 72G | 28.5M | 79.2 | 98.2 | |
K400 | 72G | 81.2 | 98.7 | |||
PST-T | K400 | 74G | 82.1 | 98.6 | ||
PST-B | IN-21K | 247G | 88.1M | 83.6 | 98.5 | |
K400 | 85.0 | 98.6 | ||||
PST-B | K400 | 252G | 86.0 | 98.6 |
Latency, throughput and memory. The inference latency, throughput (videos/second) and inference memory consumption are important for action recognition in practice. We measured them on a single NVIDIA Tesla V100 GPU, using a batch size of 1 for latency and a batch size of 4 for throughput. Accuracy is tested on the validation sets of Something-something V1 & V2. The results are shown in Table 6. We make three observations: (1) Compared with the baseline “2D Swin-T”, which uses a 2D transformer with average pooling for temporal modeling, PST-T has slightly higher latency and the same memory consumption, but much better performance (11.6% and 9.0% top-1 improvements on Sthv1 & v2, respectively). (2) Compared with “Video-Swin-T”, which utilizes full-scale temporal information, PST-T uses much less memory and has about 2 times faster inference speed, while achieving very competitive performance. (3) The larger model PST-B achieves better performance than Video-Swin-B while using less inference memory and less time.
Table 6: Latency, throughput and memory comparison.
Methods | FLOPs | Param | Memory | Latency | Throughput | Sthv1 Top-1 | Sthv1 Top-5 | Sthv2 Top-1 | Sthv2 Top-5
---|---|---|---|---|---|---|---|---|---
2D Swin-T | 72G | - | 1.7G | 29ms | 35.5 v/s | 40.6 | 71.4 | 56.7 | 84.1
Video-Swin-T [24] | 106G | 28.5M | 3.0G | 62ms | 17.7 v/s | 51.5 | 80.0 | 65.7 | 90.1
PST-T | 72G | 28.5M | 1.7G | 31ms | 34.7 v/s | 52.2 | 80.3 | 65.7 | 90.2
2D Swin-B | 247G | - | 2.2G | 71ms | 15.5 v/s | - | - | 59.5 | 86.3
Video-Swin-B [24] | 321G | 88.8M | 3.6G | 147ms | 7.9 v/s | - | - | 69.6 | 92.7
PST-B | 252G | 88.8M | 2.4G | 81ms | 13.8 v/s | - | - | 69.8 | 93.0
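For reference, a generic way to obtain such latency numbers with PyTorch is sketched below (an assumed measurement loop, not the authors' exact script); the explicit synchronization is needed so that asynchronous GPU kernels are included in the timing:

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 32, 224, 224), runs=30, warmup=10):
    """Average forward latency in milliseconds for batch size 1."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                 # warm up cudnn / memory allocator
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000.0

print(measure_latency(torch.nn.Identity(), input_shape=(1, 3, 8, 56, 56)))
```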
5 Conclusions
In this paper, we proposed a novel temporal patch shift (TPS) method, which can be inserted into transformer blocks in a plug-and-play manner for efficient and effective spatiotemporal modeling. By comparing the patch shift and channel shift operations under the transformer framework, we showed that they are two efficient temporal modeling methods that are complementary in nature. PST achieved competitive performance compared with previous methods on Something-something V1 & V2 [15], Diving-48 [21] and Kinetics400 [17]. In addition, we presented in-depth ablation studies and visualization analyses of PST to investigate the design principles of patch shift patterns and to explain why it is effective for learning spatiotemporal relations. PST achieves a good balance between accuracy and computational cost for effective action recognition.
Appendix 0.A Appendix
0.A.1 More details on the implementation of TPS
We present the “PyTorch style” pseudo-code of a special case of TPS, where the patches from neighboring frames are shifted using a “Bayer filter” pattern. The patch shift module is placed before the self-attention calculation, and then a shift back operation is employed after the self-attention calculation.
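A minimal PyTorch-style sketch consistent with this description is given below; the tensor layout and the exact checkerboard placement are assumptions for illustration, not the authors' released implementation:

```python
import torch

def patch_shift_bayer(x, reverse=False):
    """x: (B, T, H, W, C) patch features on the spatial grid.

    Half of the patches (a checkerboard) are taken from the previous frame and
    the other half from the next frame; reverse=True undoes the shift. The
    temporal boundary is handled cyclically.
    """
    s = -1 if reverse else 1
    from_prev = torch.roll(x, shifts=s,  dims=1)
    from_next = torch.roll(x, shifts=-s, dims=1)
    out = x.clone()
    out[:, :, 0::2, 1::2, :] = from_prev[:, :, 0::2, 1::2, :]
    out[:, :, 1::2, 0::2, :] = from_next[:, :, 1::2, 0::2, :]
    return out

def attention_with_tps(x, spatial_self_attention):
    """Wrap a vanilla spatial SA call with patch shift / shift back."""
    x = patch_shift_bayer(x)                       # shift before SA
    x = spatial_self_attention(x)                  # unchanged 2D (windowed) SA
    return patch_shift_bayer(x, reverse=True)      # shift back after SA

x = torch.randn(2, 8, 14, 14, 96)
y = attention_with_tps(x, lambda t: t)             # identity SA just to run the sketch
print(torch.allclose(x, y))                        # True: shift + shift-back is lossless
```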
The simple patch shift function described above turns a vanilla spatial self-attention into a spatiotemporal self-attention. Other patterns can be implemented in a similar paradigm. As we use the Swin Transformer [23] as our backbone, the coordinates of the patches are shifted alongside the patches.
0.A.2 More details on the shift patterns
We show Patterns A, B, C and D of Figure 5 here. Pattern B is obtained by replacing 1/4 of the index-0 frame patches in pattern A with index-2 frame patches. Pattern C is obtained by placing patches from 9 consecutive frames in a 3×3 grid. Pattern D is obtained by placing patches from 16 consecutive frames in a 4×4 grid.
[Fig. 5: Patch shift patterns A, B, C and D.]
0.A.3 Additional results on DeiT backbone
TPS is a plug-and-play module that can be embedded into many existing transformers to enhance their spatiotemporal modeling capability. To illustrate the effectiveness of TPS, we further test our method on another popular backbone, DeiT [36]. We use DeiT-S for this experiment, which keeps a single-resolution sequence of image patches at every layer together with an additional class token. We insert a TPS module into every two blocks of DeiT and operate directly on the image patches. We densely sample 16 frames as input for training. Other training and testing configurations are the same as for PST-T.
We compare the proposed DeiT-S-TPS with DeiT-S-2D and DeiT-S-TSM. DeiT-S-2D uses the 2D DeiT-S for frame feature extraction and average pooling for temporal modeling. DeiT-S-TSM applies temporal channel shift [22] on both the image patches and the class token, with the same channel shift ratio as [22]. The results are shown in Table 7. DeiT-S-TPS improves the top-1 accuracy on Kinetics400 by 2.3% over DeiT-S-2D. DeiT-S-TPS also outperforms DeiT-S-TSM by 0.5%, thanks to its efficient spatiotemporal self-attention modeling.
Table 7: Results with the DeiT-S backbone on Kinetics400.
Model | Pretrain | Crops × Clips | FLOPs | Params | Top-1 | Top-5
---|---|---|---|---|---|---
DeiT-S-2D [36] | IN-1K | 74G | 22M | 73.0 | 90.7 | |
DeiT-S-TSM | IN-1K | 74G | 22M | 74.8 | 91.6 | |
DeiT-S-TPS | IN-1K | 74G | 22M | 75.3 | 91.8 |
0.A.4 Visualization
We use GradCAM [33] to visualize the last-stage feature maps. In Fig. 6, we present three examples of consecutive frames from Something-something V2 videos in the first three columns; the feature activation maps learned by average pooling, channel shift and PST are presented in the fourth, fifth and sixth columns, respectively. The results show that PST learns spatiotemporal relations by associating relevant regions, which clearly indicates the motion of actions. Channel shift can also capture motion to some extent, but its activation maps indicate a receptive field that is not as broad as that of PST, while the feature maps of average pooling can hardly capture the motion.
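A minimal sketch of the Grad-CAM computation behind these visualizations, assuming the last-stage feature map and the predicted-class logit are available (how the feature map is hooked inside PST is an implementation detail not specified here):

```python
import torch

def grad_cam(feature_map, class_logit):
    """Grad-CAM [33] over a (C, T, H, W) feature map of one video.

    feature_map must carry gradients (a leaf tensor with requires_grad=True,
    or a non-leaf tensor on which retain_grad() was called).
    """
    class_logit.backward(retain_graph=True)
    weights = feature_map.grad.mean(dim=(1, 2, 3), keepdim=True)  # GAP of gradients
    cam = torch.relu((weights * feature_map).sum(dim=0))          # (T, H, W)
    return cam / (cam.max() + 1e-6)

feat = torch.randn(96, 8, 7, 7, requires_grad=True)   # dummy last-stage features
logit = (feat ** 2).sum()                              # stand-in for a class logit
print(grad_cam(feat, logit).shape)                     # torch.Size([8, 7, 7])
```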
0.A.5 More visualization results
We also present three visualization examples of PST from Kinetics400 [17], Diving-48 [21] and Something-something V1 [15], respectively, in Fig. 7. The consecutive frames of each video are shown in the first three columns, and the feature maps learned by PST are presented in the last column. It can be seen that PST learns to associate relevant regions, and the feature maps indicate the motion of actions in the video.
References
- [1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: Vivit: A video vision transformer. arXiv: Computer Vision and Pattern Recognition (2021)
- [2] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? arXiv: Computer Vision and Pattern Recognition (2021)
- [3] Bulat, A., Perez-Rua, J.M., Sudhakaran, S., Martinez, B., Tzimiropoulos, G.: Space-time mixing attention for video transformer. In: NeurIPS (2021)
- [4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6299–6308 (2017)
- [5] Chen, C.F., Fan, Q., Panda, R.: Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899 (2021)
- [6] Christoph, R., Pinz, F.A.: Spatiotemporal residual networks for video action recognition. NIPS pp. 3468–3476 (2016)
- [7] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation policies from data (2019), https://arxiv.org/pdf/1805.09501.pdf
- [8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)
- [9] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6824–6835 (October 2021)
- [10] Fan, Q., Chen, C., Panda, R.: An image classifier can suffice for video understanding. CoRR abs/2106.14104 (2021), https://arxiv.org/abs/2106.14104
- [11] Fan, Q., Chen, C.F.R., Kuehne, H., Pistoia, M., Cox, D.: More Is Less: Learning Efficient Video Representations by Temporal Aggregation Modules. In: Advances in Neural Information Processing Systems 33 (2019)
- [12] Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
- [13] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: Int. Conf. Comput. Vis. pp. 6202–6211 (2019)
- [14] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1933–1941 (2016). https://doi.org/10.1109/CVPR.2016.213, https://doi.org/10.1109/CVPR.2016.213
- [15] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The” something something” video database for learning and evaluating visual common sense. In: Int. Conf. Comput. Vis. pp. 5842–5850 (2017)
- [16] Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)
- [17] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
- [18] Kwon, H., Kim, M., Kwak, S., Cho, M.: Motionsqueeze: Neural motion feature learning for video understanding. In: ECCV (2020)
- [19] Li, C., Zhong, Q., Xie, D., Pu, S.: Collaborative spatiotemporal feature learning for video action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
- [20] Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L.: Tea: Temporal excitation and aggregation for action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2020)
- [21] Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 513–528 (2018)
- [22] Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video understanding. In: Int. Conf. Comput. Vis. pp. 7083–7093 (2019)
- [23] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
- [24] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. arXiv: Computer Vision and Pattern Recognition (2021)
- [25] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. arXiv preprint arXiv:2106.13230 (2021)
- [26] Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., Lu, T.: Teinet: Towards an efficient architecture for video recognition. In: AAAI (2020)
- [27] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019), https://openreview.net/forum?id=Bkg6RiCqY7
- [28] Martinez, B., Modolo, D., Xiong, Y., Tighe, J.: Action recognition with spatial-temporal discriminative filter banks. In: Int. Conf. Comput. Vis. pp. 5482–5491 (2019)
- [29] Patrick, M., Campbell, D., Asano, Y.M., Metze, I.M.F., Feichtenhofer, C., Vedaldi, A., Henriques, J.F.: Keeping your eye on the ball: Trajectory attention in video transformers. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
- [30] Piergiovanni, A., Ryoo, M.S.: Representation flow for action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9945–9953 (2019)
- [31] Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: Int. Conf. Comput. Vis. pp. 5533–5541 (2017)
- [32] Rohrbach, M., Rohrbach, A., Regneri, M., Amin, S., Andriluka, M., Pinkal, M., Schiele, B.: Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV 119(3), 346–373 (2016)
- [33] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017). https://doi.org/10.1109/ICCV.2017.74
- [34] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS. pp. 568–576 (2014), http://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos
- [35] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR) (2015), http://arxiv.org/abs/1409.4842
- [36] Touvron, H., Cord, M., Matthijs, D., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: ICML 2021: 38th International Conference on Machine Learning (2021)
- [37] Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Int. Conf. Comput. Vis. pp. 4489–4497 (2015). https://doi.org/10.1109/ICCV.2015.510, https://doi.org/10.1109/ICCV.2015.510
- [38] Wang, L., Tong, Z., Ji, B., Wu, G.: Tdn: Temporal difference networks for efficient action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. (2021)
- [39] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: Eur. Conf. Comput. Vis. (2016)
- [40] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: Eur. Conf. Comput. Vis. pp. 20–36. Springer (2016)
- [41] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. IEEE Conf. Comput. Vis. Pattern Recog. (2018)
- [42] Wang, Z., She, Q., Smolic, A.: Action-net: Multipath excitation for action recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2021)
- [43] Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action recognition. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 591–600 (2020)
- [44] Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Tay, F.E.H., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986 (2021)
- [45] Zhang, H., Hao, Y., Ngo, C.W.: Token shift transformer for video classification. arXiv: Computer Vision and Pattern Recognition (2021)
- [46] Zhao, Y., Xiong, Y., Lin, D.: Recognize actions by disentangling components of dynamics. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6566–6575 (2018)
- [47] Zheng, Y.D., Liu, Z., Lu, T., Wang, L.: Dynamic sampling networks for efficient action recognition in videos. TIP 29, 7970–7983 (2020)