
Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding

Chaolei Tan1*    Zihang Lin1*    Jian-Fang Hu1    Xiang Li2    Wei-Shi Zheng1
1Sun Yat-sen University, China    2Meituan
*Equal contribution.
{tanchlei, linzh59}@mail2.sysu.edu.cn, hujf5@mail.sysu.edu.cn
lixiang82@meituan.com, wszheng@ieee.org
Abstract

We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. Specifically, we improve the original 2D-TAN[6] in two aspects: first, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling; second, we employ a Random Concatenation Augmentation (RCA) mechanism during training. In the second stage, we use a pretrained MDETR[4] model to generate per-frame bounding boxes from the language query, and design a set of hand-crafted rules to select the best matching bounding box output by MDETR for each frame within the grounded moment.

1 Methodology

Human-centric spatio-temporal video grounding (HC-STVG) task [5] aims to localize a spatio-temporal tube of the target person indicated by a language description. We tackle this problem with a two-stage approach which first localizes the temporal segment and then predicts the spatial location of the target person in each frame. In the first stage, we develop a temporal grounding model named Augmented 2D-TAN which predicts the temporal boundary (i.e., the start time and end time) of the target video segment corresponding to the given description. In the second stage, we generate per-frame proposals (bounding boxes) using MDETR[4] and find the best matching bounding box for each frame in the localized segment.
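To make the overall pipeline concrete, the following minimal Python sketch wires the two stages together. The callables `temporal_grounder` and `spatial_referrer` are hypothetical stand-ins for Augmented 2D-TAN (Section 1.1) and the MDETR-based per-frame referring step (Section 1.2); their names and signatures are our own illustrative choices.

```python
def ground_spatio_temporal_tube(video_frames, query, temporal_grounder, spatial_referrer):
    """Return (start_idx, end_idx, boxes) for the person described by `query`."""
    # Stage 1: temporal grounding -> a (start, end) frame interval.
    start_idx, end_idx = temporal_grounder(video_frames, query)

    # Stage 2: per-frame spatial grounding inside the localized interval.
    boxes = [spatial_referrer(frame, query)
             for frame in video_frames[start_idx:end_idx + 1]]
    return start_idx, end_idx, boxes
```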

1.1 Augmented 2D-TAN for Temporal Localization

We develop our Augmented 2D-TAN grounding method (illustrated in Figure 1) based on 2D-TAN (2D Temporal Adjacent Networks)[6]. In this section, we first briefly review 2D-TAN and its weaknesses, and then introduce how we improve it.

Figure 1: Framework of Augmented 2D-TAN.

Review of 2D-TAN. 2D-TAN is a video grounding algorithm which first splits the input video into $N$ clips and extracts visual features for each clip (we use a SlowFast network [2] pretrained on Kinetics-600 [1] and AVA [3] to extract the visual features). Then dense moment proposals (a moment is a sequence of clips) are generated as $\mathcal{P}=\{P_{ij}=\left[C_{i},C_{i+1},\dots,C_{j}\right] \mid 1\leq i\leq j\leq N\}$, where moment $P_{ij}$ spans from the $i$-th clip $C_{i}$ to the $j$-th clip $C_{j}$. The feature of each moment proposal is extracted by a moment feature aggregation module $MFA(\cdot)$. Then a moment-level 2D feature map $M\in\mathbb{R}^{N\times N\times c}$ is constructed as:

$$M_{ij}=\begin{cases}MFA\left(\left[F_{i},F_{i+1},\dots,F_{j}\right]\right)&i\leq j\\ \mathbf{0}&i>j\end{cases}\quad(1)$$

where $F_{i}$ is the feature of clip $C_{i}$. A Temporal Adjacent Network is then employed to infer the moment-wise matching scores between $M$ and the sentence feature, and the moment with the largest matching score is output as the grounding result. In the original implementation of 2D-TAN, $MFA(\cdot)$ is instantiated as a max-pooling operation, which cannot model long-term temporal variation. Hence, the captured moment features are over-smoothed and the features of adjacent moments can be very similar, i.e., not discriminative. We therefore propose a Bi-LSTM moment feature aggregation module to learn a more discriminative feature representation for each moment, and a Random Concatenation Augmentation strategy to augment the training data. We introduce the Bi-LSTM module and the augmentation strategy in the following.
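For reference, the sketch below constructs the dense 2D moment feature map of Eq. (1) with max-pooling as $MFA(\cdot)$, as in the original 2D-TAN. It is a direct, unoptimized illustration (a dense double loop); practical implementations typically avoid materializing all moments this way.

```python
import torch

def build_2d_moment_feature_map(clip_feats):
    """Construct the N x N x c moment feature map of Eq. (1) using
    max-pooling as the moment feature aggregation MFA.

    clip_feats: tensor of shape (N, c) holding one feature F_i per clip C_i.
    """
    N, c = clip_feats.shape
    M = clip_feats.new_zeros(N, N, c)            # entries with i > j stay zero
    for i in range(N):
        for j in range(i, N):
            # Max-pool the clip features F_i, ..., F_j of moment P_ij.
            M[i, j] = clip_feats[i:j + 1].max(dim=0).values
    return M
```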

Figure 2: The proposed Bi-LSTM aggregation module.

Bi-LSTM Aggregation Module. As shown in Figure 2, the Bi-LSTM Aggregation Module consists of a forward LSTM and a backward LSTM, which together model the temporal variation within each moment. Thus, compared with the original 2D-TAN, the moment features learned with our Bi-LSTM aggregation module are more suitable for computing the cross-modal matching score with the sentence feature. Note that the parameters of our Bi-LSTM are shared across all moments, so inference can be carried out in parallel for efficiency.
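A minimal sketch of such a module is given below, assuming the moment feature is read out as the concatenated final hidden states of the two directions; this readout, the class name, and the hidden size are our own illustrative choices. Because the Bi-LSTM parameters are shared, all moments can be batched through the same module in practice.

```python
import torch
import torch.nn as nn

class BiLSTMAggregation(nn.Module):
    """Bi-LSTM moment feature aggregation: one Bi-LSTM, shared across all
    moments, encodes the clip features of a moment; the final hidden states
    of the forward and backward passes form the moment feature."""

    def __init__(self, clip_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(clip_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, clip_feats):
        # clip_feats: (T, clip_dim) -- the clip features F_i, ..., F_j of one moment.
        _, (h_n, _) = self.lstm(clip_feats.unsqueeze(0))
        # h_n: (2, 1, hidden_dim) -> concatenate forward and backward final states.
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)   # (2 * hidden_dim,)
```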

Random Concatenation Augmentation. We observe severe overfitting in the original implementation of 2D-TAN and propose to randomly concatenate training samples to augment the training data. Specifically, we randomly select two videos (with similar frame rates) $V, V^{\prime}$ and their corresponding language queries $Q, Q^{\prime}$. We generate the augmented video as $V_{aug}=\{V_{(t)}\}_{t=\tau_{s}-\delta_{1}}^{\tau_{e}+\delta_{2}}\,||\,\{V^{\prime}_{(t)}\}_{t=\tau_{s}^{\prime}-\delta^{\prime}_{1}}^{\tau_{e}^{\prime}+\delta^{\prime}_{2}}$, where $\tau_{s}$ and $\tau_{e}$ denote the start and end frame indices of the ground-truth moment, $\delta_{1},\delta_{2}$ are random offsets chosen such that the number of frames in $V_{aug}$ is similar to that of $V$, and $||$ denotes concatenation along the temporal dimension. We randomly assign $Q$ or $Q^{\prime}$ as the query of $V_{aug}$. During training, we perform this augmentation with a probability of 0.5 at each training step. The proposed augmentation strategy greatly enriches the training data and reduces the risk that the model simply learns to ground the salient segment.
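The sketch below illustrates one possible realization of RCA at the frame (or clip-feature) level. The offset range `max_extra`, the way the new ground-truth interval is derived, and the omission of the length-matching constraint on $V_{aug}$ are our simplifying assumptions, not specified implementation details.

```python
import random

def random_concat_augmentation(video, gt, video2, gt2, query, query2, max_extra=16):
    """Randomly concatenate two trimmed training videos and keep one query.

    video, video2: lists of frames (or clip features) from two training videos.
    gt, gt2:       (start, end) indices of their ground-truth moments.
    Returns (augmented_video, query, new ground-truth interval).
    """
    (s1, e1), (s2, e2) = gt, gt2
    # Random temporal offsets around both ground-truth moments.
    d1 = random.randint(0, min(max_extra, s1))
    d2 = random.randint(0, min(max_extra, len(video) - 1 - e1))
    d1p = random.randint(0, min(max_extra, s2))
    d2p = random.randint(0, min(max_extra, len(video2) - 1 - e2))

    seg1 = video[s1 - d1:e1 + d2 + 1]
    seg2 = video2[s2 - d1p:e2 + d2p + 1]
    aug_video = seg1 + seg2                       # concatenate along time
    # NOTE: the report additionally keeps len(aug_video) close to len(video),
    # which this sketch does not enforce.

    # Randomly keep one of the two queries; the ground truth follows it.
    if random.random() < 0.5:
        aug_query, aug_gt = query, (d1, d1 + (e1 - s1))
    else:
        aug_query, aug_gt = query2, (len(seg1) + d1p, len(seg1) + d1p + (e2 - s2))
    return aug_video, aug_query, aug_gt
```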

1.2 Per-frame Referring for Spatial Localization

We predict the spatial location of the target person with a referring algorithm. For each frame, we use the pretrained MDETR[4] model to predict a set of bounding boxes together with their corresponding grounding text (several words extracted from the input sentence). We then use a dependency parser[7] to find the subject of the input sentence, and filter out the bounding boxes (predicted by MDETR) whose grounding text does not contain the subject or refers to more than one person. Finally, we select the bounding box with the longest grounding text from the remaining boxes as our output.
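A sketch of these selection rules for a single frame is given below. Here `candidates` is assumed to be a list of (box, grounding_text) pairs already extracted from MDETR's predictions, `subject` is the subject word returned by the dependency parser, and the person-word list is our own illustrative choice rather than part of the report.

```python
def select_box(candidates, subject,
               person_words=("man", "woman", "boy", "girl", "person", "lady", "guy")):
    """Apply the hand-crafted rules to pick one bounding box for a frame."""
    kept = []
    for box, text in candidates:
        words = text.lower().split()
        # Rule 1: the grounding text must mention the subject of the sentence.
        if subject.lower() not in words:
            continue
        # Rule 2: discard boxes whose grounding text refers to more than one person.
        if sum(w in person_words for w in words) > 1:
            continue
        kept.append((box, text))
    if not kept:
        return None
    # Rule 3: among the remaining boxes, pick the one with the longest grounding text.
    return max(kept, key=lambda bt: len(bt[1].split()))[0]
```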

2 Experiments

As shown in Table 1, the proposed Random Concatenation Augmentation (RCA) and Bi-LSTM aggregation module both bring consistent performance improvements, which demonstrates their effectiveness. To further boost the performance, we ensemble 10 Augmented 2D-TAN models and achieve 31.9% vIoU on the test set. As shown in Table 2, the ensembled version achieves the best performance on the HC-STVG test set[5] compared with other methods (results of the other methods are taken from http://www.picdataset.com/challenge/leaderboard/hcvg2021).

Method      vIoU@0.3  vIoU@0.5  tIoU  vIoU  Split
2D-TAN      47.2      16.7      53.4  29.5  val
+RCA        48.8      16.7      54.1  30.0  val
+Bi-LSTM    50.4      18.8      55.2  30.4  val
2D-TAN      47.7      20.2      53.3  29.4  test
Ours†       52.7      22.2      56.5  31.9  test
Table 1: Ablation study on the HC-STVG dataset[5]. †: Ensemble.
Method            vIoU@0.3  vIoU@0.5  tIoU  vIoU
easy_baseline2.5  48.6      24.4      66.0  30.9
try10             47.9      27.0      53.8  31.3
2stage            50.2      22.5      55.9  31.4
Ours              52.7      22.2      56.5  31.9
Table 2: Comparison results on the HC-STVG test set[5].

3 Conclusion

In this report, we introduce our Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) for the HC-STVG task. By virtue of the proposed temporal context-aware Bi-LSTM Aggregation Module and Random Concatenation Augmentation mechanism, Augmented 2D-TAN captures more discriminative moment features and is less prone to overfitting the training data. Our proposed method achieves state-of-the-art performance on the HC-STVG test set[5].

References

  • [1] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  • [2] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6202–6211, 2019.
  • [3] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
  • [4] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR – modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763, 2021.
  • [5] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [6] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12870–12877, 2020.
  • [7] Yu Zhang, Zhenghua Li, and Min Zhang. Efficient second-order TreeCRF for neural dependency parsing. In Proceedings of ACL, pages 3295–3305, 2020.