Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding
Abstract
We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. Specifically, we improve the original 2D-TAN [6] in two aspects: first, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling; second, we employ a Random Concatenation Augmentation (RCA) mechanism during training. In the second stage, we use a pretrained MDETR [4] model to generate per-frame bounding boxes from the language query, and design a set of hand-crafted rules to select the best matching bounding box output by MDETR for each frame within the grounded moment.
1 Methodology
The Human-centric Spatio-Temporal Video Grounding (HC-STVG) task [5] aims to localize a spatio-temporal tube of the target person indicated by a language description. We tackle this problem with a two-stage approach that first localizes the temporal segment and then predicts the spatial location of the target person in each frame. In the first stage, we develop a temporal grounding model named Augmented 2D-TAN, which predicts the temporal boundary (i.e., the start time and end time) of the video segment corresponding to the given description. In the second stage, we generate per-frame proposals (bounding boxes) using MDETR [4] and find the best matching bounding box for each frame in the localized segment.
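For concreteness, the overall pipeline can be summarized by the minimal Python sketch below, where `localize_moment` and `refer_in_frame` stand in for the Augmented 2D-TAN and the MDETR-based per-frame referring module described in the following subsections; the function names and signatures are illustrative assumptions, not our actual interfaces.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), illustrative convention

def ground_spatio_temporal_tube(
    frames: List,                                             # decoded video frames
    query: str,                                               # the language description
    localize_moment: Callable[[List, str], Tuple[int, int]],  # stage 1: temporal grounding
    refer_in_frame: Callable[[object, str], Box],             # stage 2: per-frame referring
) -> Tuple[Tuple[int, int], List[Box]]:
    """Two-stage HC-STVG pipeline: ground the moment, then refer in each frame."""
    start, end = localize_moment(frames, query)               # inclusive frame indices
    boxes = [refer_in_frame(frames[t], query) for t in range(start, end + 1)]
    return (start, end), boxes
```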
1.1 Augmented 2D-TAN for Temporal Localization
We develop our Augmented 2D-TAN grounding method (illustrated in Figure 1) based on 2D-TAN (2D Temporal Adjacent Networks) [6]. In this section, we first briefly review 2D-TAN and its weaknesses, and then introduce how we improve it.

Review of 2D-TAN. 2D-TAN is a video grounding algorithm which first splits the input video into clips and extracts visual features for each clip (we use a SlowFast network [2] pretrained on Kinetics-600 [1] and AVA [3] to extract the visual features). Then dense moment proposals (a moment is a sequence of consecutive clips) are generated as $\{m_{i,j}\}$, where moment $m_{i,j}$ contains the $i$-th clip $c_i$ to the $j$-th clip $c_j$. The feature of each moment proposal is extracted by a moment feature aggregation module $g(\cdot)$, and a moment-level 2D feature map $F$ is constructed as:
$$F_{i,j} = g(f_i, f_{i+1}, \ldots, f_j) \qquad (1)$$
where $f_i$ is the feature of clip $c_i$. Then a Temporal Adjacent Network is employed to infer the matching score between each moment feature in $F$ and the sentence feature, and the moment with the largest matching score is output as the grounding result. In the original implementation of 2D-TAN, $g(\cdot)$ is instantiated as a max-pooling operation, which cannot model long-term temporal variation. Hence, the captured moment features are over-smoothed and the features of adjacent moments can be very similar, i.e., not discriminative. We propose a Bi-LSTM moment feature aggregation module to learn a more discriminative feature representation for each moment, and further propose a Random Concatenation Augmentation strategy to augment the training data. We introduce them in the following.
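For reference, the following minimal PyTorch sketch shows how the 2D moment feature map in Eq. (1) could be built from clip features with the original max-pooling aggregator; the nested-loop form, tensor shapes, and variable names are illustrative only and not an efficient implementation.

```python
import torch

def build_moment_feature_map(clip_feats, aggregate):
    """Build the 2D moment feature map F[i, j] = aggregate(f_i, ..., f_j).

    clip_feats: (N, D) tensor of clip-level features f_1..f_N.
    aggregate:  callable mapping an (L, D) clip sequence to a (D,) moment feature.
    Returns:    (N, N, D) tensor; F[i, j] is valid only for j >= i.
    """
    n, d = clip_feats.shape
    fmap = torch.zeros(n, n, d)
    for i in range(n):
        for j in range(i, n):                      # moment m_{i,j} spans clips i..j
            fmap[i, j] = aggregate(clip_feats[i:j + 1])
    return fmap

# Original 2D-TAN: max-pooling over the clips of each moment.
max_pool = lambda seq: seq.max(dim=0).values

clip_feats = torch.randn(16, 512)                  # e.g. 16 clips, 512-d clip features
fmap = build_moment_feature_map(clip_feats, max_pool)
print(fmap.shape)                                  # torch.Size([16, 16, 512])
```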

Bi-LSTM Aggregation Module. As shown in Figure 2, the Bi-LSTM Aggregation Module consists of a forward LSTM and a backward LSTM, which together model the temporal variation within each moment. Thus, compared with the original 2D-TAN, the moment features learned with our Bi-LSTM aggregation module are more suitable for computing the cross-modal matching score with the sentence feature. Note that the parameters of the Bi-LSTM are shared across all moments, so inference over all moments can be performed in parallel for efficiency.
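The sketch below gives one possible form of such a Bi-LSTM aggregator; the hidden size and the way the final forward and backward states are fused and projected are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn

class BiLSTMAggregator(nn.Module):
    """Aggregate a variable-length clip sequence into a single moment feature.

    The final hidden states of the forward and backward LSTMs are concatenated
    and projected back to the clip-feature dimension (an illustrative choice).
    """
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, clip_seq):                            # clip_seq: (L, D)
        _, (h_n, _) = self.lstm(clip_seq.unsqueeze(0))      # h_n: (2, 1, hidden_dim)
        fused = torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)   # forward ++ backward final states
        return self.proj(fused)                             # (D,)

agg = BiLSTMAggregator(feat_dim=512)
moment_feat = agg(torch.randn(7, 512))                      # a 7-clip moment -> (512,) feature
```

Because its parameters are shared across moments, the same module instance can replace `max_pool` in the feature-map sketch above.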
Random Concatenation Augmentation. We observe severe overfitting with the original implementation of 2D-TAN and propose to randomly concatenate training samples to augment the training data. Specifically, we randomly select two videos $V_1$ and $V_2$ (with similar frame rates) and their corresponding language queries $Q_1$ and $Q_2$. We generate the augmented video as $V_{aug} = V_1[s_1-\delta_1 : e_1+\delta_2] \oplus V_2[s_2-\delta_3 : e_2+\delta_4]$, where $s_k$ and $e_k$ indicate the start and end frame indices of the ground-truth moment of $V_k$, respectively. $\delta_1, \ldots, \delta_4$ are random offsets, and we ensure that the number of frames in $V_{aug}$ is similar to that of the original videos. $\oplus$ represents the concatenation operation along the temporal dimension. We randomly assign $Q_1$ or $Q_2$ as the query of $V_{aug}$. During training, we perform this augmentation with a probability of 0.5 at each training step. The proposed augmentation strategy greatly enriches the training data and reduces the risk that the model simply learns to ground the most salient segment.
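The following sketch illustrates the RCA step at the frame-index level for a single pair of samples; the offset range, the exclusive end indices, and the way the ground-truth span is remapped into the augmented video are illustrative assumptions, and the length-matching constraint mentioned above is omitted for brevity.

```python
import random

def random_concat_augment(v1, gt1, q1, v2, gt2, q2, max_offset=8):
    """Random Concatenation Augmentation (illustrative sketch).

    v1, v2:   frame lists of two videos with similar frame rates.
    gt1, gt2: (start, end) frame indices of the ground-truth moments (end exclusive here).
    q1, q2:   the corresponding language queries.
    Returns:  (augmented frames, training query, ground-truth span inside the augmented video).
    """
    (s1, e1), (s2, e2) = gt1, gt2
    # Random context offsets around each ground-truth moment.
    a1 = max(0, s1 - random.randint(0, max_offset))
    b1 = min(len(v1), e1 + random.randint(0, max_offset))
    a2 = max(0, s2 - random.randint(0, max_offset))
    b2 = min(len(v2), e2 + random.randint(0, max_offset))

    v_aug = list(v1[a1:b1]) + list(v2[a2:b2])      # concatenate along the temporal dimension

    if random.random() < 0.5:                      # randomly keep q1 or q2 as the query
        return v_aug, q1, (s1 - a1, e1 - a1)
    offset = b1 - a1                               # frames contributed by the first segment
    return v_aug, q2, (offset + s2 - a2, offset + e2 - a2)
```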
1.2 Per-frame Referring for Spatial Localization
We predict the spatial location of the target person with a referring algorithm. For each frame, we use the pretrained MDETR [4] model to predict a set of bounding boxes and their corresponding grounding texts (several words extracted from the input sentence). We then use a dependency parser [7] to find the subject of the input sentence, and filter out the bounding boxes (predicted by MDETR) whose grounding text does not contain the subject or contains more than one person. Finally, we select the bounding box with the longest grounding text from the remaining boxes as our output.
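The hand-crafted selection rules can be summarized by the sketch below. The per-frame candidate layout (box, grounding text, number of person mentions) is an assumed pre-processing of MDETR's predictions rather than its actual output format, and `subject` is the word returned by the dependency parser.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]

def select_box(
    candidates: List[Tuple[Box, str, int]],
    subject: str,
) -> Optional[Box]:
    """Pick one bounding box per frame with the hand-crafted rules.

    candidates: per-frame list of (box, grounding_text, num_persons_in_text) tuples,
                assumed to be organized from MDETR's predictions beforehand.
    subject:    the subject word of the input sentence.
    Returns the box whose grounding text contains the subject, mentions at most one
    person, and is the longest among the remaining candidates (None if none survive).
    """
    kept = [
        (box, text) for box, text, n_person in candidates
        if subject.lower() in text.lower() and n_person <= 1
    ]
    if not kept:
        return None
    return max(kept, key=lambda bt: len(bt[1]))[0]

# Hypothetical frame-level candidates: (box, grounding text, #persons mentioned).
frame_candidates = [
    ((10, 20, 50, 120), "the woman in a red coat", 1),
    ((60, 15, 40, 110), "the woman and the man", 2),
    ((15, 25, 45, 100), "red coat", 0),
]
print(select_box(frame_candidates, subject="woman"))   # -> (10, 20, 50, 120)
```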
2 Experiments
As shown in Table 1, the proposed Random Concatenation Augmentation (RCA) and the Bi-LSTM Aggregation Module both bring consistent performance improvements, which demonstrates their effectiveness. To further boost performance, we ensemble 10 Augmented 2D-TAN models and achieve 31.9% vIoU on the test set. As shown in Table 2, the ensembled version achieves the best performance on the HC-STVG test set [5] compared with other methods (results are copied from http://www.picdataset.com/challenge/leaderboard/hcvg2021).
Table 1: Ablation study and test-set results of Augmented 2D-TAN on the HC-STVG dataset (%).

| Method | vIoU@0.3 | vIoU@0.5 | tIoU | vIoU | split |
|---|---|---|---|---|---|
| 2D-TAN | 47.2 | 16.7 | 53.4 | 29.5 | val |
| +RCA | 48.8 | 16.7 | 54.1 | 30.0 | val |
| +Bi-LSTM | 50.4 | 18.8 | 55.2 | 30.4 | val |
| 2D-TAN† | 47.7 | 20.2 | 53.3 | 29.4 | test |
| Ours† | 52.7 | 22.2 | 56.5 | 31.9 | test |
Table 2: Comparison with other entries on the HC-STVG test set (%).

| Method | vIoU@0.3 | vIoU@0.5 | tIoU | vIoU |
|---|---|---|---|---|
| easy_baseline2.5 | 48.6 | 24.4 | 66.0 | 30.9 |
| try10 | 47.9 | 27.0 | 53.8 | 31.3 |
| 2stage | 50.2 | 22.5 | 55.9 | 31.4 |
| Ours | 52.7 | 22.2 | 56.5 | 31.9 |
3 Conclusion
In this report, we introduce our Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) for the HC-STVG task. By virtue of the proposed temporal context-aware Bi-LSTM Aggregation Module and Random Concatenation Augmentation mechanism, Augmented 2D-TAN captures more discriminative moment features and is less prone to overfitting the training data. Our proposed method achieves state-of-the-art performance on the HC-STVG test set [5].
References
- [1] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about Kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
- [2] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6202–6211, 2019.
- [3] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6047–6056, 2018.
- [4] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR - modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763, 2021.
- [5] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
- [6] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12870–12877, 2020.
- [7] Yu Zhang, Zhenghua Li, and Min Zhang. Efficient second-order TreeCRF for neural dependency parsing. In Proceedings of ACL, pages 3295–3305, 2020.