
Winning the ICCV’2021 VALUE Challenge:
Task-aware Ensemble and Transfer Learning with Visual Concepts

Minchul Shin1 Jonghwan Mun1 Kyoung-Woon On1† Woo-Young Kang1† Gunsoo Han1† Eun-Sol Kim2
1Kakao Brain    2Hanyang University
{craig.starr,jason.mun,kloud.ohn,edwin.kang,coco.han}@kakaobrain.com, eunsolkim@hanyang.ac.kr
This work was done while at Kakao Brain
Abstract

The VALUE (Video-And-Language Understanding Evaluation) benchmark is newly introduced to evaluate and analyze multi-modal representation learning algorithms on three video-and-language tasks: Retrieval, QA, and Captioning. The main objective of the VALUE challenge is to train a task-agnostic model that is simultaneously applicable to various tasks with different characteristics. This technical report describes our winning strategies for the VALUE challenge: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble. The first and third strategies are designed to address the heterogeneous characteristics of each task, and the second one is to leverage rich and fine-grained visual information. We provide a detailed and comprehensive analysis with extensive experimental results. Based on our approach, we ranked first place in the VALUE and QA phases of the competition.

† Equal contribution

1 Introduction

In recent years, one of the major research streams has been pre-training a foundation model (e.g., BERT [3], GPT-3 [2], and CLIP [13]) followed by transfer learning to multiple downstream tasks. Following this trend, pre-training multi-modal representations for videos (e.g., MIL-NCE [11] and HERO [8]) has also been widely studied using large-scale datasets. However, due to the lack of a common benchmark, these pre-training algorithms are evaluated on different downstream tasks, making direct comparison difficult.

Motivated by this, the VALUE benchmark (or challenge) is proposed to measure the generalization ability and versatility of multi-modal pre-trained models for videos. The VALUE benchmark has two critical characteristics compared to other benchmarks. First, it comprehensively measures the generalization ability of models over popular video-and-language understanding tasks, including video retrieval, video question answering, and video captioning, as presented in Figure 1. These three macro-tasks comprise eleven tasks built on six widely used datasets (TV [6], HowTo100M [12], YouCook2 [19], VATEX [15], VIOLIN [10], and VLEP [7]) covering a broad range of video genres, lengths, and data volumes. Second, the videos in the VALUE benchmark come with multi-modal inputs, including frames, audio, and textual information, whereas most existing works tend to focus only on visual cues. Therefore, the VALUE benchmark deals with multi-modal, multi-task video data.

Figure 1: Illustration of three video-and-language tasks: (a) text-based video retrieval, (b) video question answering, and (c) video captioning.
Figure 2: The entire pipeline of our approach. The model is initialized with the HERO model pre-trained using TV [6] and HowTo100M [12] datasets. When the selected fine-tuning strategy is single-task (ST) only, the process of all-task (AT) fine-tuning is skipped.

In this report, we describe our three winning strategies for the VALUE challenge. The first strategy is single model optimization. Since each task behaves differently, using the same training configuration for every task would be sub-optimal. Through extensive experiments, we identify the best-performing training configuration for each task, namely the best combination of visual features and fine-tuning strategies. Second, we exploit visual concepts (objects and associated attributes) as an auxiliary source and combine them with the global clip-level visual representations, which allows our model to leverage rich and fine-grained visual information. Finally, we adopt a task-aware ensemble strategy: since the output format of a model varies depending on the task, we apply different ensemble strategies specialized for each task. Through these three strategies, we ranked first place in the VALUE and QA phases.

Table 1: Optimized fine-tuning hyperparameters of HERO for individual tasks.

| Hyperparameter | TVR | How2R | YC2R | VATEX-EN-R | TVQA | How2QA | VIOLIN | VLEP | TVC | YC2C | VATEX-EN-C |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Learning rate | 1e-04 | 1e-04 | 7e-05 | 7e-05 | 5e-05 | 5e-05 | 3e-05 | 5e-05 | 1e-04 | 1e-04 | 1e-04 |
| # GPUs | 4 | 4 | 4 | 2 | 8 | 4 | 2 | 4 | 2 | 2 | 2 |
| Batch size per GPU | 16 | 16 | 24 | 256 | 2 | 4 | 16 | 4 | 8 | 32 | 64 |
| Gradient accumulation steps | 8 | 8 | 4 | 2 | 32 | 4 | 2 | 4 | 2 | 2 | 8 |
| Effective batch size | 512 | 512 | 384 | 1024 | 512 | 64 | 64 | 64 | 32 | 128 | 1024 |
| # fine-tuning steps | 5,000 | 3,000 | 4,000 | 4,000 | 10,000 | 2,000 | 6,000 | 1,000 | 7,000 | 7,000 | 7,000 |

2 Our Approach

In this section, we introduce our winning solution in detail. Our base model is HERO [8], and we begin with the starter code (https://github.com/VALUE-Leaderboard/StarterCode) provided by the competition organizers. Based on this, we focus on the fine-tuning stage for individual tasks and improve the performance through the following three strategies: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble. We opt for the best model based on the validation score, assuming that the distributions of the validation and test sets are similar. We illustrate the entire process of our pipeline in Figure 2.

2.1 Revisiting VALUE Challenge

The VALUE benchmark is a collection of video-and-language datasets on multi-channel videos (e.g., video and subtitle) across various video domains and genres. The benchmark contains 11 tasks (TVR, How2R, YC2R, VATEX-EN-R, TVQA, How2QA, VIOLIN, VLEP, TVC, YC2C, and VATEX-EN-C), each of which belongs to one of three video-and-language macro-tasks, (a) text-based video retrieval (Retrieval), (b) video question answering (QA), and (c) video captioning (Captioning), as illustrated in Figure 1. In this competition, raw videos are not provided due to license issues. Instead, the competition organizers provide raw text data (i.e., subtitles) and eight different types of clip-level visual features (ResNet+SlowFast, ResNet+MIL-NCE, Clip-ViT+SlowFast, Clip-ViT+MIL-NCE, ResNet, SlowFast, Clip-ViT, MIL-NCE) extracted from different pre-trained models (ResNet [5], SlowFast [4], Clip-ViT [14], MIL-NCE [11]). Note that the plus (+) mark indicates concatenation of multiple features.

Under these constraints, we choose the HERO model as our baseline for two reasons. First, the starter code based on HERO provides detailed training configurations for each task. Second, we encountered enormous data loss when downloading raw videos from YouTube for various reasons (e.g., broken URLs, blocked regions), which makes pre-training any other algorithm from raw videos intractable. HERO follows a two-stage training scheme: (1) self-supervised learning on large-scale data, and (2) supervised fine-tuning on relatively smaller-scale data. Given limited resources and time, we set our objective to improve the fine-tuning stage, because pre-training a single HERO model takes three weeks even with 16 GPUs, as stated in the original paper. However, we believe that pre-training HERO could be more important and influential than fine-tuning for achieving a higher score in the competition. We refer interested readers to [8] for more details about the structure of HERO and how it is applied to different tasks.

2.2 Single Model Optimization

Assuming our model is already pre-trained, we start by optimizing training configurations to fit each of the target tasks best; the training hyper-parameters for individual tasks are summarized in Table 1. In addition, we search for the best combination of visual features and fine-tuning strategies, which will be described below.

Visual feature adaptation.

As discovered by Li et al. [9], although HERO is pre-trained using the ResNet+SlowFast visual feature, fine-tuning the model with different types of visual features can improve performance. Choosing the visual feature that works best for the target task is therefore vital for achieving a higher score during the fine-tuning stage. We perform extensive experiments and identify the best-adapted feature out of the eight visual features for each task.

Fine-tuning strategies.

The VALUE baseline paper [9] explored two fine-tuning schemes: ST and AT→ST. Specifically, ST (single-task) fine-tunes the HERO model on the target task only, whereas AT→ST (all-task to single-task) performs multi-task learning over all eleven tasks followed by further fine-tuning on a single target task. We found that the best strategy is not uniform but rather differs across target tasks. Based on this observation, we exhaustively search for the best combination of visual feature and fine-tuning scheme for each task; the best-performing combinations are reported in Table 2.

2.3 Transfer Learning with Visual Concepts

Visual concepts are often used in many vision-and-language tasks [16, 17] to complement the global-level visual cues obtained from 2D or 3D CNNs (e.g., ResNet [5], SlowFast [4]). Inspired by this, we leverage visual concepts as an additional language source during the fine-tuning stage of individual downstream tasks. To extract the visual concepts, we employ VinVL [18], a detection model that provides visual concepts such as objects and attributes, due to its good performance. Given a video, we first sample one frame from the time interval of each subtitle. VinVL is then applied to the sampled frames and provides three visual concept labels for each of at most 10 regions per frame. The extracted visual concept labels are attached to the subtitles and fed to the text embedding network of HERO (i.e., the RoBERTa tokenizer followed by a word embedding layer with positional encoding); the visual concepts of individual regions are separated by the [SEP] token. Examples of extracted visual concepts are illustrated in Figure 3.
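To make this concrete, the snippet below sketches how region-level concepts can be appended to a subtitle, assuming a hypothetical `detect_concepts` wrapper in place of actual VinVL inference (its toy output is for illustration only and is not part of the official starter code).

```python
# A minimal sketch of appending VinVL visual concepts to subtitles before they
# are fed to HERO's text embedding network. `detect_concepts` is a hypothetical
# stand-in for a VinVL forward pass: in practice it would return object and
# attribute labels for at most 10 regions, three labels per region.

def detect_concepts(frame, max_regions=10, labels_per_region=3):
    # Placeholder for VinVL inference; fixed toy output for illustration only.
    return [["sliced", "red", "tomato"], ["wooden", "cutting", "board"]][:max_regions]

def attach_visual_concepts(subtitles, frames):
    """For each subtitle interval, append its frame's [SEP]-separated region concepts."""
    augmented = []
    for text, frame in zip(subtitles, frames):
        regions = detect_concepts(frame)
        concept_str = " [SEP] ".join(" ".join(labels) for labels in regions)
        # The combined string is then tokenized by HERO's RoBERTa tokenizer as usual;
        # if no concepts are available (e.g., a missing video), the subtitle is kept as-is.
        augmented.append(f"{text} [SEP] {concept_str}" if regions else text)
    return augmented

# Example: one frame sampled from the time interval of one subtitle.
print(attach_visual_concepts(["now cut the tomato"], [None]))
# -> ['now cut the tomato [SEP] sliced red tomato [SEP] wooden cutting board']
```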

Refer to caption
Figure 3: Example of visual concepts extracted from a frame in a YouCook2 video. The extracted visual concepts are exploited as additional language sources by attaching them to the subtitles.

2.4 Task-aware Ensemble Strategies

Since the output of a model differs depending on the task, different ensemble strategies need to be established. For example, the simplest form of model ensemble, averaging confidence scores, cannot be applied to the captioning task because a captioning model outputs the predicted caption itself rather than confidence scores. Therefore, we specialize the ensemble strategy for each task.

Table 2: Single and Ensemble Model Performance. The baseline column denotes the best score of VALUE baselines shown on the leaderboard. CS, M, and RS indicate visual features of Clip-ViT+SlowFast, MIL-NCE, and ResNet+SlowFast, respectively.

| # | Macro-Task | Dataset (Task) | V-Concept | AT | ST | Baseline | Ours (Best Single) | Ours (Ensemble) | Metrics |
|---|---|---|---|---|---|---|---|---|---|
| M1 | Retrieval | TVR | - | CS | CS | 13.70 | 13.15 | 14.80 (+12.55%) | AveR |
| M2 | Retrieval | How2R | ✓ | - | M | 3.95 | 6.05 | 8.44 (+39.55%) | AveR |
| M3 | Retrieval | YC2R | - | - | M | 56.59 | 54.46 | 57.22 (+5.07%) | AveR |
| M4 | Retrieval | VATEX-EN-R | ✓ | CS | CS | 49.91 | 76.18 | 73.54 (-3.47%) | AveR |
| M5 | QA | TVQA | ✓ | - | RS | 74.83 | 75.63 | 77.66 (+2.68%) | Acc. |
| M6 | QA | How2QA | ✓ | - | RS | 74.60 | 77.11 | 79.74 (+3.41%) | Acc. |
| M7 | QA | VIOLIN | - | CS | CS | 67.18 | 69.43 | 70.72 (+1.86%) | Acc. |
| M8 | QA | VLEP | - | CS | CS | 69.37 | 68.21 | 69.15 (+1.38%) | Acc. |
| M9 | Captioning | TVC | ✓ | - | CS | 51.04 | 52.30 | 56.17 (+7.40%) | CIDEr-D |
| M10 | Captioning | YC2C | - | - | M | 121.89 | 134.80 | 141.17 (+4.73%) | CIDEr-D |
| M11 | Captioning | VATEX-EN-C | - | CS | CS | 58.09 | 56.36 | 60.17 (+6.76%) | CIDEr-D |

Table 3: Impact of visual feature adaptation. The HERO model, pre-trained with the ResNet+SlowFast (RS) feature, is fine-tuned with eight different visual features: ResNet+SlowFast (RS), ResNet+MIL-NCE (RM), Clip-ViT+SlowFast (CS), Clip-ViT+MIL-NCE (CM), ResNet (R), SlowFast (S), Clip-ViT (C), and MIL-NCE (M). For each task, the best and worst transferred features are annotated with their difference from RS in parentheses.

| # | Dataset | RS | RM | CS | CM | R | S | C | M |
|---|---|---|---|---|---|---|---|---|---|
| P1 | TVR | 11.77 | 8.75 (-3.02) | 12.48 (+0.71) | 11.15 | 8.90 | 10.83 | 11.29 | 9.89 |
| P2 | How2R | 5.74 | 5.46 | 5.84 (+0.11) | 5.64 | 5.07 (-0.67) | 5.20 | 5.69 | 5.82 |
| P3 | YC2R | 49.87 | 48.22 | 48.92 | 49.35 | 47.92 | 47.60 (-2.27) | 48.52 | 54.46 (+4.59) |
| P4 | VATEX-EN-R | 61.46 | 46.21 | 66.96 (+5.50) | 64.29 | 45.57 (-15.89) | 58.76 | 63.72 | 51.22 |
| P5 | TVQA | 74.39 | 72.10 (-2.29) | 74.32 (-0.07) | 72.23 | 73.30 | 73.71 | 73.19 | 73.19 |
| P6 | How2QA | 74.74 | 72.52 (-2.22) | 74.03 (-0.71) | 73.03 | 73.13 | 73.52 | 73.16 | 73.48 |
| P7 | VIOLIN | 67.68 | 67.26 | 67.67 | 67.60 | 67.00 | 67.37 | 67.67 (-0.01) | 66.78 (-0.68) |
| P8 | VLEP | 67.24 | 65.35 (-1.89) | 66.10 | 65.76 | 66.07 | 66.51 | 66.58 (-0.66) | 66.05 |
| P9 | TVC | 49.86 | 46.19 (-3.67) | 51.24 | 50.62 | 47.22 | 48.96 | 51.34 (+1.48) | 48.93 |
| P10 | YC2C | 121.80 | 112.20 (-9.60) | 115.00 | 118.30 | 112.80 | 113.00 | 121.00 | 134.80 (+13.00) |
| P11 | VATEX-EN-C | 51.39 | 38.68 | 55.42 (+4.03) | 50.85 | 37.61 (-13.78) | 49.70 | 50.93 | 41.36 |

Bayesian optimization for retrieval.

Given $N_r$ retrieval models, we obtain a list of similarity score matrices $\mathcal{S}_{\text{ret}}=\{\bm{S}_1, \bm{S}_2, \ldots, \bm{S}_{N_r}\}$, where $\bm{S}_i\in\mathbb{R}^{N_q\times N_g}$ denotes the similarity matrix between the $N_q$ text queries and $N_g$ candidate videos for the $i^{\text{th}}$ retrieval model (for Video Corpus Moment Retrieval (VCMR), we apply non-maximum suppression (NMS) with an IoU threshold of 0.7 to retrieve at most 100 candidates). Then, we find a set of weights $\mathcal{W}_{\text{ret}}=\{w_1, w_2, \ldots, w_{N_r}\}$ to identify the best combination of retrieval models. Finally, the ensembled score matrix $\bm{S}_{\text{ENS}}$ is given by

$$\bm{S}_{\text{ENS}} = w_1\bm{S}_1 + w_2\bm{S}_2 + \cdots + w_{N_r}\bm{S}_{N_r}, \qquad (1)$$

where $\sum_n w_n = 1$. To find the optimal values for $\mathcal{W}_{\text{ret}}$, we resort to hyper-parameter tuning via Bayesian optimization (https://github.com/hyperopt/hyperopt). Specifically, given that $\mathcal{S}_{\text{ret}}$ is obtained from the predictions of $N_r$ models, we define an $N_r$-dimensional uniformly distributed search space. The weights are then sampled and fed into the Tree-structured Parzen Estimator algorithm [1] to determine the optimal values for $\mathcal{W}_{\text{ret}}$ under the objective function. We use the mean recall, i.e., (R@1+R@5+R@10)/3, as the objective function and iterate this optimization for 300 steps.
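The sketch below shows one way to run this weight search with hyperopt, assuming precomputed validation score matrices `score_mats` (one $N_q \times N_g$ array per model) and ground-truth indices `gt_index`; it simplifies retrieval to a single ranking per query and omits the VCMR-specific NMS step.

```python
# Sketch of searching ensemble weights with the Tree-structured Parzen Estimator.
# Mean recall = (R@1 + R@5 + R@10) / 3 is maximized by minimizing its negative.
import numpy as np
from hyperopt import fmin, tpe, hp

def mean_recall(scores, gt_index, ks=(1, 5, 10)):
    ranks = (-scores).argsort(axis=1)                       # candidate videos sorted by score
    hit_rank = (ranks == gt_index[:, None]).argmax(axis=1)  # 0-indexed rank of the correct video
    return float(np.mean([(hit_rank < k).mean() for k in ks]))

def ensemble_objective(weights, score_mats, gt_index):
    w = np.array(weights)
    w = w / w.sum()                                          # enforce the sum-to-one constraint
    fused = sum(wi * s for wi, s in zip(w, score_mats))      # weighted sum as in Eq. (1)
    return -mean_recall(fused, gt_index)                     # hyperopt minimizes the objective

def search_weights(score_mats, gt_index, max_evals=300):
    n = len(score_mats)
    space = [hp.uniform(f"w{i}", 0.0, 1.0) for i in range(n)]   # uniform search space per model
    best = fmin(fn=lambda ws: ensemble_objective(ws, score_mats, gt_index),
                space=space, algo=tpe.suggest, max_evals=max_evals)
    w = np.array([best[f"w{i}"] for i in range(n)])
    return w / w.sum()
```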

Training single-layered FC for QA.

Given $N_{\text{qa}}$ QA models, our objective is to find the best set of weights $\mathcal{W}_{\text{qa}}=\{w_1, w_2, \ldots, w_{N_{\text{qa}}}\}$. The QA task in VALUE is essentially a classification problem that predicts an answer label $c \in \{c_1, c_2, \ldots, c_{N_{\text{ans}}}\}$, where $N_{\text{ans}}$ is the number of candidate answers. In this formulation, the QA models output confidence scores $\mathcal{S}_{\text{qa}}=\{s_1, s_2, \ldots, s_{N_{\text{qa}}}\}$, where $s_i\in\mathbb{R}^{m\times N_{\text{ans}}}$ is the score matrix containing the per-class confidence scores produced by the $i$-th model over the $m$ test examples. Instead of Bayesian optimization, we use a learning-based approach for QA: we cast the search for the optimal $\mathcal{W}_{\text{qa}}$ as learning a single fully-connected (FC) layer with no bias. First, we collect all model outputs $\mathcal{S}_{\text{qa}}$ and batchify them into an input $X\in\mathbb{R}^{B\times N_{\text{qa}}\times N_{\text{ans}}}$, where $B$ is the batch size. We then train a linear layer ($\phi$; nn.Linear($N_{\text{qa}}$, 1), bias=False) to obtain the output $\bar{X}=\phi(X)$ with $\bar{X}\in\mathbb{R}^{B\times N_{\text{ans}}}$, and apply the cross-entropy loss with the ground-truth labels. The strength of this learning-based approach is that it converges quickly regardless of how large $N_{\text{qa}}$ is; the drawback is that it requires careful hyperparameter tuning for training.
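A minimal PyTorch sketch of this learning-based ensemble is given below, assuming the stacked validation scores $X$ and labels $y$ are already collected; the optimizer, learning rate, and number of epochs are illustrative placeholders rather than the exact values used in our submission.

```python
# Learning a bias-free FC layer that mixes per-model confidence scores.
import torch
import torch.nn as nn

def train_qa_ensemble(X, y, epochs=200, lr=1e-2):
    # X: (B, N_qa, N_ans) stacked confidence scores, y: (B,) ground-truth answer indices
    n_models = X.shape[1]
    phi = nn.Linear(n_models, 1, bias=False)
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        # move the model axis last so the FC mixes models: (B, N_ans, N_qa) -> (B, N_ans)
        logits = phi(X.transpose(1, 2)).squeeze(-1)
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()
    return phi

# Toy usage: 3 models, 5 answer candidates, 8 validation examples.
X = torch.randn(8, 3, 5)
y = torch.randint(0, 5, (8,))
phi = train_qa_ensemble(X, y)
print(phi.weight.detach())   # learned per-model mixing weights
```

The learned weights in `phi.weight` play the same role as $\mathcal{W}_{\text{qa}}$, i.e., a data-driven weighted sum over model confidences.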

Table 4: Analysis on the effect of visual concepts.

| # | Macro-Task | Domain | Dataset (Task) | Val. | Δ (%) by dataset | Avg. Δ (%) by macro-task | Avg. Δ (%) by domain |
|---|---|---|---|---|---|---|---|
| G1 | Retrieval | TV | TVR | 11.77 → 11.69 | -0.68 | -0.51 | +0.49 |
| G2 | Retrieval | HowTo100M | How2R | 5.73 → 5.61 | -2.09 | -0.51 | +0.54 |
| G3 | Retrieval | YouCook2 | YC2R | 49.87 → 49.18 | -1.38 | -0.51 | -1.47 |
| G4 | Retrieval | VATEX | VATEX-EN-R | 61.46 → 62.75 | +2.10 | -0.51 | +2.10 |
| G5 | QA | TV | TVQA | 74.39 → 75.63 | +1.67 | +0.89 | +0.49 |
| G6 | QA | HowTo100M | How2QA | 74.74 → 77.11 | +3.17 | +0.89 | +0.54 |
| G7 | QA | VIOLIN | VIOLIN | 67.60 → 67.37 | -0.34 | +0.89 | -0.34 |
| G8 | QA | VLEP | VLEP | 67.24 → 66.62 | -0.92 | +0.89 | -0.92 |
| G9 | Captioning | TV | TVC | 49.86 → 50.10 | +0.48 | +0.34 | +0.49 |
| G10 | Captioning | YouCook2 | YC2C | 121.90 → 120.00 | -1.56 | +0.34 | -1.47 |
| G11 | Captioning | VATEX | VATEX-EN-C | 51.39 → 52.47 | +2.10 | +0.34 | +2.10 |

Table 5: Leaderboard score transition at each trial. The asterisk mark (*) indicates that the prediction results for the submission were mistakenly calculated due to an internal bug in our code.

| # | Ensemble | V-Concept | AT | ST | Retrieval | QA | Captioning | Meta-Ave |
|---|---|---|---|---|---|---|---|---|
| T1 | | | | | 34.89 | 67.38 | 35.29 | 60.63 |
| T2 | | | | | 35.29 | 72.21 | 85.95 | 62.53 |
| T3 | | | | | 35.05 | 72.89 | 85.95 | 62.69 |
| T4 | | | | | 35.02 | 73.01 | 85.95 | 62.72 |

Consensus-based ranking for captioning.

Given $N_c$ captioning models, we generate a set of captions $\mathcal{C}_{N_c}=\{c_1, c_2, \ldots, c_{N_c}\}$ for an input video, where each model generates its caption with a greedy decoding strategy. We then compute a consensus score $s_n$ for the caption $c_n$ from the $n^{\text{th}}$ captioning model, defined as its averaged similarity to all the other captions:

$$s_n = \frac{1}{|\mathcal{C}_{N_c}|-1} \sum_{c'\in\mathcal{C}_{N_c}\backslash\{c_n\}} \text{sim}(c_n, c'), \qquad (2)$$

where $\text{sim}(c_n, c')$ is the similarity between two captions $c_n$ and $c'$. We employ five sentence embedding models from the sentence-transformers library (https://github.com/UKPLab/sentence-transformers), namely paraphrase-mpnet-base-v2, stsb-mpnet-base-v2, paraphrase-MiniLM-L3-v2, paraphrase-multilingual-mpnet-base-v2, and paraphrase-TinyBERT-L6-v2, and use the averaged cosine similarity of their caption embeddings as the similarity function. Finally, the caption with the highest consensus score is chosen as our final output.
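The sketch below illustrates the consensus-based selection with the sentence-transformers package; for brevity, only two of the five embedding models are instantiated, and the candidate captions are toy examples.

```python
# Consensus-based caption selection following Eq. (2): similarities are averaged
# over embedding models and then over the other candidate captions.
import numpy as np
from sentence_transformers import SentenceTransformer, util

MODEL_NAMES = ["paraphrase-mpnet-base-v2", "paraphrase-MiniLM-L3-v2"]  # subset of the five used
embedders = [SentenceTransformer(name) for name in MODEL_NAMES]

def select_by_consensus(captions):
    """Return the candidate with the highest average similarity to all the others."""
    n = len(captions)
    sim = np.zeros((n, n))
    for model in embedders:
        emb = model.encode(captions, convert_to_tensor=True)
        sim += util.cos_sim(emb, emb).cpu().numpy() / len(embedders)  # average over models
    np.fill_diagonal(sim, 0.0)                 # exclude self-similarity
    consensus = sim.sum(axis=1) / (n - 1)      # Eq. (2): mean similarity to the other captions
    return captions[int(np.argmax(consensus))]

print(select_by_consensus([
    "add the chopped onions to the pan",
    "put the onion in the pan",
    "a man is singing on stage",
]))
```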

3 Experiment

This section discusses the impact of the single model optimization, fine-tuning with visual concepts on individual tasks, and the task-aware ensemble strategies.

3.1 Single Model Performance

As discussed in Section 2.2, we optimize the training configuration of a single model for each of the 11 tasks individually. The best configurations (i.e., usage of visual concepts, application of AT and ST, and choice of visual feature) and the corresponding best single model performance are summarized in Table 2. We observe the following. First, our optimized models outperform their VALUE baseline counterparts by large margins in most cases, while some show slightly lower scores (see M1, M3, M8, and M11). Second, the best combination of (1) visual feature adaptation during the fine-tuning stage and (2) fine-tuning scheme (i.e., ST or AT→ST) is highly dependent on both domain and task. In general, using the Clip-ViT+SlowFast (CS) visual feature with the AT→ST scheme shows outstanding results. On the other hand, in some cases a specific type of feature significantly outperforms the others. For example, we found that fine-tuning on the single task with the MIL-NCE feature performs best for the YouCook2 dataset (see M3 and M10); this is expected because the videos in YouCook2 and HowTo100M (used for pre-training MIL-NCE) share the same domain, implying the importance of in-domain pre-training. Lastly, the task-aware model ensemble provides a further performance gain on all tasks except VATEX-EN-R in Retrieval. In addition, Table 3 shows the impact of visual feature adaptation, which provides performance gains on the Retrieval and Captioning macro-tasks (see P1-4 and P9-11). The results from the two tables indicate the importance of configuration optimization for individual tasks.

3.2 Effect of Visual Concepts

As illustrated in Figure 3 and Section 2.3, we extract visual concepts from raw frames and leverage them alongside the corresponding subtitles as auxiliary information. Since many videos failed to download, only a limited portion of the videos in each dataset domain (i.e., YouCook2 (89%), VIOLIN (15%), VATEX (76%), HowTo100M (87%), and TV (99%)) is used for the extraction. For training samples from the missing videos, we use the subtitle only, without visual concepts. Table 4 analyzes the effect of using visual concepts during fine-tuning. We observe performance enhancement on five out of eleven tasks (see G4-6, G9, and G11) compared to not employing visual concepts. We found that fine-tuning with visual concepts is especially effective for VATEX videos (both retrieval and captioning; G4 and G11) as well as for the QA macro-task (G5-8). On the other hand, fine-tuning with visual concepts did not improve performance on the Retrieval macro-task on average. Although the effectiveness of visual concepts depends on many experimental variables (e.g., domains, tasks, the number of visual concepts), it helps to increase the variance between models, which is known to be an essential factor for a compelling model ensemble.

3.3 Model Ensemble

Our submission scores in Table 5 are obtained by model ensemble. Given fine-tuned models trained with various training configurations, we choose the top $K$ models sorted by validation score for the ensemble. Note that we vary $K$ across macro-tasks because we found a trade-off between computational cost and ensemble performance; we set $K$ to 8, 16, and 32 for the Captioning, QA, and Retrieval macro-tasks, respectively. Table 2 shows the validation scores of our ensemble models. In most cases, the model ensemble yields large score improvements (max. +39.55%, avg. +7.45%). The Retrieval macro-task benefits the most from the ensemble, followed by Captioning and QA.

4 Conclusion

We described our winning strategies for the VALUE challenge 2021. To address the challenge, we proposed three key ingredients: 1) single model optimization, 2) transfer learning with visual concepts, and 3) task-aware ensemble. Through the proposed strategies, the score is improved step by step, as shown through the extensive experiments in this report. Our final submission is an ensemble of the $K$ best-performing single models on the validation set, trained with various training configurations. Based on our approach, we achieved first place in the VALUE and QA phases of the competition.

References

  • [1] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NeurIPS, 2011.
  • [2] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [4] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
  • [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [6] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696, 2018.
  • [7] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. What is more likely to happen next? video-and-language future event prediction. arXiv preprint arXiv:2010.07999, 2020.
  • [8] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.
  • [9] Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. Value: A multi-task benchmark for video-and-language understanding evaluation. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
  • [10] Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, and Jingjing Liu. Violin: A large-scale dataset for video-and-language inference. In CVPR, 2020.
  • [11] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020.
  • [12] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
  • [13] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • [14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • [15] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
  • [16] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, 2016.
  • [17] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR, 2017.
  • [18] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In CVPR, 2021.
  • [19] Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web instructional videos. In AAAI, 2018.