
1 Columbia University    2 Microsoft Research    3 University of California, Los Angeles

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

Zhecan Wang1,*    Noel Codella2    Yen-Chun Chen2    Luowei Zhou2    Jianwei Yang2    Xiyang Dai2    Bin Xiao2    Haoxuan You1    Kai-Wei Chang3    Shih-Fu Chang1    Lu Yuan2
*Equal contribution.
Abstract

Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach an order of 10 million samples, the labor cost to scale them further is prohibitive. Conversely, unimodal encoders are pretrained with simpler, less cost-prohibitive annotations, reaching scales of hundreds of millions to billions of samples. As a result, unimodal encoders have achieved state-of-the-art (SOTA) results on many downstream tasks. However, challenges remain when applying them to VL tasks: the pretraining data is not optimal for cross-modal architectures and requires heavy computational resources, and unimodal architectures lack the cross-modal interactions that have demonstrated significant benefits for VL tasks. Therefore, how to best leverage pretrained unimodal encoders for VL tasks is still an area of active research. In this work, we propose a method to leverage unimodal vision and text encoders for VL tasks that augments existing VL approaches while preserving computational complexity. Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained unimodal encoders to cross-modal VL encoders. In addition, to better capture nuanced impacts on VL task performance, we introduce an evaluation protocol that covers Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data constraints and conditions of domain shift. Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data. Finally, MAD outperforms concurrent works utilizing the pretrained vision encoder from CLIP. Code will be made available.

1 Introduction

Figure 1: Comparison of pretraining data sizes across frameworks. The data used to pretrain unimodal encoder models is generally much larger than that used for vision-language models.

Visual Commonsense Reasoning (VCR) [63], Stanford Natural Language Inference Visual Entailment (SNLI-VE) [58], and Visual Question Answering (VQA) [2] are all VL tasks requiring solutions that effectively bridge the two modalities, with mechanisms of logic and prior knowledge serving as the link between them. For example, in VCR, logical reasoning capability and common knowledge are useful in answering questions about images, especially when the visual information is ambiguous, which makes connecting questions with their correct answers difficult. Considering this, most recent works that have achieved SOTA capture the logic and prior knowledge with various cross-modal transformer architectures that jointly model both modalities in a unified architecture [9, 47, 17, 20, 21, 52, 56]. These models are typically pretrained on large image-caption datasets [43, 30]. These curated datasets are collected carefully by human annotators to ensure the text consists of visually descriptive and syntactically correct sentences, and they typically do not exceed about 10M data points in size.

By contrast, obtaining training data for unimodal models requires significantly less effort and manual labor, although the data tends to have large domain gaps with respect to downstream VL tasks. For example, attention-based models such as BERT [12], GPT [37], and RoBERTa [31] for text, and BEiT [3] for images, can conveniently leverage self-supervision or weak supervision for unimodal pretraining on large corpora [43, 30]. Vision models like ViT [13] can also be obtained via unimodal supervised pretraining on large image datasets such as JFT-300M [48]. Furthermore, another line of research examines pretraining with paired noisy image-text data crawled from the web, which is also easier to collect than curated image-caption pairs. This pretraining is typically combined with a contrastive learning objective to align both modalities into a shared embedding space. In these frameworks, the vision and text streams are modality-specific encoders; there are no cross-modal fusion layers, only shallow connections via cosine distance. Thus, unlike conventional VL frameworks and similar to unimodal pretraining, contrastive pretraining also produces pretrained unimodal encoder models such as CLIP-V and CLIP-T. These works extend pretraining datasets to a more massive scale, such as CLIP [36] with 400 million samples, ALIGN [22] with 10 billion, and BASIC [35] with 100 billion (the largest so far).

These pretrained unimodal encoder models are generally pretrained with more data than pretrained VL models, as shown in Fig. 1 (the unit is a single data point: a data point can be an image in visual pretraining, or a phrase or sentence in text pretraining). Thus they produce generalized feature representations and can benefit a wide range of downstream tasks. For instance, CLIP-V and CLIP-T show strong performance on zero-shot classification and object detection tasks [67]. Due to their complementary model architectures and massive-scale pretraining, unimodal encoder models have the potential to benefit VL tasks. However, whether and how to leverage pretrained vision/text encoder models for VL tasks is still an area of active research. Previous methods extensively experimented with directly plugging pretrained text encoders into VL frameworks [57, 9, 47, 17, 20, 21, 52, 56]. Nevertheless, those frameworks all require redoing the pretraining steps to align the representations. Another recent method [44] adapts the pretrained vision model from CLIP [36], but it also has to undergo an additional stage of pretraining on millions of samples for adaptation. This approach not only increases computational complexity but also requires sufficient additional data for adaptation, which is impractical in real-world scenarios with limited data availability and may further increase vulnerability to domain shift in the target task data.

Current research continues to push the performance of pretrained unimodal encoder models by increasing pretraining data size. A natural research question therefore arises: given the best pretrained vision and text encoders, what is an efficient way to integrate them into a pretrained VL model without redoing pretraining steps or impacting inference complexity?

In this work, we propose a flexible approach to leverage pretrained vision and text models for VL tasks that relies only on the task finetuning step. Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills knowledge from pretrained unimodal models to VL task-specific cross-modal architectures per data instance, changing both the distillation weights and the distillation targets depending on the behavior of the unimodal encoders and the cross-modal architecture on that data point. Our framework is, to our knowledge, the first to propose modularized multimodal distillation, allowing the teacher unimodal encoders to come from different pretraining frameworks. Our method does not require computationally expensive pretraining steps and maintains the inference complexity of the task model. For evaluation, we propose a new protocol involving multiple VL tasks (VCR, SNLI-VE, and VQA) under a variety of data availability constraints, including zero-shot, low-shot, and fully-supervised settings. For VCR, we additionally use established ways of perturbing the characteristics of the evaluation set to measure model performance under domain shift. We compare against a broad spectrum of baselines fusing pretrained vision and text models and demonstrate superior performance, achieving state-of-the-art on VCR Q2A among single models pretrained with image-text data, and competitive results on SNLI-VE and VQA.

2 Related Work

Vision and Language Pretraining: Since the success of text pretraining with attention-based models such as BERT [12] and GPT [37], pretrained text encoders have been broadly employed across tasks. VL models such as LXMERT [52], VL-BERT [47], UNITER [9], VILLA [17], and others [20, 21, 68, 25, 62, 28] followed by integrating pretrained text models into their frameworks, which requires additional VL pretraining (VLP) with image-caption data. All of these pretraining steps utilize additional datasets on the order of 10M samples for various pretraining objectives, including Masked Language Modeling (MLM), Image-Text Matching (ITM), etc. Similarly, large-scale pretrained vision models can be obtained via vision pretraining frameworks such as ViT [13] on image datasets including JFT-300M [48]. To utilize large-scale pretrained vision models, a recent work, CLIP-ViL [44], experiments with adapting the vision encoder from CLIP; this likewise requires redoing all the VLP steps. Both pretrained vision and text models can also be obtained via contrastive pretraining frameworks, which preserve modality independence [36, 22, 35]. These contrastive frameworks can be regarded as a shallow way to fuse pretrained vision and text encoders for VL tasks, relying heavily on the cosine distance between the output features of the two encoders. However, because they connect the pretrained vision and text models only through this shallow cosine measure rather than a pretrained VL model, they have not been shown to perform well on highly semantic VL tasks such as VCR.

Currently, there is a lack of a general framework for fusing large-scale pretrained vision and text models into a pretrained VL structure to improve downstream VL tasks while controlling computational complexity. Existing fusion methods that avoid impacting computational complexity, such as shallow contrastive frameworks, mostly focus on traditional VL tasks such as object classification and image captioning. To the best of our knowledge, no prior VLP works of this kind have studied the impact on generalization capability by assessing performance on highly semantic tasks, such as VCR, especially under both low-shot and domain-shifted scenarios, which are more reflective of the challenges encountered in practice.

In this work, we address several of these gaps. First, we propose an efficient distillation approach to leverage pretrained vision and text encoder models in a pretrained VL structure in a way that does not require additional pretraining and preserves inference complexity. Second, we study the impact of utilizing large-scale pretrained vision and text encoders such as CLIP-V, CLIP-T, RoBERTa, and ViT on VL tasks including VCR, SNLI-VE, and VQA under true zero-shot, low-shot, and domain-shifted scenarios.

Generalization of Question Answering: Visual question answering [63, 2] is among the most challenging VL tasks, due to large variation between samples and distribution discrepancies between training and testing datasets. This makes it difficult for models to perform well under zero-shot, low-shot, and domain-shifted settings. Despite the high accuracy recent models achieve [46, 17, 29, 66, 64], other works have begun to uncover the tendency of models to leverage spurious shortcut signals or to memorize mapping distributions [41, 23, 11, 27]. While prior works have begun to explore question answering under zero-shot [53, 34], low-shot [7, 4], and domain-shifted [11, 23, 18, 65, 1, 42, 38] settings, they are mostly limited to text-only question answering and low-level visual question answering tasks (e.g., VQA [2]). To the best of our knowledge, no prior work has explored true zero-shot and low-shot settings in highly complex visual question answering datasets such as VCR or SNLI-VE, and only one prior work [61] contributed to VCR with domain shift.

In this work, we conduct a thorough evaluation with many existing top-performing models on zero-shot, low-shot, and domain-shifted settings.

Knowledge Distillation: In conventional knowledge distillation [32, 60, 55, 33], a larger model serves as the teacher to a smaller student model. The goal is usually to obtain a computationally lighter and more efficient framework while maintaining similar or even higher accuracy [19, 10, 40, 24, 26, 49]. Recently, however, many models of modest complexity are pretrained on very large datasets, which yields more generalized feature representations. Distilling from such large-scale pretrained models into downstream domain-specific models, even ones with larger model complexity, therefore remains valuable: comparing only inference complexity ignores the large pretraining cost.

Tremendous progress has also been made in knowledge distillation with unimodal data. For instance, in vision, [54, 15] propose distillation for visual representation learning. In addition, remarkable advances have been made in knowledge distillation for language model compression [40, 24, 51]. A wide range of works supervise the student model by mimicking different components of the teacher, e.g., self-attention distributions, intermediate representations of transformer blocks, or last-layer features, to increase performance [59, 50, 8]. On the other hand, only one prior work explored knowledge distillation in VL transformers [14], and it largely follows conventional unimodal distillation, without differentiating between vision and language.

Further, in most of these scenarios, the feature components leveraged for the distillation loss are fixed, as is the weight of the distillation loss. Prior works on attention architectures focus only on fixed sequence-to-sequence distillation [49], and only one prior work introduced the idea of dynamically adjusting the distillation loss weight at the per-instance level [16].

In this work, our framework is, to our knowledge, the first to propose multimodal distillation that modulates the distillation between modalities, enabling flexible switching of teacher models for each modality. Our approach is also model-agnostic and independent of how the inference complexities of teacher and student compare. Experiments show that it is effective in all of these scenarios. We also leverage the idea of dynamic distillation weights and go one step further: we dynamically adjust not only the distillation weight for each instance but also which token features within every sequence are distilled.

Figure 2: Structure diagram of Multimodal Distillation (MD) and Multimodal Adaptive Distillation (MAD). MAD builds on MD; when both Confidence Weighting and Token Selection are removed, MAD reduces to the MD structure.

3 Methods

In this work, we explore many methods of utilizing large-scale pretrained unimodal encoder models to help downstream VL tasks, including direct finetuning, adding adapters on top of the unimodal models, and several proposed forms of distillation. Among these, we find our proposed knowledge distillation to be the most effective, achieving the best performance without having to redo costly pretraining and without impacting inference complexity. We introduce these methods in the following order: first, a naive approach, Multimodal Distillation (MD); then, the improved Multimodal Adaptive Distillation (MAD).

3.1 Multimodal Distillation

Following [40, 24, 26, 49], we utilize token features from both the vision and language streams to represent the corresponding modality sequence information. This allows our method to conduct sequence-to-sequence distillation.

The visual $img$ tokens of both the teacher and student models (the $img$ token refers to the first token in the visual token sequence), together with the teacher's text $eos$ token and the student's text $cls$ token, are compared via an L1 measure. The final loss is the weighted distillation loss $L_d$ summed with the original task loss $L_t$ for the specific downstream task. Formally,

$L_{final\text{-}MD} = L_t + w \cdot L_d$    (1)

where the distillation loss $L_d$ is a sum of the distillation losses of the two modalities:

$L_d = L_{d,v} + L_{d,t}$    (2)

where $v$ and $t$ refer to the vision and text branches, respectively.

$L_{d,v} = \left\| f_{t,v,img}(i_j) - f_{s,v,img}(\partial(i_j)) \right\|_1 = \left\| V_t - V_s \right\|_1$    (3)

where $f_{t,v,img}$ and $f_{s,v,img}$ refer to the extraction of the $img$-token feature from the teacher vision encoder and the student model, respectively. $V_t$ and $V_s$ represent the extracted visual token features from the teacher and student models, $\partial$ represents the backbone detection network, and $i_j$ refers to the image input for instance $j$. For the text modality, we have

$L_{d,t} = \left\| f_{t,t,eos}(t_j) - f_{s,t,cls}(t_j) \right\|_1 = \left\| T_t - T_s \right\|_1$    (4)

where $f_{t,t,eos}$ refers to the extraction of the $eos$-token feature from the teacher text model, and $f_{s,t,cls}$ to the extraction of the $cls$-token feature from the student model. $T_t$ and $T_s$ represent the extracted text token features from the teacher and student models, and $t_j$ refers to the text input for instance $j$.
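To make the objective concrete, below is a minimal PyTorch sketch of the MD loss under the notation above. It assumes the teacher and student token features have already been extracted (and projected to a common dimension if necessary); the default weight of 0.05 follows the ablation in Tab. S3, and the mean-reduced L1 used here differs from the norm in Eqs. 3-4 only by a constant rescaling of $w$.

```python
import torch.nn.functional as F

def md_loss(V_t, V_s, T_t, T_s, task_loss, w=0.05):
    """Multimodal Distillation (Eqs. 1-4): L1 between teacher/student token features.

    V_t / V_s: teacher / student visual `img` token features, shape (B, D)
    T_t / T_s: teacher `eos` / student `cls` text token features, shape (B, D)
    task_loss: downstream task loss L_t
    w:         fixed distillation weight
    """
    L_dv = F.l1_loss(V_s, V_t.detach())  # Eq. (3); teacher features are frozen
    L_dt = F.l1_loss(T_s, T_t.detach())  # Eq. (4)
    L_d = L_dv + L_dt                    # Eq. (2)
    return task_loss + w * L_d           # Eq. (1)
```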

3.2 Multimodal Adaptive Distillation (MAD)

Building on MD, MAD consists of the following three components: Token Selection using an unsupervised language prior, Confidence Weighted Distillation, and Adaptive Finetuning with Distilled Knowledge.

Token Selection: Typically, distillation methods [40, 24, 26, 49] distill knowledge from the teacher transformer to the student transformer in a sequence-to-sequence fashion using fixed token components of the networks. However, the most semantically relevant tokens can change per instance. Without a dynamic process to determine which components of a network should be distilled, the student model runs a higher risk of learning spurious signals from trivial tokens. To address this, we present a hybrid method of selecting tokens for distillation.

Given a text sequence $t_j = \{w_0 \ldots w_z\}$, where $z$ is the length of the sequence, we apply a pretrained Token Selection Module (TSM) $f_{tsm}(i_j, t_j)$ to discriminate the semantically meaningful tokens. The TSM generates a score $s_l$ for each token $w_l$, obtaining a distribution $S_j = \{s_0 \ldots s_z\} = f_{tsm}(i_j, \{w_0 \ldots w_z\})$. In our implementation of this selection module, two sets of weights are computed via two different approaches and summed together:

$S_j = \frac{S_{vr}}{\|S_{vr}\|_1} + \frac{S_{si}}{\|S_{si}\|_1}$    (5)

where $S_{vr} = \{s_{vr,0} \ldots s_{vr,z}\} = \{\cos\langle f_{t,v}(i_j), f_{s,t}(w_l)\rangle\}$ represents a scoring between the visual representation and the text token representations. In this manner, $S_{vr}$ is the score measuring the visual relevance of each token $w_l$, $l \in [0, \ldots, z]$. Then, $S_{si} = f_{ke}(t_j)$, where $S_{si}$ represents the semantic and syntactic importance of each token related to the context $t_j$, computed with a keyword extractor. In practice, we apply a pretrained keyword extractor [5], $f_{ke}$, with $n$-grams.

We then rank the text tokens based on $S_j$ and select the $m$ tokens with the highest scores, $t'_j$, where $|t'_j| = m$. The corresponding features of both the teacher and student models are compared with an L1 measure to calculate their difference:

$L'_{dt} = \left\| f_{t,t}(t'_j) - f_{s,t}(t'_j) \right\|_1$    (6)

Finally, the resulting loss $L'_{dt}$ is added to the final loss $L_{final\text{-}MAD}$ with the proportionally updated distillation weight $w'$:

$L_{final\text{-}MAD} = L_t + w' \cdot (L_{d,v} + L_{d,t} + L'_{dt}) = L_t + w' \cdot (L_d + L'_{dt})$    (7)
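A minimal sketch of Token Selection is shown below. Using the YAKE keyword extractor follows the citation [5], but the exact API calls, the score inversion, and the token matching are our assumptions; $m=2$ follows the ablation in Tab. S3.

```python
import torch
import torch.nn.functional as F
import yake  # keyword extractor of [5]; exact API usage here is an assumption

def token_selection_scores(img_feat, token_feats, tokens):
    """Compute S_j = S_vr/||S_vr||_1 + S_si/||S_si||_1 (Eq. 5) for one text sequence.

    img_feat:    teacher visual feature f_{t,v}(i_j), shape (D,)
    token_feats: student text token features f_{s,t}(w_l), shape (z, D)
    tokens:      the z tokens as strings
    """
    # S_vr: visual relevance, cosine between the image feature and each token feature.
    s_vr = F.cosine_similarity(img_feat.unsqueeze(0), token_feats, dim=-1)
    # S_si: semantic/syntactic importance from the keyword extractor
    # (YAKE assigns LOWER scores to more important keywords, hence the inversion).
    kw_scores = dict(yake.KeywordExtractor(n=1).extract_keywords(" ".join(tokens)))
    s_si = torch.tensor([1.0 / (1e-6 + kw_scores.get(t.lower(), 1.0)) for t in tokens])
    return s_vr / s_vr.abs().sum() + s_si / s_si.abs().sum()

def selected_token_loss(scores, teacher_tok_feats, student_tok_feats, m=2):
    """L'_dt (Eq. 6): L1 over the m highest-scoring tokens t'_j."""
    idx = scores.topk(m).indices
    return F.l1_loss(student_tok_feats[idx], teacher_tok_feats[idx].detach())
```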

Confidence Weighting (CW): CLIP [36] benefits from a large amount of paired language-image training data. While this broad prior knowledge is likely helpful for VL tasks, the degree to which this knowledge is either helpful or potentially hurtful likely changes on an instance level. Given this, we design an approach to toggle the distillation objective depending on the relative confidence of the CLIP teacher and the specific student architecture. To do this, we define the ratio $r$ between the maximum confidence scores of the CLIP teacher and the base student model:

$r = \frac{f_{argmax}(\sigma(L_j^c))}{f_{argmax}(\sigma(L_j^b))}$    (8)

where $L_j^c$ represents the logit vector from CLIP and $L_j^b$ the logit vector from the base model. Finally, the new adaptive weight $w_r$ is defined as:

$w_r = \begin{cases} 0, & \text{if } r \leq 1 \\ w', & \text{if } r > 1 \end{cases}$    (9)

where $w_r$ replaces the distillation weight in Eqs. 1 and 7. When the ratio is above 1, CLIP is confident in its prediction, so the distillation weight $w'$ is applied; otherwise, the distillation weight is set to 0 to prevent interference from CLIP.
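A sketch of the per-instance gating in Eqs. 8-9 is given below; the logits are assumed to score the same candidate answers for both models.

```python
import torch

def confidence_weight(clip_logits, student_logits, w_prime):
    """Return w_r (Eq. 9): distill only when CLIP is relatively more confident.

    clip_logits / student_logits: answer-choice logits for one instance, shape (num_choices,)
    """
    r = torch.softmax(clip_logits, -1).max() / torch.softmax(student_logits, -1).max()  # Eq. (8)
    return w_prime if r > 1 else 0.0
```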

Adaptive Finetuning (AF): After large-scale pretraining, previous work [9] first trains V+L models on the downstream datasets with a set of auxiliary tasks, including Masked Language Modeling (MLM), Image-Text Matching (ITM), etc., to better integrate the modalities. Training then concludes with direct finetuning on the downstream target task only. We denote the corresponding loss of this set of auxiliary tasks as $L_{AF}$ (for details, please refer to [9]). Inspired by this, we propose a two-stage finetuning strategy: we first finetune the base model with $L_{AF}$ on the full downstream data, and then conduct the final finetuning with $L_{final\text{-}MAD}$. In this way, Adaptive Finetuning aligns the vision and language modality features beforehand, helping the multimodal distilled knowledge take effect.

$\text{Full Training Steps} = \begin{cases} \text{1st step:} & \text{finetune with } L_{AF} \\ \text{2nd step:} & \text{finetune with } L_{final\text{-}MAD} \end{cases}$    (10)
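The two-stage schedule of Eq. 10 reduces to a simple training loop; `adaptive_finetuning_loss` (the auxiliary MLM/ITM objectives of [9]) and `mad_final_loss` ($L_{final\text{-}MAD}$ with the $w_r$ gating) are hypothetical helpers standing in for the respective losses.

```python
def train_two_stage(student, loader, optimizer, epochs_af, epochs_mad):
    """Adaptive Finetuning schedule (Eq. 10): auxiliary objectives first, MAD second."""
    # Stage 1: finetune with L_AF (MLM, ITM, ...) on the full downstream data.
    for _ in range(epochs_af):
        for batch in loader:
            loss = adaptive_finetuning_loss(student, batch)  # placeholder for L_AF
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Stage 2: finetune with the full MAD objective.
    for _ in range(epochs_mad):
        for batch in loader:
            loss = mad_final_loss(student, batch)  # placeholder for L_final-MAD
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```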

3.3 Teacher Models

We utilize popular pretrained vision and text encoders: CLIP-V, CLIP-T, ViT, and RoBERTa. Our distillation framework is a generalized, modularized multimodal distillation in which the teacher model of each modality can differ. This leads to two essential combinations in our experiments: the pretrained vision and text encoders come from the same pretraining framework, or from different ones.

3.4 Student Models

Several recent top-performing pretrained VL models for highly-semantic VL tasks are selected as students, including UNITER  [9], VL-BERT  [47], and VILLA  [17]. All of these models represent variations of multi-modal architectures, using different portfolios of objective functions and image-caption datasets for pretraining.

4 Datasets and Evaluations

To evaluate our methods and demonstrate their advantages, we propose an evaluation protocol that spans a variety of data availability constraints, including zero-shot, low-shot, domain-shifted, and fully-supervised settings. We evaluate our methods on three commonly used highly-semantic VL benchmarks: VCR, SNLI-VE, and VQA.

4.1 Visual Commonsense Reasoning (VCR)

The VCR benchmark presents images along with a paired question, a set of candidate answers, and a set of candidate rationales [63]. The dataset includes 290k questions referencing 110k unique visual scenes. The questions are grouped into 7 categories based on patterns in the questions. Please see the supplementary material for a full list.

Zero-Shot: No training data is used. The pretrained model is directly employed to produce a matching between the image-question pair and each candidate answer. Answers are selected according to which candidate produces the best match under the model's matching measure.
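As an illustration of this matching, the sketch below scores each candidate answer with the open-source CLIP package by cosine similarity between the image and a question+answer prompt (the "IQA" prompt of Tab. S2); the paper's exact prompting and matching measure may differ.

```python
import torch
import clip  # OpenAI CLIP package
from PIL import Image

def zero_shot_vcr_answer(image_path, question, answers, device="cpu"):
    """Pick the candidate answer whose text embedding best matches the image."""
    model, preprocess = clip.load("ViT-B/16", device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize([f"{question} {a}" for a in answers], truncate=True).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(prompts)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        scores = (img_f @ txt_f.T).squeeze(0)  # cosine similarity per candidate
    return int(scores.argmax())
```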

Low-Shot: In the low-shot setting, we use 2 training set partitions of varying sizes. Since VCR has 7 question categories, we either (1) select 100 examples per category, totaling 700 pairs (0.3% of the entire dataset), or (2) 1,000 examples per category, totaling 7,000 pairs (3%). Each experiment is run more than 4 times.
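A sketch of this per-category subsampling (field names are assumptions):

```python
import random
from collections import defaultdict

def sample_low_shot(examples, per_category=100, seed=0):
    """Draw `per_category` VCR examples from each of the 7 question categories.

    `examples`: list of dicts with a 'category' field (an assumed layout);
    per_category=100 yields ~700 pairs (0.3% of VCR), 1000 yields ~3%.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in examples:
        by_cat[ex["category"]].append(ex)
    subset = []
    for items in by_cat.values():
        subset.extend(rng.sample(items, min(per_category, len(items))))
    return subset
```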

Standard Evaluation: In this evaluation setting we follow the standard protocol in the benchmark.

Shortcut Mitigated (SM): We include a prior evaluation configuration [61] that focuses on mitigating shortcuts between questions and answers. Shortcuts are shallow signals models can learn to recognize; learning them may allow models to link questions to correct answers without a deep understanding of the content. We refer to this setting as “Shortcut Mitigated” (SM).

4.2 Visual Entailment (SNLI-VE)

The Stanford Natural Language Inference Visual Entailment (SNLI-VE) task [58] presents an image as the premise, paired with a text hypothesis. The goal is to predict whether the image entails or contradicts the hypothesis, or whether neither is the case (neutral).

Zero-Shot: We first extract features from the pretrained models for the given image premise and text hypothesis. The features are then directly compared via cosine distance to produce a similarity between the image and the text. The challenge is determining which similarity values should constitute entailment, contradiction, or neither. To accomplish this, without finetuning the model, we perform k-means clustering of the similarities on the validation set, with $k=3$, and use the resultant clusters as anchors for each output decision. Note that this approach, while a form of transductive learning on the validation set, uses no ground-truth labels and does not change any weights of the pretrained models. We apply this procedure to evaluate both pretrained VL encoders and pretrained VL models.
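A minimal sketch of this procedure for the unimodal-encoder case is shown below; mapping the clusters to labels by their mean similarity (highest = entailment, lowest = contradiction) is our assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def zero_shot_snli_ve(image_feats, text_feats):
    """Cluster image-hypothesis cosine similarities into 3 groups, with no labels or finetuning.

    image_feats, text_feats: arrays of shape (N, D) from the frozen pretrained encoders.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1).reshape(-1, 1)       # cosine similarity per pair
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sims)
    order = np.argsort(km.cluster_centers_.ravel())     # low -> high mean similarity
    names = {int(order[0]): "contradiction", int(order[1]): "neutral", int(order[2]): "entailment"}
    return [names[int(c)] for c in km.labels_]
```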

Low-Shot: There are only 3 relationship types between the image premise and the text hypothesis (entailment, neutral, and contradiction). Also, unlike VCR, an image in SNLI-VE can pair with around 5 hypotheses, so we use image-based selection. We again have two settings: (1) 100 random images per class, 300 images in total, paired with around 1,500 hypotheses; and (2) 1,000 random images per class, 3,000 images in total, paired with around 15,000 hypotheses. These correspond to 0.3% and 3% of SNLI-VE, respectively. Each experiment is run more than 4 times.

4.3 Visual Question Answering (VQA)

Different from VCR and SNLI-VE, for every image-question pair in VQA [2], question-specific multiple choices are not provided. Instead, the global set of all possible answer choices for all questions is provided (more than 3,000). The challenge is then to select the correct answer from this set for the given image-question pair.

Zero-Shot: Compared with VCR and SNLI-VE, zero-shot VQA is more challenging due to the large number of answer choices (refer to the supplement).

Low-Shot: There is no clear categorization of VQA questions. Based on our analysis of the first n-gram words of the questions, we group them into 8 types in total. In VQA, an image is also paired with several questions (5.4 on average), so we again rely on image-based sampling, with two settings: 100 random images per question type and 1,000 random images per question type. After collecting up to 2 paired questions per selected image, we obtain two low-shot sets: one with around 1,600 questions and one with around 16,000 questions, corresponding to 0.3% and 3% of VQA, respectively. Each experiment is run more than 4 times.

| Base (Student) Model | Method | VE | TE | Std 0% | Std 0.3% (100 SP/C) | Std 3% (1000 SP/C) | Std 100% | SM 0% | SM 0.3% (100 SP/C) | SM 3% (1000 SP/C) | SM 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Only Finetune | Direct Finetune | CLIP-V | CLIP-T | 54.82 | 38.23 | 38.10 | 54.23 | 58.30 | 34.85 | 39.06 | 52.23 |
| Only Finetune | Adapters | CLIP-V | CLIP-T | | 34.41 | 36.34 | 36.42 | | 33.35 | 35.02 | 35.48 |
| VL-BERT | Baseline | - | - | 23.37 | 30.85 | 53.48 | 75.53 | 23.24 | 26.37 | 49.27 | 71.13 |
| VL-BERT | MD★ | CLIP-V | CLIP-T | | 36.78 | 55.91 | 76.37 | | 34.93 | 52.06 | 73.29 |
| VL-BERT | MAD★ | CLIP-V | CLIP-T | 26.80 | 40.43 | 58.98 | 77.61 | 26.31 | 39.27 | 54.88 | 74.55 |
| UNITER | Baseline | - | - | 24.78 | 31.43 | 54.24 | 76.67 | 24.21 | 28.43 | 52.72 | 73.84 |
| UNITER | MD★ | CLIP-V | CLIP-T | | 39.43 | 57.64 | 76.80 | | 38.21 | 54.58 | 74.01 |
| UNITER | MAD★ | CLIP-V | CLIP-T | 26.78 | 42.23 | 60.88 | 77.05 | 26.49 | 41.83 | 54.64 | 74.24 |
| VILLA | Baseline | - | - | | 34.84 | 57.01 | 78.27 | | 29.41 | 54.15 | 75.43 |
| VILLA | MAD★ | CLIP-V | CLIP-T | | 42.95 | 60.93 | 78.83 | | 41.97 | 55.20 | 76.01 |
| VILLA | MAD★ | CLIP-V | RoBERTa | | 43.11 | 61.49 | 78.91 | | 41.98 | 56.85 | 76.32 |
| VILLA | MAD★ | ViT | CLIP-T | | 40.73 | 58.38 | 78.59 | | 39.44 | 54.29 | 75.35 |
| Re-Pretrain | CLIP-ViLp | CLIP-V | | | 34.63 | 53.54 | 68.36 | | 33.41 | 52.44 | 66.83 |

Table 1: VCR dataset results, including approaches for direct finetuning, adapters, and distillation to various student architectures. ★ marks our methods. Although we avoid extra pretraining, we include a recent top-performing method that utilizes the CLIP vision encoder with additional pretraining (CLIP-ViLp) for comparison. Std = standard evaluation; SM = Shortcut Mitigated; SP/C = sampled image-question-answer triplets per question category (0-shot to full). Few-shot results are averaged over 4 runs; refer to the supplement for standard deviations, details of the 0-shot experiments, and sizes of the student models. Baseline refers to the original methods without distillation; their results are based on our re-implementation of the student models with additional hyper-parameter tuning.

5 Results

5.1 Visual Commonsense Reasoning (VCR)

Results on the VCR dataset for Q→A, across several data availability and domain-shift constraints, are shown in Tab. 1 (for additional metrics of Q2A and Q2AR, please see the supplement). These results yield 7 key observations: 1) From the 1st row, pretrained unimodal encoder models are capable of strong zero-shot results on VCR “out-of-the-box”: when the VE and TE are CLIP’s vision and text encoders, accuracy is over 58% on the shortcut-mitigated evaluation (a more difficult, domain-shifted task), similar to some supervised approaches. 2) Based on the 2nd row, without enough downstream data for finetuning, pretrained unimodal encoders may not adapt well to the downstream domain; in fact, new data may disturb the pretrained knowledge and result in even poorer performance than zero-shot. 3) Our proposed MAD approach yields significant performance gains under low-shot data regimes across a range of student architectures: up to 52% for VL-BERT and 47.7% for UNITER. 4) Our proposed MAD approach also benefits the fully supervised setting, by 2.1% for VL-BERT and 3.8% for UNITER. 5) Our approach yields even higher gains under low-shot, domain-shifted scenarios of shortcut mitigation (SM): up to 71.3% for VL-BERT and 61% for UNITER. 6) With only finetuning and without increasing any computational complexity of the student model, MAD outperforms a prior approach that leverages CLIP’s vision encoder and redoes the pretraining steps: CLIP-ViL [45]. 7) The experiments show that our modularized multimodal distillation method is generalizable and flexible: it maintains its effectiveness regardless of whether the pretrained vision and text encoders come from the same pretraining framework or from separate ones.

MAD with VILLA delivers high performance on the public leaderboard (Q2A: 79.6%, QA2R: 82.9%, Q2AR: 66.2%), achieving a new state-of-the-art Q2A performance among single models pretrained with image-text data, as well as performance comparable to the overall state-of-the-art for QA2R and Q2AR.

As our approach is model-agnostic, ensembles are possible. Combining multiple MAD models yields further significant gains (Q2A: 80.93%, QA2R: 84.01%) on the fully-sampled standard validation set.

| Method | V. | L. | TS | CW | AF | Std 0.3% (100 SP/C) | Std 3% (1000 SP/C) | Std 100% | SM 0.3% (100 SP/C) | SM 3% (1000 SP/C) | SM 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | - | - | 30.85 | 53.48 | 75.53 | 26.37 | 49.27 | 71.13 |
| Unimodal Distillation | | Y | | | | 34.24 | 54.53 | 75.92 | 31.11 | 51.16 | 72.24 |
| Multimodal Distillation | Y | Y | | | | 36.78 | 55.91 | 76.37 | 34.93 | 52.06 | 73.29 |
| + Token Selection | Y | Y | Y | | | 37.21 | 57.64 | 76.83 | 36.14 | 53.33 | 73.78 |
| + Confidence Weighting | Y | Y | Y | Y | | 38.28 | 58.68 | 77.15 | 37.82 | 54.12 | 74.24 |
| Multimodal Adaptive Distillation | Y | Y | Y | Y | Y | 40.43 | 58.98 | 77.61 | 39.27 | 54.88 | 74.55 |

Table 2: VCR distillation ablation experiments using the VL-BERT student model. Both the vision and text teacher encoders come from CLIP (ViT-B/16). Low-shot results are averaged over more than 4 runs (see supplement for standard deviations). V. = vision distillation, L. = text distillation, TS = Token Selection, CW = Confidence Weighting, AF = Adaptive Finetuning. SM = Shortcut Mitigated.

As shown in Table S3, we performed an ablation study of the individual components of our MD and MAD frameworks, evaluated with VL-BERT as the student model. These results demonstrate two key observations: 1) Language distillation, not vision (as studied in previous works), contributes the most performance improvement; however, distilling both vision and language performs better than either one alone. 2) Multimodal distillation with adaptive mechanisms such as token selection and confidence weighting produces better performance than naive sequence-to-sequence multimodal distillation.

Figure 3: (a) Heatmap of MI values for the visual and textual modalities. Each cell represents one of the 12 heads × 12 layers of the VL-BERT Base model; the top row corresponds to the original VL-BERT Base model and the bottom row to VL-BERT Base with Multimodal Distillation (indices start at 0). (b) Average Modality Importance (MI) values for each layer of VL-BERT with MAD (green) and the baseline (blue). Shaded areas represent the differences between the vision and text modalities.
| Base (Student) Model | Method | VE | TE | Val 0-shot | Val 3,000 | Val 30,000 | Val Full | Test Full |
|---|---|---|---|---|---|---|---|---|
| Only Finetune | Direct Finetune | CLIP-V | CLIP-T | 47.23 | 50.91 | 57.02 | 66.91 | |
| Only Finetune | Adapter | CLIP-V | CLIP-T | | 38.10 | 41.65 | 41.75 | |
| VL-BERT | Baseline | - | - | 33.31 | 53.28 | 62.31 | 74.66 | 74.02 |
| VL-BERT | MD★ | CLIP-V | CLIP-T | | 56.02 | 64.92 | 75.08 | 74.93 |
| VL-BERT | MAD★ | CLIP-V | CLIP-T | 35.18 | 56.78 | 65.37 | 75.75 | 75.43 |
| UNITER | Baseline | - | - | 32.23 | 58.36 | 66.23 | 79.02 | 79.19 |
| UNITER | MD★ | CLIP-V | CLIP-T | | 58.95 | 67.74 | 80.08 | 80.16 |
| UNITER | MAD★ | CLIP-V | CLIP-T | 33.67 | 59.42 | 68.34 | 80.14 | 80.23 |
| VILLA | Baseline | - | - | 34.72 | 58.47 | 67.16 | 79.64 | 79.32 |
| VILLA | MAD★ | CLIP-V | CLIP-T | 36.10 | 59.65 | 68.43 | 80.67 | 80.32 |
| VILLA | MAD★ | CLIP-V | RoBERTa | | 58.48 | 68.86 | 80.64 | 80.31 |
| VILLA | MAD★ | ViT | CLIP-T | | 58.08 | 66.12 | 78.37 | 78.49 |
| Re-Pretrain | CLIP-ViLp | CLIP-V | | | 59.48 | 68.32 | 80.61 | 80.2 |

Table 3: SNLI-VE results. Baselines represent the original methods without distillation; ★ marks our methods. Validation columns show the training data subsampling (0-shot, 3,000, 30,000, full). The low-shot experiment results are averaged over 4 runs.
| Base (Student) Model | Method | VE | TE | Val 0.3% (100 SP/C) | Val 3% (1000 SP/C) | Val 100% | Test-Val | Test-Std |
|---|---|---|---|---|---|---|---|---|
| Only Finetune | Adapter | CLIP-V | CLIP-T | 16.14 | 37.41 | 51.34 | | |
| VL-BERT | Baseline | - | - | 35.33 | 63.29 | 69.05 | 71.79 | 72.22 |
| VL-BERT | MD★ | CLIP-V | CLIP-T | 36.83 | 64.43 | 70.22 | | |
| VL-BERT | MAD★ | CLIP-V | CLIP-T | 37.12 | 65.71 | 71.42 | | |
| UNITER | Baseline | - | - | 36.46 | 64.43 | 71.26 | 73.82 | 74.02 |
| UNITER | MAD★ | CLIP-V | CLIP-T | 39.75 | 65.93 | 71.94 | | |
| VILLA | Baseline | - | - | 37.18 | 65.75 | 72.11 | 74.69 | 74.87 |
| VILLA | MAD★ | CLIP-V | CLIP-T | 40.16 | 66.93 | 73.02 | 75.81 | 76.04 |
| VILLA | MAD★ | CLIP-V | RoBERTa | 39.04 | 66.23 | 72.43 | | |
| VILLA | MAD★ | ViT | CLIP-T | 38.79 | 66.10 | 72.02 | | |
| Re-Pretrain | CLIP-ViL | CLIP-V | | 39.01 | 66.84 | 73.91 | 76.48 | 76.70 |

Table 4: VQA results. Baselines represent the original methods without distillation; ★ marks our methods. Training data subsampling is shown under the validation results; test results were obtained using the full training set. The low-shot experiment results are averaged over 4 runs.

5.2 Visual Entailment (SNLI-VE)

Results on SNLI-VE are shown in Table 3, revealing 3 key observations: 1) Pretrained unimodal encoders can achieve up to 40.02% zero-shot accuracy on the SNLI-VE validation set “out-of-the-box”, which is significantly higher than baseline methods like VL-BERT and UNITER. 2) MAD continues to provide accuracy improvements under zero-shot, low-shot, and fully-supervised conditions, across all student base models, versus both the baselines without distillation and the MD approach. 3) MAD continues to outperform the concurrent work CLIP-ViLp under all evaluated conditions.

5.3 Visual Question Answering (VQA)

Table 4 shows our results on VQA. The key observations from these experiments are as follows: 1) MAD yields performance improvements over the baseline and MD under all conditions. 2) Under low-shot conditions, MAD can outperform the finetuning approach CLIP-ViLp. 3) Under the full-shot setting, MAD helps the baseline method VILLA reach 73.02 on validation, which is comparable to the performance of CLIP-ViLp. Further experimentation leveraging the model-agnostic MAD to create ensembles achieves 73.93% on local validation, outperforming CLIP-ViLp.

5.4 Analysis

Increased Utilization of the Vision Modality: Following [6], we measure the Modality Importance (MI) of both modalities. After distillation, as shown in Fig. 3, the vision MI increases (a) and the MI difference between vision and text decreases (b).

Regulating Shortcuts: Additionally, as V+L models are prone to learning trivial shortcuts from questions to correct answers [27, 61], performance degrades when tested on datasets that mitigate these signals. We further seek to qualitatively understand how our distillation approach might help improve performance in this setting. Fig. 4 shows token attention values from a VL-BERT model on an instance from the VCR dataset, before and after distillation, in addition to the token selection scores of our MAD approach. One can see that trivial tokens such as “is” have the highest attention values prior to distillation. By contrast, token selection forces emphasis on more meaningful terms, such as “sending,” “telegram,” etc.

Figure 4: Comparison of attention from VL-BERT before and after distillation, and Token Selection Module scores.

6 Conclusions

In this work, we explore leveraging pretrained unimodal encoders to improve existing pretrained multimodal encoders for VL tasks. To accomplish this, we present a new approach, Multimodal Adaptive Distillation (MAD), which performs adaptive distillation from pretrained unimodal encoders to a variety of high-performing multimodal student encoders, including VL-BERT, UNITER, and VILLA. We evaluate this approach over a new comprehensive VL task protocol involving VCR, SNLI-VE, and VQA, covering zero-shot, low-shot, fully-supervised, and domain-shifted settings. Our results demonstrate significant improvements in performance across datasets, tasks, and settings compared with baselines.

References

  • [1] Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4971–4980 (2018)
  • [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. In: International Conference on Computer Vision (ICCV) (2015)
  • [3] Bao, H., Dong, L., Wei, F.: Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
  • [4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
  • [5] Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: Yake! keyword extraction from single documents using multiple local features. Information Sciences 509, 257–289 (2020). https://doi.org/https://doi.org/10.1016/j.ins.2019.09.013, https://www.sciencedirect.com/science/article/pii/S0020025519308588
  • [6] Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y.C., Liu, J.: Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In: European Conference on Computer Vision. pp. 565–580. Springer (2020)
  • [7] Chada, R., Natarajan, P.: Fewshotqa: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models. arXiv preprint arXiv:2109.01951 (2021)
  • [8] Chen, L., Wang, D., Gan, Z., Liu, J., Henao, R., Carin, L.: Wasserstein contrastive representation distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16296–16305 (2021)
  • [9] Chen, Y.C., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: ECCV (2020)
  • [10] Cho, J.W., Kim, D.J., Choi, J., Jung, Y., Kweon, I.S.: Dealing with missing modalities in the visual question answer-difference prediction task through knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1592–1601 (2021)
  • [11] Dancette, C., Cadene, R., Teney, D., Cord, M.: Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. arXiv preprint arXiv:2104.03149 (2021)
  • [12] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [14] Fang, Z., Wang, J., Hu, X., Wang, L., Yang, Y., Liu, Z.: Compressing visual-linguistic model via knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1428–1438 (2021)
  • [15] Fang, Z., Wang, J., Wang, L., Zhang, L., Yang, Y., Liu, Z.: Seed: Self-supervised distillation for visual representation. arXiv preprint arXiv:2101.04731 (2021)
  • [16] Furlanello, T., Lipton, Z., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: International Conference on Machine Learning. pp. 1607–1616. PMLR (2018)
  • [17] Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195 (2020)
  • [18] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6904–6913 (2017)
  • [19] Gu, X., Lin, T., Kuo, W., Cui, Y.: Zero-shot detection via vision and language knowledge distillation. CoRR abs/2104.13921 (2021), https://arxiv.org/abs/2104.13921
  • [20] Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., Fu, J.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
  • [21] Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)
  • [22] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)
  • [23] Jiang, Y., Bansal, M.: Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop qa. arXiv preprint arXiv:1906.07132 (2019)
  • [24] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351 (2019)
  • [25] Kim, W., Son, B., Kim, I.: Vilt: Vision-and-language transformer without convolution or region supervision. In: ICML (2021)
  • [26] Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947 (2016)
  • [27] Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of bert. arXiv preprint arXiv:1908.08593 (2019)
  • [28] Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: Vision and language representation learning with momentum distillation. arXiv preprint arXiv:2107.07651 (2021)
  • [29] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: European Conference on Computer Vision. pp. 121–137. Springer (2020)
  • [30] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
  • [31] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • [32] Liu, Y., Zhang, W., Wang, J.: Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing 415, 106–113 (2020)
  • [33] Mun, J., Lee, K., Shin, J., Han, B.: Learning to specialize with knowledge distillation for visual question answering. In: NIPS (2018)
  • [34] Noh, H., Kim, T., Mun, J., Han, B.: Transfer learning via unsupervised task discovery for visual question answering. In: Proceedings of the IEEE CVF Conference on Computer Vision and Pattern Recognition. pp. 8385–8394 (2019)
  • [35] Pham, H., Dai, Z., Ghiasi, G., Liu, H., Yu, A.W., Luong, M.T., Tan, M., Le, Q.V.: Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050 (2021)
  • [36] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. CoRR abs/2103.00020 (2021), https://arxiv.org/abs/2103.00020
  • [37] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8),  9 (2019)
  • [38] Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. arXiv preprint arXiv:1810.03649 (2018)
  • [39] Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)
  • [40] Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  • [41] Sen, P., Saffari, A.: What do models learn from question answering datasets? arXiv preprint arXiv:2004.03490 (2020)
  • [42] Shah, M., Chen, X., Rohrbach, M., Parikh, D.: Cycle-consistency for robust visual question answering. In: Proceedings of the IEEE CVF Conference on Computer Vision and Pattern Recognition. pp. 6649–6658 (2019)
  • [43] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of ACL (2018)
  • [44] Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.W., Yao, Z., Keutzer, K.: How much can clip benefit vision-and-language tasks? (2021)
  • [45] Shen, S., Li, L.H., Tan, H., Bansal, M., Rohrbach, A., Chang, K., Yao, Z., Keutzer, K.: How much can CLIP benefit vision-and-language tasks? CoRR abs/2107.06383 (2021), https://arxiv.org/abs/2107.06383
  • [46] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)
  • [47] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of generic visual-linguistic representations. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=SygXPaEYvH
  • [48] Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision. pp. 843–852 (2017)
  • [49] Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 (2019)
  • [50] Sun, S., Gan, Z., Cheng, Y., Fang, Y., Wang, S., Liu, J.: Contrastive distillation on intermediate representations for language model compression. arXiv preprint arXiv:2009.14167 (2020)
  • [51] Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984 (2020)
  • [52] Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. EMNLP abs/1908.07490 (2019), http://arxiv.org/abs/1908.07490
  • [53] Teney, D., Hengel, A.v.d.: Zero-shot visual question answering. arXiv preprint arXiv:1611.05546 (2016)
  • [54] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: European conference on computer vision. pp. 776–794. Springer (2020)
  • [55] Wang, L., Yoon, K.J.: Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
  • [56] Wang, Z., You, H., Li, L.H., Zareian, A., Park, S., Liang, Y., Chang, K.W., Chang, S.F.: Sgeitl: Scene graph enhanced image-text learning for visual commonsense reasoning. arXiv preprint arXiv:2112.08587 (2021)
  • [57] Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 (2021)
  • [58] Xie, N., Lai, F., Doran, D., Kadav, A.: Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 (2019)
  • [59] Xu, C., Zhou, W., Ge, T., Wei, F., Zhou, M.: Bert-of-theseus: Compressing bert by progressive module replacing. arXiv preprint arXiv:2002.02925 (2020)
  • [60] Yang, J., Martinez, B., Bulat, A., Tzimiropoulos, G.: Knowledge distillation via adaptive instance normalization. arXiv preprint arXiv:2003.04289 (2020)
  • [61] Ye, K., Kovashka, A.: A case study of the shortcut effects in visual commonsense reasoning. In: Association for the Advancement of Artificial Intelligence (AAAI) (2021)
  • [62] Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., Wang, H.: Ernie-vil: Knowledge enhanced vision-language representations through scene graph. In: AAAI. vol. 1, p. 12 (2021)
  • [63] Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
  • [64] Zellers, R., Lu, X., Hessel, J., Yu, Y., Park, J.S., Cao, J., Farhadi, A., Choi, Y.: Merlot: Multimodal neural script knowledge models. arXiv preprint arXiv:2106.02636 (2021)
  • [65] Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., Parikh, D.: Yin and yang: Balancing and answering binary visual questions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5014–5022 (2016)
  • [66] Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: Vinvl: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE CVF Conference on Computer Vision and Pattern Recognition. pp. 5579–5588 (2021)
  • [67] Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: Regionclip: Region-based language-image pretraining. arXiv preprint arXiv:2112.09106 (2021)
  • [68] Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and vqa. In: AAAI. vol. 34, pp. 13041–13049 (2020)

Supplemental Materials

7 Overview

Section 8.a covers additional details regarding our evaluations on the VCR dataset, including our public leaderboard results, additional metrics for internal validation results, information about question sub-types, and further analysis of model behavior before and after distillation.

Section 9 provides additional details regarding training code configurations (9.a) and parameters in all of our experiments (9.b).

Section 10 covers additional details regarding our baselines for Multimodal Distillation (MD), Multimodal Adaptive Distillation (MAD) (10.a) and adapters over CLIP (10.b).

| Base (Student) Model | Method | Q2A | QA2R | Q2AR |
|---|---|---|---|---|
| - | Prompt w/ zero-shot, IA(R) | 54.82 | 48.58 | 26.63 |
| Only Finetune | Direct Finetune | | | |
| Only Finetune | Adapters | | | |
| VL-BERT | Baseline | 76.02 | 78.31 | 59.53 |
| VL-BERT | MD★ | 76.74 | 78.64 | 60.35 |
| VL-BERT | MAD★ | 77.24 | 79.02 | 61.04 |
| UNITER-B | Baseline | 74.23 | 76.99 | 57.15 |
| UNITER-B | MD★ | 75.21 | 77.59 | 58.36 |
| UNITER-B | MAD★ | 76.35 | 77.84 | 59.43 |
| UNITER-L | Baseline | 76.67 | 79.98 | 61.32 |
| UNITER-L | MAD★ | 77.05 | 80.57 | 62.08 |
| VILLA | Baseline | 78.27 | 82.33 | 64.44 |
| VILLA | MAD★ | 78.86 | 82.57 | 65.11 |
| Ensemble | MAD★ | 80.48 | 82.68 | 66.54 |
| Re-Pretrain | CLIP-ViLp [45] | 68.36 | 71.40 | 48.81 |

Table S1: Complete VCR evaluation results (standard evaluation). In the first row, we use the pretrained unimodal encoders CLIP-T and CLIP-V for zero-shot evaluation: IA denotes that we disregard the question information and measure only the cosine distance between the image feature and the answer feature when solving the Q2A task, and IR denotes the analogous procedure between image and rationale features. B and L denote the Base and Large variants of UNITER. ★ represents evaluation with our methods.
| Method | VE | TE | Variation | Std 0% | Std 0.3% | Std 3% | Std 100% | SM 0% | SM 0.3% | SM 3% | SM 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | CLIP-V | CLIP-T | Prompt: IQA | 50.27 | | | | 58.30 | | | |
| Zero-shot | CLIP-V | CLIP-T | Prompt: IA | 54.82 | | | | 53.82 | | | |
| Adapter | CLIP-V | CLIP-T | Head: 1 Linear | | 30.13 | 30.39 | 31.86 | | 28.10 | 27.43 | 29.31 |
| Adapter | CLIP-V | CLIP-T | Head: 3 Linear | | 30.43 | 30.86 | 32.31 | | 28.73 | 28.32 | 29.94 |
| Adapter | CLIP-V | CLIP-T | Head: 1 Transformer | | 34.41 | 36.34 | 36.42 | | 33.35 | 34.02 | 35.48 |

Table S2: Zero-shot and adapter performance of CLIP-V and CLIP-T. IQA denotes grouping the question and answer together as the text prompt; for each sample, the cosine distance is then measured between the image feature and the text prompt feature.
| Method | Variation | V. | L. | TS | CW | AF | Distillation Weight | Token # | Std 0.3% (100 SP/C) | Std 3% (1000 SP/C) | Std 100% | SM 0.3% (100 SP/C) | SM 3% (1000 SP/C) | SM 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | - | - | - | - | - | | | 30.85 | 53.48 | 75.53 | 26.37 | 49.27 | 71.13 |
| Unimodal Distillation | | | Y | | | | 0.05 | | 34.24 | 54.53 | 75.92 | 31.11 | 51.16 | 72.24 |
| Multimodal Distillation (MD★) | Different distillation weight | Y | Y | | | | 1.0 | | 33.39 | 51.57 | 75.85 | 33.22 | 50.05 | 71.65 |
| | | Y | Y | | | | 0.5 | | 34.20 | 52.96 | 75.89 | 33.31 | 51.17 | 72.10 |
| | | Y | Y | | | | 0.1 | | 34.47 | 54.72 | 76.10 | 33.64 | 51.32 | 73.20 |
| | | Y | Y | | | | 0.05 | | 36.78 | 55.91 | 76.37 | 34.93 | 52.06 | 73.29 |
| | | Y | Y | | | | 0.01 | | 35.31 | 53.65 | 76.23 | 31.34 | 51.23 | 72.03 |
| + Token Selection | Different token # | Y | Y | Y | | | Adaptive: 0.05·(Token#+2)/2 | 1 | 36.59 | 56.49 | 76.37 | 34.70 | 52.03 | 72.30 |
| | | Y | Y | Y | | | Adaptive: 0.05·(Token#+2)/2 | 2 | 37.21 | 57.64 | 76.83 | 36.14 | 53.33 | 73.78 |
| | | Y | Y | Y | | | Adaptive: 0.05·(Token#+2)/2 | 3 | 35.25 | 54.04 | 76.16 | 35.66 | 51.56 | 72.41 |
| | w/o language prior (random selection) | Y | Y | Y (random) | | | | 2 | 30.91 | 53.24 | 75.34 | 29.02 | 50.43 | 72.13 |
| + Confidence Weighting (CW) | | Y | Y | Y | Y | | | 2 | 38.28 | 58.68 | 77.15 | 37.82 | 54.12 | 74.24 |
| + Adaptive Finetuning (AF) | w/o distillation (2nd-stage "pretraining") | Y | Y | Y | Y | Y | - | | 38.32 | 54.13 | 76.38 | 36.42 | 53.07 | 72.87 |
| | w/ distillation (MAD★) | Y | Y | Y | Y | Y | Adaptive: 0.05·(Token#+2)/2 | 2 | 39.02 | 56.23 | 76.93 | 38.26 | 55.87 | 74.04 |
| MAD (S) + Semi-supervision | | Y | Y | Y | Y | Y | | 2 | 54.43 | 61.33 | 77.61 | 46.71 | 58.18 | 74.55 |

Table S3: VCR distillation ablation experiments using the VL-BERT student model. Standard evaluation (Std) and Shortcut Mitigated (SM) columns show the training data subsampling (0.3% = 100 SP/C, 3% = 1000 SP/C).

8 Evaluation

8.a VCR

8.a.1 Public Leaderboard Results

We submitted our single-model test predictions to the VCR public leaderboard, where our entry is currently ranked 12th overall (including ensemble models and entries without any reference or publication) and achieves State-Of-The-Art (SOTA) performance on VCR among public single models (i.e., single models with a public reference or publication that are pretrained with image-text data). The entry is displayed under the name CLIP-TD (a former name of our method, which uses CLIP's vision and text encoders as the teacher models), with test results of Q2A 79.6%, QA2R 82.9%, and Q2AR 66.2%. To compare our work with others more thoroughly, we also submitted an entry to the other VCR public leaderboard hosted by DARPA, where it ranks 4th overall despite using a comparatively weaker base model among our student models, VL-BERT [47]. We are also submitting our ensemble model's predictions to the leaderboard, as shown in the second-to-last row of Tab. S1.

8.a.2 Standard Validation Results

As shown in Tab. S1, we compare our methods applied on top of top-performing base models across all three evaluation metrics: Q2A, QA2R, and Q2AR. Our method consistently improves over the base models on all metrics.

8.a.3 Low-shot

For every low-shot experiment across VCR, VQA, and SNLI-VE, we average the result over at least four runs. The corresponding variances are listed in Tab. S4, Tab. S6, and Tab. S7, respectively.

Base (Student) Model | Method | VE Teacher | TE Teacher | Standard 0.3% (100 SP/C) | Var | Standard 3% (1000 SP/C) | Var | SM 0.3% (100 SP/C) | Var | SM 3% (1000 SP/C) | Var
VL-BERT | Baseline | – | – | 30.85 | 1.04 | 53.48 | 0.28 | 26.37 | 0.47 | 49.27 | 0.83
VL-BERT | MD ★ | CLIP-V | CLIP-T | 36.78 | 1.48 | 55.91 | 0.85 | 34.93 | 0.85 | 52.06 | 1.35
VL-BERT | MAD ★ | CLIP-V | CLIP-T | 40.43 (+9.58)◆ | 0.44 | 58.98 (+5.5)◆ | 0.51 | 39.27 (+12.9)◆ | 0.28 | 54.88 (+5.61)◆ | 0.61
VILLA | Baseline | – | – | 34.84 | 0.87 | 57.01 | 0.72 | 29.41 | 0.90 | 54.15 | 1.34
VILLA | MAD ★ | CLIP-V | CLIP-T | 42.95 | 1.14 | 60.93 | 1.51 | 41.97 | 0.59 | 55.20 | 1.07
VILLA | MAD ★ | CLIP-V | RoBERTa | 43.11 (+8.27)◆ | 1.32 | 61.49 (+4.48)◆ | 0.89 | 41.98 (+12.57)◆ | 1.15 | 56.85 (+2.7)◆ | 1.23
Table S4: Low-shot evaluation on VCR with variance. Every low-shot result is averaged over four runs. ◆ Values in parentheses denote the difference from the corresponding baseline. SM = Shortcut Mitigated.

8.a.4 Shortcut Mitigation Results

To better evaluate our algorithms in real-world scenarios where domain shift often occurs, we follow the modification of [61] and compare our method against others on its modified validation set. The evaluation results are listed in Tab. 1 and Tab. 2 of the main paper. A modified example is shown in Fig. S2. Motivated by the observation that "the correct option has the most overlap with the question," as illustrated in Fig. S1 and stated in [61], the modification mainly focuses on changing the pronouns of the correct and incorrect answer choices. As shown in Fig. S2, the pronouns of the correct answers are changed to differ from the question, while the pronouns of the incorrect answer choices are changed to match it.
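To make the shortcut concrete, the toy heuristic below (our own illustration, not taken from [61]) picks the answer choice with the most lexical overlap with the question; if such a blind, image-free heuristic performs well on a split, that split contains the shortcut that the SM set is designed to remove.

```python
def overlap_score(question: str, answer: str) -> int:
    """Count shared (lower-cased) word tokens between question and answer."""
    return len(set(question.lower().split()) & set(answer.lower().split()))

def shortcut_prediction(question: str, choices: list) -> int:
    """Pick the choice with the most lexical overlap with the question, ignoring the image."""
    return max(range(len(choices)), key=lambda i: overlap_score(question, choices[i]))
```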

Figure S1: An example from the standard VCR validation set. The correct answer has the same pronoun as the question, while the incorrect answer choices may not.
Figure S2: Examples of modified samples in the SM validation set.

8.a.5 VCR Dataset Question Sub-Types

According to [63], among VCR questions, 38% are explanation questions (why, how come, etc.), 24% activity (doing, looking, event, etc.), 13% temporal (happened, before, after, etc.), 8% mental (feeling, thinking, etc.), 7% role (relations, occupations, etc.), 5% scene (where, near, etc.), and 5% hypothetical (if, would, could, etc.). We refer readers to [63] for details.

8.a.6 Further Analysis of Multimodal Distillation

Our results demonstrate the impact on end-task performance of distilling knowledge from large-scale pretrained unimodal encoders into student VL models. However, a key question remains: how much of the improvement comes from the distillation of each modality?

Following [6], we measure the Modality Importance (MI) of both the visual and the textual modality. This approach sums the attention weights across the heads of each modality to quantify how heavily each modality is weighted by the model. Fig. 3 shows the average MI values over all heads at each layer of VL-BERT, with and without MD, trained on the VCR dataset. One can clearly observe that prior to distillation, the model weights the text modality more heavily as being important to correctly choosing answers. After distillation, the vision and text modalities are more equally considered. This may also explain why the model yields such notable improvements in low-shot and domain-shifted scenarios.

In Fig. 3, we plot the MI values of all heads across the 12 layers of VL-BERT Base, with and without MD. At the last layer of the undistilled model (top row), the textual MI heatmap on the right is clearly denser than the visual MI heatmap on the left. This reflects a common flaw of existing V+L models: they rely more heavily on textual information than on the visual input, indicating a shallow understanding of the visual scene in downstream tasks. In the bottom row (with MD), the difference between the left and right heatmaps is much smaller, and the visual MI heatmap is also clearly denser than its counterpart in the top row, confirming that distillation shifts more of the model's attention toward the visual modality.
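For concreteness, the sketch below shows one way such per-layer, per-head MI scores could be computed from a model's attention maps. This is a minimal illustration under our own assumptions (tensor shapes, variable names, and the exact aggregation are ours); the precise definition follows [6].

```python
import torch

def modality_importance(attn_maps, visual_mask):
    """Per-layer, per-head Modality Importance: attention mass assigned to
    visual vs. textual key positions, averaged over query positions.

    attn_maps:   list (one entry per layer) of tensors [num_heads, seq_len, seq_len]
    visual_mask: bool tensor [seq_len], True where the token is a visual token
    Returns two tensors of shape [num_layers, num_heads]: visual MI and textual MI.
    """
    visual_mi, textual_mi = [], []
    for layer_attn in attn_maps:
        to_visual = layer_attn[:, :, visual_mask].sum(-1).mean(-1)    # [num_heads]
        to_textual = layer_attn[:, :, ~visual_mask].sum(-1).mean(-1)  # [num_heads]
        visual_mi.append(to_visual)
        textual_mi.append(to_textual)
    return torch.stack(visual_mi), torch.stack(textual_mi)
```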

8.b VQA

8.b.1 Zero-shot

VQA differs from VCR and SNLI-VE in the number of answer labels per sample. VCR provides four answer choices for every image-question pair, and SNLI-VE provides only three labels for every premise-hypothesis pair, whereas VQA provides more than 2,000 candidate answer labels for every image-question pair. Thus, VQA does not strictly follow the conventional Multiple-Choice-Question (MCQ) format. Given the vast variety of answer labels, a VQA question may map to more than one plausible answer, and this one-to-many mapping makes it difficult to evaluate a learning algorithm's ability to answer VQA questions correctly. [45] naively measures the distance between each image-question pair and all answer labels, which yields a performance of almost 0%; we doubt that this is an objective way to evaluate the generalization of large-scale pretrained unimodal encoders on VQA.

In this work, we instead first apply a fixed number of templates embedded with heuristic knowledge to convert every question into a statement. We then use pretrained sentence embeddings [39] to measure the cosine distance between every question and all the answer labels. Based on this ranking, we filter the answer labels into three sets of answer candidates: Top 1, Top 3, and Top 10. Following CLIP's inference procedure and prompt engineering, we evaluate the zero-shot performance of CLIP-V and CLIP-T on the three sets, as shown in Tab. S5.
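A minimal sketch of this answer-filtering step is given below. The template, encoder name, and helper functions are illustrative assumptions rather than the exact ones used in our pipeline; the resulting candidate statements would subsequently be scored against the image with CLIP's standard zero-shot inference.

```python
import torch
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the sentence encoder of [39]

def to_statement(question: str, answer: str) -> str:
    # Hypothetical heuristic template; in practice a fixed set of such templates is used.
    return f"{question.rstrip('?')} {answer}."

def filter_answer_candidates(question: str, answer_labels: list, top_k: int = 10) -> list:
    """Rank all VQA answer labels by cosine similarity to the question and keep the Top-k."""
    q_emb = sbert.encode(question, convert_to_tensor=True)
    a_emb = sbert.encode(answer_labels, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, a_emb)[0]                       # [num_answer_labels]
    top = torch.topk(sims, k=min(top_k, len(answer_labels)))
    return [answer_labels[i] for i in top.indices.tolist()]
```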

Question Type | Top 1 | Top 3 | Top 10
Overall | 21.77 | 53.67 | 71.64
Binary | 14.01 | 41.12 | 41.12
Number | 2.75 | 50.9 | 9.62
Others | 6.58 | 10.37 | 19.63
Table S5: Zero-shot performance of CLIP-V and CLIP-T on VQA with Top-1, Top-3, and Top-10 filtered answer candidates.

8.b.2 Low-shot

For every low-shot experiment across VCR, VQA, and SNLI-VE, we average the result over at least four runs. The corresponding variances are listed in Tab. S4, Tab. S6, and Tab. S7, respectively.

Base (Student) Model | Method | VE Teacher | TE Teacher | Accuracy 0.3% (100 SP/C) | Var | Accuracy 3% (1000 SP/C) | Var
VL-BERT | Baseline | – | – | 53.28 | 0.12 | 62.31 | 0.46
VL-BERT | MD ★ | CLIP-V | CLIP-T | 56.02 | 0.46 | 64.92 | 0.74
VL-BERT | MAD ★ | CLIP-V | CLIP-T | 56.78 (+3.5)◆ | 1.31 | 65.37 (+3.06)◆ | 1.09
VILLA | Baseline | – | – | 58.47 | 0.76 | 67.16 | 0.28
VILLA | MAD ★ | CLIP-V | CLIP-T | 59.65 (+1.18)◆ | 0.13 | 68.43 | 0.15
VILLA | MAD ★ | CLIP-V | RoBERTa | 58.48 | 0.43 | 68.86 (+1.70)◆ | 0.66
Table S6: Low-shot evaluation on the VQA validation set with variance. Every low-shot result is averaged over four runs. ◆ Values in parentheses denote the difference from the corresponding baseline.

8.c SNLI-VE

8.c.1 Low-shot

For every low-shot experiment across VCR, VQA, and SNLI-VE, we average the result over at least four runs. The corresponding variances are listed in Tab. S4, Tab. S6, and Tab. S7, respectively.

Base (Student) Model | Method | VE Teacher | TE Teacher | Accuracy 0.3% (100 SP/C) | Var | Accuracy 3% (1000 SP/C) | Var
VL-BERT | Baseline | – | – | 35.33 | 0.31 | 63.29 | 0.52
VL-BERT | MD ★ | CLIP-V | CLIP-T | 36.83 | 0.38 | 64.43 | 0.27
VL-BERT | MAD ★ | CLIP-V | CLIP-T | 37.12 (+1.79)◆ | 0.58 | 65.71 (+2.42)◆ | 1.02
VILLA | Baseline | – | – | 37.18 | 0.44 | 65.75 | 0.30
VILLA | MAD ★ | CLIP-V | CLIP-T | 40.16 (+2.98)◆ | 1.09 | 66.93 (+1.18)◆ | 0.82
VILLA | MAD ★ | CLIP-V | RoBERTa | 39.04 | 0.53 | 66.23 | 0.62
Table S7: Low-shot evaluation on the SNLI-VE validation set with variance. Every low-shot result is averaged over four runs. ◆ Values in parentheses denote the difference from the corresponding baseline.

9 Training Details

9.a Baselines

All baseline results for VL-BERT, UNITER Base, UNITER Large, VILLA, and CLIP-ViLp are based on the code released by the respective authors, which we modified to include our distillation methods. CLIP-ViLp did not originally evaluate on VCR in its preprint; we therefore evaluated it on the VCR dataset ourselves.

9.b Implementation Details

- Token Selection (TS): Our ablation experiments with TS distillation show that the highest performance is obtained when the number of selected tokens is 2, as shown in Tab. S3.

- Confidence Weighting (CW): In our experiments with confidence-weighted knowledge distillation, we find that the optimal gain is achieved when distillation is conducted across all four question-answer pairs rather than only the question-correct-answer pair, as shown in Tab. S3.

- Adaptive Finetuning (AF) with Contrastive Knowledge: Before the final finetuning on the target downstream task, we conduct Adaptive Finetuning to adapt the model to the downstream domain. During Adaptive Finetuning, the model is trained on the full training set. The loss consists of the objectives from the same set of pretraining tasks as in [9], such as Masked Language Modeling (MLM) and Image-Text Matching (ITM). In addition, we include naive knowledge distillation, so the final loss combines the knowledge distillation loss with these pretraining objectives.

- VL-BERT: We train for 30 epochs with 1000 warm-up steps using the SGD optimizer. The initial learning rate is 7.0e-5 and decays by a factor of 0.1 at the 14th, 18th, and 26th epochs. Gradient accumulation steps are set to 4 on 8 NVIDIA V100 GPUs (32GB VRAM). We use the 24-layer VL-BERT-Large model.

- UNITER: The model is trained for 8000 total steps with 800 warm-up steps. With the AdamW optimizer, the initial learning rate is set to 6e-5 with a weight decay of 0.01 and a batch size of 4000. Gradient accumulation steps are set to 5 on 4 NVIDIA TITAN RTX GPUs (24GB VRAM).

- VILLA: Warm-up steps are set to 1000 and total training steps to 10000. The initial learning rate is 6e-5 with a weight decay of 0.01, using the AdamW optimizer. The training batch size is 1250. Gradient accumulation steps are set to 8 on 8 NVIDIA TITAN RTX GPUs (24GB VRAM).

- CLIP-ViLp: The model is trained for 20 epochs with a batch size of 24. The optimizer is AdamW with a peak learning rate of 5e-5. These per-model settings are also consolidated in the sketch below.
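The plain-Python sketch below gathers the settings above in one place; the dictionary keys are our own shorthand and do not correspond to the exact configuration fields of the released training code.

```python
# Consolidated training configurations from Sec. 9.b (shorthand field names are ours).
TRAIN_CONFIGS = {
    "VL-BERT (Large, 24 layers)": dict(
        optimizer="SGD", epochs=30, warmup_steps=1000,
        lr=7.0e-5, lr_decay=0.1, lr_decay_epochs=(14, 18, 26),
        grad_accum_steps=4, hardware="8x NVIDIA V100 (32GB)",
    ),
    "UNITER": dict(
        optimizer="AdamW", total_steps=8000, warmup_steps=800,
        lr=6e-5, weight_decay=0.01, batch_size=4000,
        grad_accum_steps=5, hardware="4x NVIDIA TITAN RTX (24GB)",
    ),
    "VILLA": dict(
        optimizer="AdamW", total_steps=10000, warmup_steps=1000,
        lr=6e-5, weight_decay=0.01, batch_size=1250,
        grad_accum_steps=8, hardware="8x NVIDIA TITAN RTX (24GB)",
    ),
    "CLIP-ViLp": dict(
        optimizer="AdamW", epochs=20, batch_size=24, peak_lr=5e-5,
    ),
}
```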

10 Baseline Methods

10.a Multimodal Distillation

As illustrated in the top rows of Tab. S3, we experiment with a wide range of distillation weights and find that performance is best when the weight is set to 0.05.
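As a schematic illustration of how this weight enters training, the sketch below combines the task loss with weighted distillation terms from the visual and textual teachers. The mean-squared-error feature-matching loss used here is purely an assumption for illustration; the exact distillation objective of MD is defined in the main paper.

```python
import torch.nn.functional as F

def md_loss(task_loss, student_feats, teacher_feats, w=0.05):
    """Weighted multimodal distillation: task loss plus w times the sum of
    per-modality distillation losses (MSE used here only as a placeholder).
    student_feats / teacher_feats: dicts with 'visual' and 'text' tensors."""
    kd_visual = F.mse_loss(student_feats["visual"], teacher_feats["visual"])
    kd_text = F.mse_loss(student_feats["text"], teacher_feats["text"])
    return task_loss + w * (kd_visual + kd_text)
```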

10.b Adapters over CLIP

To more comprehensively explore different options for utilizing pretrained unimodal encoders on downstream Vision-Language tasks, we also experimented with adding adapters directly on top of the unimodal encoders. As listed in Tab. S2, we essentially experimented with adding either MLP or attention layers on top of CLIP-V and CLIP-T. However, due to the large gap between the size of the finetuning data and CLIP's original pretraining data, adapters with such limited capacity fail to efficiently adapt the pretrained unimodal encoders to the downstream tasks.
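The sketch below outlines the kind of adapter heads ablated in Tab. S2 ("1 Linear", "3 Linear", "1 Transformer"). It is our own approximate re-implementation: the fusion of the frozen CLIP-V and CLIP-T features, the hidden sizes, and the classifier head are assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class CLIPAdapterHead(nn.Module):
    """Adapter head on top of frozen CLIP-V / CLIP-T features (illustrative sketch)."""

    def __init__(self, dim: int = 512, num_classes: int = 4, head: str = "1_linear"):
        super().__init__()
        self.head = head
        if head == "1_linear":
            self.adapter = nn.Linear(2 * dim, dim)
        elif head == "3_linear":
            self.adapter = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(),
                nn.Linear(dim, dim), nn.ReLU(),
                nn.Linear(dim, dim),
            )
        else:  # "1_transformer"
            self.proj = nn.Linear(2 * dim, dim)
            self.adapter = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip_v_feat: torch.Tensor, clip_t_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([clip_v_feat, clip_t_feat], dim=-1)       # [batch, 2 * dim]
        if self.head == "1_transformer":
            x = self.adapter(self.proj(x).unsqueeze(1)).squeeze(1)
        else:
            x = self.adapter(x)
        return self.classifier(x)                                # [batch, num_classes]
```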