Pretrained Image-Text Models are Secretly Video Captioners
Abstract
Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that, with minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialized video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP-2, into a competitive video captioning system by post-training it with only 6,000 video-text pairs and simply concatenating frames, far less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating human-standard reinforcement supervision. The code, datasets, and models are released: https://anonymous.4open.science/r/ic2vc.
1 Introduction
Vision-language pretraining significantly advances multimodal tasks such as captioning, question answering, and retrieval \citepliu2023llava, liu2023improvedllava, li2023blip, dai2023instructblip, chen2023vast, kuo2023mammut, mplug2. Among these, video captioning stands out because it narrates visual concepts and their temporal interactions, reflecting the intricate multimodal processes by which humans perceive and articulate dynamic visual experiences.
Current video-text methods often incorporate intricate designs tailored to video inputs. For instance, some models extend existing frameworks by integrating frame samplers to capture temporal dynamics \citepalayrac2022flamingo, yang2021, xu2021videoclip. Other approaches, such as ALPRO \citepli2021alignprompt and VIOLET \citepfu2023empiricalmvm, propose end-to-end models that are meticulously trained on large-scale video-text datasets sourced from the Web \citepzellers2021merlot, Bain21. Despite their success, video captioning models remain highly resource-intensive, often hitting performance bottlenecks when (i) computational resources are constrained, or (ii) the task requires specialized priors without clear guidance for model design and training. This raises a critical question: for simplicity and efficiency, how can we repurpose existing image captioning models for video captioning, without relying on complex, hand-crafted video-specific designs?
To address this, we revisit fundamental training factors (model scale, data efficiency, and supervision) that critically influence video captioning while remaining agnostic to the variants of video-specific designs. First, we find that moderate-sized language models (LMs), when fine-tuned for specific tasks, can meet the demands of video captioning efficiently. This challenges the common belief that larger models are always superior, demonstrating that targeted optimization can outperform sheer model size. Second, extensive pretraining on image-text pairs, as demonstrated with BLIP-2, transfers well to video tasks. This allows the model to achieve high performance with minimal video data, offering an efficient alternative to training from scratch. Third, instead of relying on the traditional cross-entropy loss, we directly optimize the non-differentiable CIDEr metric with reinforcement learning, ensuring that the generated captions better align with human-standard video descriptions.
By bypassing complex, specialized video input designs, our experiments demonstrate that BLIP-2, straightforwardly derived from image captioning, can be effectively optimized to deliver competitive video captioning performance. This study underscores the potential of simplicity and efficiency in advancing multimodal video captioning, providing a streamlined yet stable solution. All code, datasets, and model weights are available at: https://anonymous.4open.science/r/ic2vc.
2 Recycling BLIP-2 for Video Captioning
As shown in Fig. 1, we adapt BLIP-2, a typical image-text model (details in App. B), for video captioning without any additional parameters. Each video frame is encoded by the ViT, which generates visual tokens that are concatenated to form a unified representation (e.g., an 8-frame video produces a token sequence of size 8×256). This unified token sequence is then processed by the Q-Former and passed to the LM to generate captions.
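To make the adaptation concrete, the following Python sketch illustrates the frame-concatenation pipeline. It is a hedged illustration under assumed tensor shapes, with placeholder callables (vit, qformer, language_model) rather than the exact BLIP-2/LAVIS interface.

import torch

def caption_video(frames, vit, qformer, language_model):
    """frames: (B, T, 3, H, W) clip with T sampled frames (e.g., T = 8)."""
    # Encode every frame independently with the (frozen) image ViT.
    per_frame = [vit(frames[:, t]) for t in range(frames.size(1))]  # T x (B, 256, D)
    # Concatenate along the token axis into one sequence, e.g., 8 x 256 tokens.
    video_tokens = torch.cat(per_frame, dim=1)                      # (B, T*256, D)
    # The Q-Former compresses the sequence into a fixed set of soft prompts.
    soft_prompts = qformer(video_tokens)                            # (B, 32, D_lm)
    # The language model decodes the soft prompts into a caption.
    return language_model.generate(inputs_embeds=soft_prompts)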
3 Training Recipes: Model, Data, and Supervision
According to Tab. 1, our solution achieves top-level performance on important benchmarks (particularly on CIDEr, the primary ranking metric on Paperswithcode), ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX, among models with publicly available code. More importantly, it proves to be highly efficient without any video-specific architecture design, using only 6K video-text pairs, significantly fewer than the million-scale datasets required by competing baselines.
Additional background is in App. A. The settings are detailed in App. C, and further experiments (ablations, other datasets, and other video tasks) supporting the following analysis are in App. D.
Model | MSR-VTT \citepxu2016msr | MSVD \citepchen2011collecting | VATEX \citepwang2019vatex | Code | # video-text
Metrics | C. | M. | R. | B4. | C. | M. | R. | B4. | C. | M. | R. | B4. | - | -
IcoCap | 60.2 | 31.1 | 64.9 | 47.0 | 110.3 | 39.5 | 76.5 | 59.1 | 67.8 | 25.7 | 53.1 | 37.4 | No | - |
MaMMUT | 73.6 | - | - | - | 195.6 | - | - | - | - | - | - | - | No | - |
VideoCoCa | 73.2 | - | 68.0 | 53.8 | - | - | - | - | 77.8 | - | 54.5 | 39.7 | No | 144.7M |
VALOR | 74.0 | 32.9 | 68.0 | 54.4 | 178.5 | 51.0 | 87.9 | 80.7 | 95.8 | 29.4 | 57.4 | 45.6 | Yes | 1.18M |
VLAB | 74.9 | 33.4 | 68.3 | 54.6 | 179.8 | 51.2 | 87.9 | 79.3 | - | - | - | - | No | 10.7M |
GIT2 | 75.9 | 33.1 | 68.2 | 54.8 | - | - | - | - | - | - | - | - | Yes | - |
VAST | 78.0 | - | - | 56.7 | - | - | - | - | 99.5 | - | - | 45.0 | Yes | 27M |
mPLUG-2 | 80.0 | 34.9 | 70.1 | 57.8 | 165.8 | 48.4 | 85.3 | 70.5 | - | - | - | - | Yes | 2.5M |
Ours | 79.5 | 34.2 | 68.3 | 52.4 | 168.0 | 48.3 | 85.8 | 73.5 | 87.1 | 29.1 | 56.7 | 43.3 | Yes | 6K |
3.1 Model Scale
Trainability: modal connector > LLM > ViT
To evaluate the adaptability of various components within the video captioning model, we conducted ablation studies using three setups: training all components, freezing the ViT only, and training the Q-Former only. The results, illustrated in Fig. 2(a) and supported by training curves in Fig. 4 (see App. D.1.1 for detailed discussions), reveal a clear performance hierarchy: freezing the ViT (configurations ii and iii) yields higher performance than training all components (configuration i).
Configurations with a frozen ViT allow the Q-Former and LLM to effectively leverage the pre-trained visual features, leading to better alignment in video captioning tasks. Conversely, training the ViT alongside other components introduces potential overfitting and alignment issues, resulting in suboptimal performance. The analysis establishes a hierarchy of trainability: Q-Former > LLM > ViT. The Q-Former shows the highest adaptability during training, followed by the LLM, which benefits from fine-tuning on language data. In contrast, the ViT demonstrates the least trainability, as updating its parameters often disrupts the alignment between visual features and language output.
Supporting figures indicate that the Q-Former configuration achieves the most stable performance, reaching peak validation CIDEr scores without significant overfitting (Fig. 4). This pattern aligns with additional observations in App. D.1.1, confirming that focusing on training the modal connector and LLM while freezing the ViT optimizes the model’s performance on video captioning tasks.
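A minimal sketch of these freezing configurations is given below. The attribute-name prefixes (visual_encoder, Qformer, t5_model) follow common BLIP-2 implementations and are assumptions rather than guaranteed names.

import torch.nn as nn

def configure_trainability(model: nn.Module, train_vit: bool = False,
                           train_qformer: bool = True, train_lm: bool = True) -> None:
    # Map assumed module-name prefixes to their trainability flags.
    groups = {"visual_encoder": train_vit,   # ViT: least trainable, usually frozen
              "Qformer": train_qformer,      # modal connector: most trainable
              "t5_model": train_lm}          # LM: fine-tuned (optionally via LoRA)
    for name, param in model.named_parameters():
        for prefix, flag in groups.items():
            if name.startswith(prefix):
                param.requires_grad = flag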
Mid-sized LLMs offer trainability for video captioning
We analyzed the impact of LM size on video captioning by comparing three models: OPT-2.7B, Flan-T5-XL-3B, and Vicuna-7B (see Fig. 2(b) and Fig. 5). The results demonstrate that Flan-T5-XL-3B, a mid-sized model, achieves superior performance in generating video captions, outperforming both the smaller OPT-2.7B and the larger Vicuna-7B on the key CIDEr metric. This challenges the notion that larger LMs always yield better results in multimodal tasks.
Training dynamics further support the advantages of mid-sized LLMs. As shown in Fig. 5, the smaller OPT-2.7B model requires 20 epochs to reach peak performance and fails to overfit, indicating limited expressiveness. On the other hand, Vicuna-7B converges rapidly within 5 epochs but quickly shows signs of overfitting, suggesting that its added complexity may not translate into meaningful improvements for video captioning. Flan-T5-XL-3B strikes a balance, reaching peak validation within 14 epochs and maintaining a better trade-off between generalization and overfitting.
These findings and training procedure analysis in App. D.1.2 indicate video captioning tasks benefit more from models capable of descriptive processing rather than advanced conversational or reasoning abilities. Thus, mid-sized LMs like Flan-T5-XL-3B effectively balance trainability, efficiency, and performance in video captioning tasks.
3.2 Data Efficiency
Image-Text pretraining offers transferability to video tasks
We examine the effect of image-text pretraining on video captioning by comparing the performance of two BLIP-2 models pre-trained on different dataset sizes: one on 129 million pairs (officially released) and the other on 4 million pairs (reproduced in-house). As depicted in Fig. 2(c), the model pre-trained with 129M pairs achieves a significantly higher CIDEr score (71.3) compared to the model trained with only 4M pairs (65.7), underscoring the advantages of using a larger dataset.
Fig. 6 (in App. D.2.1) further reveals that the model trained on 129M pairs converges faster and achieves higher performance than the model trained on fewer pairs. This suggests that video captioning tasks require robust grounding, with larger datasets significantly enhancing the model’s ability to map visual concepts to language.
These results further underscore the efficiency of reusing extensively pre-trained image-text models for video tasks. Large-scale data exposure improves the model’s comprehension of visual content, making it more suitable for generating accurate video captions. For a detailed analysis of the training process, refer to App. D.2.1.
Lower resolution efficiently supports video captioning
We examined the impact of video resolution on training video captioning models by comparing two settings: 224×224 and 364×364. As shown in Fig. 3(b) and Fig. 7, models trained with lower-resolution videos (224×224) achieve competitive performance compared to those trained at higher resolution (364×364), despite exhibiting slightly more fluctuating training curves.
The results reveal that when basic frame aggregation techniques such as averaging or concatenation are used, lower resolution proves to be not only sufficient but also more efficient for generating accurate captions. The competitive CIDEr obtained at 224×224 resolution indicates that coarse visual information is adequate for the model to perceive and generate descriptive captions effectively.
Moreover, Fig. 7 demonstrates that while higher resolution (364×364) can lead to more stable training dynamics, the benefits are minimal when sophisticated frame aggregation is not applied. These findings suggest that adopting lower resolution offers practical advantages, including reduced computational requirements, without compromising captioning performance. For further insights, see the detailed analysis in App. D.2.2.
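The two resolution settings only change the frame preprocessing; a minimal sketch is shown below, assuming standard CLIP/EVA normalization statistics (an assumption, since the exact preprocessing is not reproduced here).

from torchvision import transforms

def build_frame_transform(size: int = 224) -> transforms.Compose:
    # Resize each sampled frame to size x size before the ViT encoder.
    return transforms.Compose([
        transforms.Resize((size, size),
                          interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                             std=(0.26862954, 0.26130258, 0.27577711)),
    ])

low_res = build_frame_transform(224)    # efficient setting used in this study
high_res = build_frame_transform(364)   # slightly more stable, but costlier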
Frame concatenation effectively captures temporality
We evaluate two approaches for temporal fusion in video captioning: frame averaging and frame concatenation. Frame averaging computes the average of visual tokens across sampled frames, maintaining a fixed dimension. In contrast, frame concatenation extends the token sequence by concatenating visual tokens from each sampled frame, preserving more granular temporal information. These fused tokens are subsequently processed by the Q-Former for caption generation.
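Both fusion operators can be expressed in a few lines. The sketch below assumes per-frame visual tokens stacked as a (B, T, N, D) tensor and is illustrative rather than the exact implementation.

import torch

def fuse_frames(frame_tokens: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    """frame_tokens: (B, T, N, D) visual tokens from T sampled frames."""
    B, T, N, D = frame_tokens.shape
    if mode == "concat":
        # Preserve per-frame detail: sequence length grows to T*N.
        return frame_tokens.reshape(B, T * N, D)
    if mode == "average":
        # Fixed-length output: temporal detail is averaged away.
        return frame_tokens.mean(dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")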
The training dynamics, illustrated in Fig. 8 and Fig. 3(a), show that models using frame concatenation consistently outperform those using frame averaging on CIDEr. The model with frame concatenation reaches peak validation performance around epoch 8 (Fig. 8), indicating that this method effectively retains temporality. In contrast, frame averaging shows significant performance oscillations after epoch 5, suggesting that it fails to capture sufficient temporal details for stable training.
These findings indicate that frame concatenation is more effective for capturing temporal information in video captioning, as it retains detailed visual context across frames. This approach allows the LM to access a richer set of visual concepts, resulting in more accurate and coherent captions. For additional analysis, see App. D.2.3.
3.3 Training Supervision
Reinforcement learning aligns captioning with human preference
Traditional video captioning methods often rely on cross-entropy loss, which fails to fully align with human preferences for natural sentence generation. To address this, we use SCST \citeprennie2017self, which directly optimizes toward the human-like CIDEr metric. SCST leverages policy gradients from the non-differentiable CIDEr objective to guide updates to the Q-Former, LLM, and LoRA layers, enhancing alignment with human evaluation standards.
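A hedged sketch of the SCST objective is given below; cider_score stands in for an external CIDEr scorer (e.g., a COCO-caption-style toolkit), and the function signature is an assumption, not the exact implementation used here.

import torch

def scst_loss(sample_logprobs: torch.Tensor,  # (B,) summed log-probs of sampled captions
              sampled_captions, greedy_captions, references, cider_score) -> torch.Tensor:
    with torch.no_grad():
        r_sample = torch.tensor([cider_score(c, refs)
                                 for c, refs in zip(sampled_captions, references)])
        r_greedy = torch.tensor([cider_score(c, refs)
                                 for c, refs in zip(greedy_captions, references)])
        # Self-critical baseline: reward relative to the greedy decode.
        advantage = (r_sample - r_greedy).to(sample_logprobs.device)
    # Policy gradient: increase the likelihood of captions that beat the baseline.
    return -(advantage * sample_logprobs).mean()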
Fig. 2(d) and 9 show that SCST improves CIDEr scores by approximately 6.5% for Flan-T5-XL-3B and 3.4% for Vicuna-7B, while also boosting other metrics such as METEOR and ROUGE-L. Additionally, Fig. 9 illustrates a decoupling effect between training loss and validation CIDEr; models trained with SCST achieve higher CIDEr scores despite fluctuations in training loss. This shift reflects a prioritization of metrics aligned with human judgment over mere loss minimization.
The smaller improvement for Vicuna-7B likely results from its prior alignment training, which already incorporates reinforcement-based methods. Overall, SCST effectively aligns the training process with human-centered metrics, demonstrating its value for improving video captioning models. See App. D.3 for further details.
4 Discussion and Conclusion
This study stands out from existing video captioning research by identifying three factors (model scale, data efficiency, and training supervision) that are critical for effectively adapting image captioning models to video tasks. Using these insights to repurpose the image-based BLIP-2 model for video, our solution ranks 2nd, 2nd, and 3rd on MSR-VTT, MSVD, and VATEX with minimal resource usage. This open-source guide provides a foundation for future research aimed at optimizing resource allocation in video captioning and refining post-training techniques.
Limitations
Our open-source solution is currently tailored specifically to video captioning tasks due to the page constraints of this short track. While this focus allows for a detailed and resource-efficient guide, it has not yet been shown to apply directly to other tasks. However, the methods presented can still be extended to broader applications, in particular to facilitate large-scale pseudolabeling for video-text datasets.
This approach is particularly valuable in specialized domains where annotated data is scarce, providing an efficient way to significantly expand video-text data resources. Similar to how the LAION dataset has advanced the image-text field by leveraging BLIP-1 for large-scale pseudolabeling \citepli2022blip, schuhmannlaion, our work aims to bring comparable improvements to video-text integration, enabling further research and development in this area.
Appendix A Related Work and Background
Image-Text Models
Large-scale pretraining has revolutionized the field of image-text models, enabling significant advances. Models such as CoCa \citepyu2022coca and SimVLM \citepwang2022simvlm, which are trained from scratch on billions of image-text pairs, have set new benchmarks in generative tasks such as open-ended visual question answering (VQA) and visual captioning. BLIP-2 addresses the computational demands of pretraining from scratch by reusing pre-trained Vision Transformer (ViT) and LLM parameters and keeping them frozen during integration. A key innovation in BLIP-2 is the introduction of the Q-Former connector, carefully designed to enhance the interaction between the visual and language modalities \citepli2023blip. This methodology has inspired subsequent innovations in visual-lingual tuning, with newer models often incorporating the pre-trained Q-Former alongside the eva-vit-g model from BLIP-2, demonstrating the lasting impact of this methodology \citepinstructblip, zhu2023minigpt, 2023videochat.
Video-Text Models
Video-text models typically extend the capabilities of image-text models by integrating temporal feature aggregation to capture dynamic content, as exemplified by VideoCoCa \citepyan2022videotext. In addition, specialized models such as Video-LLaMA enhance the processing of temporal dynamics by embedding multiple temporal Q-former layers, facilitating nuanced interactions across modalities. Such advances refine the synergy between video Q-formers and LLMs within the model architecture, building on the foundation of BLIP-2 \citepdamonlpsg2023videollama. Building on these developments, recent studies, including VideoChat, PandaGPT, Valley, and Video-ChatGPT, investigate the embedding of frozen LLMs into video LMs, pushing the boundaries of the field \citep2023videochat, su2023pandagpt, luo2023valley, Maaz2023VideoChatGPT. In our study, we use BLIP-2 as a basic model for captioning, first pre-trained on images and then adapted to video by incorporating a video frame merging mechanism that effectively captures temporal nuances. This simplicity allows us to focus on evaluating the effects of model size, data volume, and training strategies on video captioning performance as we scale.
Difference between Image and Video Captioning
The fundamental difference between image and video captioning stems from their source inputs: image captioning processes a single static image, while video captioning requires an understanding of the temporal dynamics over a sequence of frames. When adapted to video, pre-trained image models such as GIT \citepwang2022git, VideoCoCa \citepyan2022videotext, and IcoCap \citepliang2023icocap show remarkable adaptability with only moderate modifications, demonstrating their transferability. Conversely, video-specific models, including Video-LLaMA \citepdamonlpsg2023videollama and VideoChat \citep2023videochat, use different sampling techniques to effectively capture temporal dynamics. Furthermore, models such as ALPRO \citepli2021alignprompt and VIOLET \citepfu2023empiricalmvm utilize extensive web-crawled datasets to achieve end-to-end training, enriching their learning process. In our study, instead of emulating the complex adaptations typical of specialized video models, we adopt a streamlined approach that uses averaging or concatenation to merge temporal information from sampled video frames. This method allows us to focus on evaluating the effects of model size, data volume, and training strategies on video captioning performance as we scale.
Appendix B Preliminary
To effectively analyze the impact of specialized video adaptations without the confounding effects of architectural design variations, we base our methodology on BLIP-2, a basic image captioning model. We then describe the rationale for selecting BLIP-2 for our study.
Architecture of BLIP-2
BLIP-2 is originally designed to convert images into captions through a simple pipeline consisting of three main components: vision, connector, and language. (i) Vision: the ViT serves as the entry point into the BLIP-2 architecture, encoding images into a series of visual tokens; for example, a 224×224 image is transformed into 256 visual tokens, laying the foundation for subsequent processing. (ii) Modal connector: the Q-Former, positioned between the ViT and the LLM, bridges the gap between the visual and language modalities. Its primary function is to project the sequence of visual tokens generated by the ViT into a format compatible with language processing. A distinctive feature of the Q-Former is its ability to condense the visual token array to a predetermined size, typically 32 tokens, regardless of the original number. This token reduction is not simply a numerical compression but a transformation into the language modality, producing so-called soft prompts. These soft prompts, now in tensor form, are then passed to the LLM for caption generation. (iii) Language: the LLM is responsible for generating the textual captions. It interprets the soft prompts from the Q-Former and weaves them into a coherent caption that accurately reflects the visual content. This step is the culmination of the BLIP-2 pipeline, which transforms visual input into descriptive language.
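The token-compression step can be illustrated with a toy module: a fixed set of 32 learned queries cross-attends over an arbitrary number of visual tokens and returns 32 soft prompts. This is a simplified stand-in for the multi-layer Q-Former, with illustrative dimensions.

import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim) -> soft prompts: (B, 32, dim)
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return out

soft_prompts = ToyQFormer()(torch.randn(2, 256, 768))   # one 224x224 image -> 256 tokens
print(soft_prompts.shape)                               # torch.Size([2, 32, 768])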
Rationale for Choosing BLIP-2 as the Base Model
In the field of vision language generative learning, many pre-trained image-based vision LMs are possible candidates besides BLIP-2, such as the LLaVA series, miniGPT-4, OpenCoCa, and OpenFlamingo, each offering different capabilities and features. Given the wide range of options available, our selection of pre-trained BLIP-2 is guided by specific criteria:
First, LLaVA uses a linear projection layer to project visual tokens from the ViT and then feeds the projected tokens into the LLM. This linear projection keeps the number of visual tokens unchanged, meaning the connector does not compress them. Although this redundant representation is not an efficiency bottleneck for a single image, extending the input to a video with multiple frames can exhaust the maximum token length of an LLM. In contrast, BLIP-2 reduces the tokens for each image/frame to a fixed number (e.g., 32). This efficient design avoids placing significant additional demands on the LLM's token capacity. Second, mini-GPT4, an instruction-tuned BLIP-2, also uses a linear projection layer to project visual tokens from the ViT into the LLM. It therefore faces the same limitation as LLaVA: when processing video frames, its LLM token capacity quickly becomes a bottleneck, limiting the number of frames that can be effectively captioned. Third, while Flamingo's cross-modal attention design makes it straightforward to adapt to video data, its open-source reproduction, OpenFlamingo, underperforms BLIP-2 in \citetli2023blip's experiments. Therefore, compared to LLaVA and mini-GPT4, BLIP-2 can be readily applied to video data by averaging or concatenating the tokens of multiple frames (with a short token sequence per frame, e.g., 32 tokens). We find that BLIP-2's generality and simplicity make it particularly well suited to video captioning. Its design requires minimal modification, letting us focus on the core factors that contribute to the effectiveness of video captioning models. This strategic choice is consistent with our goal of isolating and understanding the key elements that drive effective video captioning.
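The token-budget argument can be made explicit with a back-of-the-envelope calculation; the numbers below (256 ViT tokens per frame, 32 Q-Former queries) follow the description above and are illustrative.

def llm_visual_tokens(num_frames: int, vit_tokens_per_frame: int = 256,
                      qformer_queries: int = 32):
    # Linear connector (LLaVA / mini-GPT4 style): every ViT token reaches the LLM.
    linear = num_frames * vit_tokens_per_frame
    # Q-Former connector (BLIP-2, with the concatenation setup of Sec. 2):
    # the LLM receives a fixed number of soft prompts regardless of frame count.
    qformer = qformer_queries
    return linear, qformer

print(llm_visual_tokens(8))   # (2048, 32): the gap widens as more frames are sampled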
Appendix C Additional Experimental Details
C.1 Setup
Video Dataset Overview
Our study uses the MSR-VTT dataset \citepxu2016msr, a comprehensive open-domain video captioning resource. It includes 10,000 video clips across 20 different categories, with each clip annotated with 20 unique English sentences by contributors via Amazon Mechanical Turk. The dataset contains approximately 29,000 different words within the captions. For our experiments, we adhere to the conventional dataset partitioning: 6,513 clips for training, 497 for validation, and 2,990 for testing.
Training Configuration
Training is conducted on eight NVIDIA RTX A6000 GPUs, utilizing the MSR-VTT dataset. Optimization is performed with AdamW using a weight decay of 0.05, and the learning rate is annealed from its initial to its minimum value by a cosine scheduler. The models are trained with a batch size of 32 over 32 epochs.
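The optimization setup can be summarized in a short sketch; the learning-rate values below are placeholders (the exact values are not reproduced here), while the weight decay, epoch count, and cosine schedule follow the configuration described above.

import torch

def build_optimizer(model, init_lr: float = 1e-5, min_lr: float = 1e-6,
                    epochs: int = 32):
    # NOTE: init_lr and min_lr are assumed placeholder values.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=min_lr)
    return optimizer, scheduler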
C.2 Model Information
Our video captioning model uses the image pre-trained BLIP-2 as its foundation. The BLIP-2 model itself is initially trained from scratch using the MSCOCO \citeplin2014microsoft and CapFilt \citepli2022blip datasets, with additional data from the pseudo-labeled Conceptual Captions \citepsharma2018conceptual, SBU \citepordonez2011im2text, and LAION \citepschuhmannlaion collections. Our study employs the EVA-ViT-G vision encoder (eva-vit-g, released by \citepfang2022eva) due to its proven effectiveness. For LM decoders, we investigate the capabilities of OPT \citepzhang2022opt, Flan-T5 \citepchung2022scaling, and Vicuna-7B \citepvicuna2023. To adapt BLIP-2 for video, we use bert-base-uncased for the Q-Former architecture, maintaining parameter consistency with the image-trained version of BLIP-2. Additionally, we implement a frame token concatenation mechanism for aggregating temporal information from videos without increasing the parameter count. The detailed structures, pre-training data, and language backbones are given in Tab. C.2.
Model | # pretrain image-text | # pretrain video-text | Vision Backbone | Language Backbone
IcoCap \citepliang2023icocap | - | - | CLIP-V | Transformer |
MaMMUT \citepkuo2023mammut | 1.8B | - | ViT | Transformer |
VideoCoCa \citepyan2022videotext | 3B | 136M+8.7M | CoCa-V | CoCa-T |
VALOR \citepchen2023valor | 1.18M | 1.18M | CLIP-V/VideoSwin | BERT |
VLAB \citephe2023vlab | 5M+12M | 10.7M | ViT giant | Transformer |
GIT2 \citepwang2022git | 12.9B | - | CoSwin | Transformer |
VAST \citepchen2023vast | - | 27M | ViT | BERT
mPLUG-2 \citepmplug2 | 14M | 2.5M | ViT-L/14 | BERT-L |
Ours | 129M | 6K | EVA-ViT-G | Flan-T5-XL |
Appendix D Training Analysis and Results on Other Datasets
D.1 Model Scale
D.1.1 Trainability: modal connector > LLM > ViT
Fig. 4 presents the training curves of the video captioning model on MSR-VTT for different module freezing configurations: (a) ViT frozen, (b) only Q-Former trainable, and (c) all components trainable. The curves highlight the differences in trainability between the modal connector (Q-Former), the LLM, and the vision transformer (ViT).
The training curves indicate that setting (b), where only the Q-Former is trainable, shows the most stable performance, reaching peak validation CIDEr at epoch 14 without significant overfitting. In contrast, when additional components are trainable—such as the LLM in setting (c) or the ViT in setting (a)—the models reach peak performance earlier, at 6 and 4 epochs, respectively, but exhibit rapid overfitting afterward. This pattern suggests that increasing the number of trainable components complicates the optimization process, leading to quicker convergence but also accelerated overfitting. Consequently, setting (b) achieves the highest test CIDEr score (73.6), followed by setting (c) (73.0), and setting (a) (68.4).
Training the LLM also proves to be effective for video captioning, as reflected by the higher CIDEr score in setting (c). LLMs benefit from extensive pre-training on structured text, which enhances their ability to reason and assemble concepts. This capability allows them to align seamlessly with other modalities and reorganize visual inputs into coherent captions, making them a crucial component for video captioning tasks.
In contrast, training the ViT module appears suboptimal (or even counterproductive) for video captioning, as shown by the lower performance in setting (a). While large-scale pre-trained vision models like CLIP can capture fine-grained visual details, they often lack the structured representations necessary for composing visual information into coherent descriptions. This limitation affects the ability of the model to generate accurate captions when the ViT is a primary trainable component.
D.1.2 Mid-sized LLMs offer trainability for video captioning
To validate the advantages of mid-sized LLMs, we present the training dynamics for three different LM sizes in Fig. 5. The training curves indicate that larger models converge more quickly: OPT-2.7B requires 20 epochs to reach peak performance, Flan-T5-XL-3B takes 14 epochs, and Vicuna-7B converges in just 5 epochs. Although OPT-2.7B undergoes the longest training process, it fails to overfit the data, indicating limited model complexity. In contrast, both Flan-T5-XL-3B and Vicuna-7B show signs of overfitting soon after reaching peak performance, reflecting their greater model expressiveness for the video captioning task.
Flan-T5-XL-3B, with fewer parameters than Vicuna-7B, demonstrates sufficient complexity for video captioning tasks while requiring less computational power. Its moderate size avoids the additional burden of excessive parameters, leading to a more balanced and efficient learning process. In conclusion, mid-sized LMs, such as Flan-T5-XL-3B, provide the optimal balance of trainability and complexity for video captioning, offering more efficient learning and better performance compared to their larger counterparts.
D.2 Data Efficiency
D.2.1 Image-Text pretraining offers transferability to video tasks
Fig. 6 illustrates that BLIP-2, when pre-trained on a larger image-text dataset (129M pairs, officially released by the BLIP-2 group), converges faster and achieves a higher performance limit compared to the model trained with 4M image-text pairs. This difference suggests that video captioning, while not as demanding in reasoning as tasks like VQA, still requires a strong ability to understand and describe visual content accurately. Extensive exposure to large-scale image-text data significantly improves the model’s grounding process, enabling it to better understand and articulate visual content in video tasks. Thus, pre-training on extensive image-text datasets enhances the model’s ability to map visual concepts from the vision domain to the language domain, making it more effective for video captioning. These results further highlight the effectiveness of reusing extensively pre-trained image-text models for video captioning tasks.
D.2.2 Lower resolution efficiently supports video captioning
Fig. 7 compares the training dynamics of models using different video resolutions, showing that higher-resolution videos (364×364) exhibit slightly more stable performance when combined with a stronger frame aggregator. However, when the video frame aggregator is not highly sophisticated, lower resolution (224×224) proves to be efficient and effective, providing sufficient visual information for the model to perceive and generate accurate captions. These findings indicate that lower resolution is not only sufficient but also more efficient for video captioning, especially when using basic frame aggregation techniques.
D.2.3 Frame concatenation effectively captures temporality
Fig. 8 illustrates the training dynamics for two fusion mechanisms: frame concatenation and averaging. The model using concatenation reaches peak validation performance at epoch 8, suggesting that the complex visual tokens retain sufficient temporal information for effective learning. In contrast, the averaging mechanism demonstrates weaker performance, with significant oscillations after epoch 5, indicating that it fails to provide enough temporal information for stable training. These results indicate that frame concatenation is essential for effectively preserving temporal information, making it a more suitable approach for capturing visual concepts in video captioning.
D.3 Training Supervision
D.3.1 Reinforcement learning aligns captioning with human preference
Fig. 9 shows the training dynamics for the Flan-T5-XL-3B and Vicuna-7B models with and without Self-Critical Sequence Training (SCST). The plots illustrate how SCST affects the relationship between training loss and validation CIDEr score. When SCST is applied, the training loss shows more variation, but the validation CIDEr score remains higher compared to models without SCST. For example, Flan-T5-XL-3B with SCST achieves a validation CIDEr score of about 0.82 despite increasing training loss, while Vicuna-7B with SCST maintains a CIDEr score of about 0.77.
Without SCST, both models follow a more conventional pattern where a steady decrease in training loss corresponds to a plateau in validation performance. In contrast, SCST introduces a decoupling effect: fluctuations in training loss are no longer directly correlated with changes in validation CIDEr, suggesting that SCST promotes learning focused on optimizing human-centered metrics. These results show that reinforcement learning via SCST effectively aligns the training process with human evaluation standards, prioritizing high-quality label generation that aligns with human judgment over simply minimizing training loss.
D.4 Experiments on MSVD and VATEX dataset
The ablation results on the MSVD and VATEX datasets are provided in Fig. 10 and 11. The experiments on these datasets are largely consistent with the MSR-VTT analysis presented in Sec. 3 and App. D.1, D.2, and D.3.
Fig. 10 and 11 present detailed comparisons of different training setups for video captioning models on the MSVD and VATEX datasets. Taking Fig. 10 as the example, the results reveal the following key patterns across four configurations:
• Module freezing (Fig. 10(a)): The results show that freezing various modules has a significant impact on performance. Models with no frozen components achieve the highest CIDEr scores, indicating the benefit of fine-tuning all parts. However, freezing both the LLM and ViT results in the lowest performance, suggesting that the trainability of the connector (Q-Former) and LLM is essential for optimal fitting.
• LLM scales (Fig. 10(b)): Moderate-size LLMs, such as Flan-T5-XL-3B, provide strong performance across all metrics. Although larger models such as Vicuna-7B offer slight improvements, the gains are modest, likely reflecting MSVD's higher text quality requirements. This finding supports the use of mid-range LLMs as a balanced choice for video captioning tasks.
• Pre-training on image-text pairs (Fig. 10(c)): Models pre-trained on larger datasets (129M image-text pairs) outperform those trained on smaller datasets (4M pairs), especially in terms of CIDEr scores. This result underscores the importance of extensive pre-training for capturing diverse visual-linguistic relationships and improving video captioning performance.
• SCST (Fig. 10(d)): Applying SCST improves the model's ability to generate human-like captions by optimizing directly for the CIDEr metric. Models trained with SCST show noticeable improvements across all evaluation metrics, highlighting its effectiveness in aligning caption generation with human preferences.
Overall, the ablation results confirm that flexible tuning of the connector and LLM components is critical for adapting image-text models like BLIP-2 to video captioning tasks. While moderate-sized LLMs offer a balanced trade-off between performance and computational efficiency, extensive pre-training on large datasets significantly improves model performance. In addition, reinforcement learning via SCST effectively improves the quality of generated captions by aligning the training goal with human-centric evaluation metrics.
D.5 Experiments on MSR-VTT and MSVD Video Question-Answering Datasets
Category | MSRVTT-QA | MSVD-QA |
Module Trainability | ||
All modules trainable | 18.1 | 36.2 |
Unfreeze Q-former only | 23.9 | 38.8 |
Freeze ViT only | 22.5 | 38.5 |
RL to Human Standard | ||
SCST Disabled | 23.9 | 38.8 |
SCST Enabled | 24.1 | 41.0 |
Pretrained Image-Text Pairs | ||
129M | 23.9 | 38.8 |
4M | 18.8 | 36.2 |
Language Model Size | ||
OPT-2.7B | 16.5 | 35.7 |
FLAN-T5-XL-3B | 23.9 | 38.8 |
Vicuna-7B | 20.2 | 38.5 |
The experiments on video question-answering (VQA) tasks using the MSR-VTT and MSVD datasets are summarized in Table 3. We extend the instruction tuning recipe from LAVIS \citepli2022lavis and InstructBLIP \citepinstructblip by 30K steps to test whether our findings from video captioning are applicable to VQA. The results in Table 3 show that many of the patterns observed in video captioning extend well to video question answering:
• Consistent with video captioning, keeping the ViT frozen while training the modal connector leads to better performance. Specifically, unfreezing only the Q-Former achieves the highest top-1 accuracy, whereas making all components trainable yields the lowest scores. This underscores the importance of focusing fine-tuning on the connector (and LLM) for effective adaptation to VQA tasks.
• Applying SCST slightly improves the model's ability to generate human-like responses by directly optimizing the metrics used in scoring. This is consistent with our findings in video captioning, where SCST helped improve CIDEr scores by aligning model outputs with human preferences.
• Moderately sized LLMs, such as FLAN-T5-XL, achieve strong performance on both datasets; the larger Vicuna-7B does not yield clear gains here, suggesting that mid-range LLMs provide a good balance between accuracy and computational efficiency for VQA.
• Similar to video captioning, extensive pre-training on large datasets (129M image-text pairs) leads to better performance than pre-training on smaller datasets (4M pairs). This reinforces the importance of diverse visual-linguistic pre-training for improving generalization in both video captioning and VQA tasks.
Overall, our experiments show that the key findings from our video captioning experiments are transferable to video question-answering tasks. The tuning of trainable Q-formers and LLMs, the reuse of extensive image-text pre-trained BLIP-2, and the use of reinforcement learning all contribute to improving the performance of video-based models across tasks. This transferability suggests that our summarized guidelines provide a basic but general handbook for building effective multimodal models for video captioning and potentially even other extended tasks.