
M-LLM Based Video Frame Selection for Efficient Video Understanding

Kai Hu1    Feng Gao3    Xiaohan Nie3    Peng Zhou3    Son Tran3    Tal Neiman3
Lingyun Wang3    Mubarak Shah2,3    Raffay Hamid3    Bing Yin3    Trishul Chilimbi3
Carnegie Mellon University 1   University of Central Florida 2   Amazon 3
kaihu@cs.cmu.edu1,∗ shah@crcv.ucf.edu2
{fenggo,nxiaohan,zhoupz,sontran,taneiman,lingyunw,raffay,alexbyin,trishulc}@amazon.com3
This work was done during the author's internship at Amazon.
Abstract

Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the model, particularly for long-context videos. However, uniform sampling can discard crucial context in certain periods of a video, so the downstream M-LLM may not have sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames that are more relevant to users' queries. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, where single-frame importance scores are obtained by prompting an M-LLM; and (ii) a temporal signal, where multiple frames are selected by prompting a Large Language Model (LLM) with the captions of all frame candidates. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.

1 Introduction

Figure 1: An example of our video frame selection for video QA. Compared to uniform sampling, our selector achieves a higher hit rate on question-relevant frames.

The rise of Large Language Models (LLMs) has revolutionized numerous domains in Artificial Intelligence (AI) [29, 3, 36, 39]. In the past year, Multi-Modal Large Language Models (M-LLMs) have raised the performance of Vision-Language Models (VLMs) to an unprecedented level in tasks such as image captioning and Visual Question Answering (VQA) [2, 12, 4, 41]. Video QA, which requires a model to understand consecutive images over time to answer a question, raises new challenges for current M-LLMs. One of the most critical challenges is the length of the video context, where a model needs to comprehend all frames in a video.

To balance the trade-off between understanding all video frames and the available context length of a specific M-LLM, conventional practices [48, 54] rely on uniform sampling of frames, wherein frames are extracted at pre-defined intervals regardless of their relevance to the specific question. Although this method ensures full coverage of the video along the time axis, it provides insufficient visual information: many irrelevant or redundant frames are included, while some important frames are ignored. This dilutes the model's ability to focus on key events in the video and sometimes increases the computational cost. In video QA [52, 21], some specific frames are more likely to contain information pertinent to the question, so this one-size-fits-all approach limits both the performance and practicality of M-LLMs. The need for adaptive frame selection is becoming important, especially when working with long videos or in resource-limited environments.

To address these limitations, the most straightforward idea is to select frames instead of sampling them uniformly [50, 30]. By focusing on the frames that help answer the question most, we can significantly reduce the visual context that an M-LLM needs to process without sacrificing the quality of video understanding. As illustrated in Figure 1, the frames containing the most informative visuals help the downstream model answer the question. To implement this idea, we propose a lightweight frame selector built on a fine-tuned LLM. It leverages the M-LLM's multi-modal understanding to effectively capture the relationship between video frames and the question. We introduce two design choices to make the framework lightweight. First, we demonstrate that small LLMs are capable of understanding complicated user questions. Second, we compress the tokens of each video frame to balance the long-context trade-off.

Although video question-answering tasks often require a large number of tokens per frame to capture content details, we hypothesize that determining frame importance does not require excessive tokens. Instead of leveraging all visual tokens of the frames, we apply an aggressive pooling-based token reduction on each video frame, which significantly improves the computational efficiency and increases the number of frames that the LLM selector can ingest.

We also propose a method to train our LLM-based video frame selector. Because there are very few well-maintained frame selection datasets for video QA, we cannot simply apply supervised training. To overcome this limitation, we use two pseudo-labeling strategies to estimate the importance of each frame in a video. The first is spatial understanding, where we prompt a well-trained M-LLM with a video QA question and it generates an importance score for each frame. However, because the context length of an M-LLM is limited, it is impossible to jointly estimate the importance scores of all frames from the temporal perspective. We therefore take advantage of an LLM to select the top-k relevant frames in a video using their captions. Using captions rather than visual tokens, an LLM can reason over a larger number of frames at a time. We combine these two approaches to obtain pseudo-labels indicating frame importance for a specific question, enabling effective training of the frame selector.

Without fine-tuning the downstream M-LLM for question answering, our frame selector enhances video QA performance by reducing the noise introduced by irrelevant frames and sharpening the model's focus on pertinent video segments. Our proposed approach shows significant improvements with various popular M-LLMs on medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video QA benchmarks. To summarize, our contributions are threefold:

  • We propose a lightweight M-LLM-based adaptive video frame selector for both more efficient and stronger video QA performance of M-LLMs.

  • We propose spatial and temporal pseudo-labeling to generate importance scores for training the video frame selector.

  • Our proposed method is plug-and-play friendly. We demonstrate comprehensive video QA improvements across popular M-LLMs without further fine-tuning of the downstream models.

2 Related Work

Multi-Modal Large Language Models (M-LLMs) As Large Language Models (LLMs) continue to demonstrate impressive abilities in language comprehension and reasoning [1, 36, 3, 39], interest is growing within the computer vision community to explore their potential for handling multi-modal inputs. Flamingo [2] demonstrates the capacity to process image and text inputs for a wide range of multi-modal tasks. BLIP-2 [13] introduces a Q-Former to map learned image features into the text embedding space of LLMs, while LLaVA [20] employs a simple Multi-Layer Perceptron (MLP) projector to align visual and textual features. Further research has focused on best practices for M-LLMs, including areas such as dynamic high resolution [21, 6], instruction-tuning data [15, 23], and different visual encoders [38, 44]. MM1 [28] and Idefics2 [11] provide comprehensive ablation studies on the design space of M-LLMs.

Figure 2: An illustration of the conventional n-frame video M-LLM framework and our video M-LLM framework with frame selection.

Video M-LLMs As image-based M-LLMs become more mature, research naturally extends to the video modality. Video-ChatGPT [26] and Valley [25] use pooling over video features to generate compact visual tokens for the downstream M-LLM. Video-LLaVA [18] aligns images and videos before projection, allowing the LLM to learn from a unified visual representation. Video-Teller [19] points out the importance of modality alignment in pre-training. PLLaVA [48] studies how different pooling of visual features affects downstream video question answering performance. Regarding long video inputs, LLaMA-VID [17] represents each frame with two tokens to reduce the overload of long videos while preserving critical information. MovieChat [34] proposes an effective memory management mechanism that reduces computation complexity and memory cost, enabling understanding of very long videos with over 10K frames. Weng et al. [45] proposes extracting video representations as sequences of short-term local features and integrating global semantics into each short-term segment feature.

Video Frame Selection Before the rise of video M-LLMs, language-aware video key-frame selection and localization had already attracted great interest [8, 22, 43, 7]. Buch et al. [5] optimizes an end-to-end pipeline that uses ground-truth question-answering labels to select a single key frame for downstream tasks. Lu et al. [24] uses a CLIP-like paradigm to train an additional transformer for visual and text feature alignment. Qian et al. [31] trains a video clip proposal model and the downstream QA model in an iterative manner. Kim et al. [10] employs a semi-parametric retriever to obtain key frames by comparing similarities between frame and language features.

Video M-LLM Frame Selection Several works propose to select key frames to improve video QA performance. SeViLA [50] prompts an M-LLM to obtain a relevance score for each frame and uses these scores to localize important frames. MVU [32] adopts a similar approach to SeViLA. However, a critical limitation of this kind of method is the absence of temporal reasoning: each frame is assessed independently, without contextual information from other frames. Furthermore, these methods incur high inference costs, since every frame requires an M-LLM pass to estimate its importance score; the longer the video, the higher the cost. In contrast, our method outputs importance scores for all frames in a single pass over compressed visual tokens, which reduces the computational cost and enables temporal reasoning. Koala [35] uses sparsely sampled key frames as conditions for processing subsequent visual tokens. ViLA [42] trains an end-to-end frame selection module to mask input frames. However, both of these methods lack text awareness: during inference, Koala's subsequent visual token processing and ViLA's frame masking do not account for the specific question posed about the video.

3 Method

This section introduces our video frame selector designed for efficient video-LLM QA. Section 3.1 discusses the motivation behind frame selection. Section 3.2 outlines the design details of the frame selector. Section 3.3 explains the generation of pseudo labels for training the frame selector. Section 3.4 describes the training process of the frame selector.

3.1 Rethinking Uniform Sampling in Video LLMs

A typical framework for video LLMs   An n-frame framework is widely adopted in existing research on video M-LLMs [14, 48, 54]. The number of frames in the input video, denoted by $T$, is variable. For example, a 3-minute video at 30 frames per second (FPS) contains $T=5400$ frames. The n-frame framework uniformly samples a fixed number of frames, $[x_1, x_2, \cdots, x_n]$, from the total $T$ frames, where each frame $x_i \in \mathbb{R}^{H \times W \times 3}$, with $H \times W$ representing the frame resolution, and typically $n \ll T$. A pre-trained visual encoder $f_v$ extracts visual features from the $n$ frames. These features are subsequently projected into the LLM's embedding space using an alignment projector $g_a$ and then flattened. Spatial pooling may also be applied to reduce the number of output tokens:

$h_i = \textit{AvgPooling}(g_a(f_v(x_i))), \quad h_i \in \mathbb{R}^{m \times d}$ (1)

where $m$ is the number of tokens representing a frame and $d$ is the hidden dimension of the LLM. Let $Q \in \mathbb{R}^{l \times d}$ denote the input embedding of the text question. The n-frame framework generates a response $r$ as follows:

$r = \textit{LLM}(h_1, \cdots, h_n, Q).$ (2)
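To make the pipeline concrete, below is a minimal PyTorch-style sketch of Eqs. (1)-(2); `visual_encoder`, `projector`, and `llm` are illustrative placeholders rather than the exact modules of any specific video M-LLM, and the shapes assume a square patch grid.

```python
import torch
import torch.nn.functional as F

def encode_frames(frames, visual_encoder, projector, m_side=12):
    """frames: (n, 3, H, W) -> per-frame visual tokens h_i of shape (n, m, d), Eq. (1)."""
    feats = visual_encoder(frames)                    # (n, p*p, c) patch features on a p x p grid
    feats = projector(feats)                          # align to the LLM hidden size d
    n, pp, d = feats.shape
    p = int(pp ** 0.5)
    grid = feats.transpose(1, 2).reshape(n, d, p, p)  # (n, d, p, p)
    pooled = F.adaptive_avg_pool2d(grid, m_side)      # spatial AvgPooling to m_side x m_side
    return pooled.flatten(2).transpose(1, 2)          # (n, m, d) with m = m_side * m_side

def answer(frames, question_embeds, visual_encoder, projector, llm):
    """Eq. (2): feed all frame tokens and the question embedding Q to the LLM."""
    h = encode_frames(frames, visual_encoder, projector)      # (n, m, d)
    visual = h.reshape(1, -1, h.shape[-1])                    # (1, n*m, d) flattened frame tokens
    inputs = torch.cat([visual, question_embeds], dim=1)      # concatenate with Q
    return llm(inputs_embeds=inputs)                          # response r
```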

Uniform sampling is not optimal  In the n-frame framework, as shown in Figure 2 (a), the input video is represented by $n \times m$ tokens, where $m$ is the number of visual tokens per frame. For example, in LLaVA-NeXT-Video [54], where $n=32$ and $m=12 \times 12$, this results in 4608 tokens. To reduce the computational cost of LLM inference, previous work has chosen either to reduce $n$ with a sliding Q-Former [14] or to reduce $m$ by spatial pooling [48, 17]. However, both approaches ignore the importance of reducing the number of key frames, i.e., adaptively reducing $n$ before encoding. $n$ can be large, especially for long videos: denser uniform sampling yields a larger $n$, thereby reducing the efficiency of the video M-LLM. On the other hand, sampling fewer frames risks omitting crucial information. For example, sampling 32 frames from a 3-minute video means taking roughly one frame every 6 seconds, potentially missing actions that occur within shorter time windows. In fact, most questions about a video can be answered using only a limited number of key frames. Inspired by this observation, we argue that adaptive frame selection according to the question is more efficient and effective than uniform frame sampling.
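As a quick illustration of the sampling gap mentioned above, the following sketch computes the uniform sampling indices for the 3-minute example; the helper name is ours.

```python
import numpy as np

def uniform_sample_indices(total_frames: int, n: int) -> np.ndarray:
    """Uniformly sample n frame indices from a video with total_frames frames."""
    return np.linspace(0, total_frames - 1, num=n).round().astype(int)

idx = uniform_sample_indices(total_frames=5400, n=32)   # 3-minute video at 30 FPS
gap_s = (idx[1] - idx[0]) / 30.0                         # ~174 frames apart, i.e. ~5.8 s
```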

3.2 Design of the Frame Selector

An ideal frame selector should be able to understand complex questions and analyze temporal relationships from the video input. To achieve this, we fine-tune an LLM to function as the frame selector. The frame selector utilizes the base LLM’s strong language comprehension and reasoning abilities to identify key frames in a video.

Similar to existing video M-LLMs, our M-LLM-based frame selection method takes as input $n$ sampled video frames along with the corresponding text question. Instead of generating an answer to the question, the frame selector identifies the most relevant frames for answering it. Specifically, we use a well-trained decoder-only LLM as the frame selector to output an $n$-dimensional vector $s$ as follows:

$s = \textit{FrameSelector}(x_1, \cdots, x_n, Q) \in \mathbb{R}^{n}$ (3)

where the $i^{\text{th}}$ element of the vector $s$ indicates the importance score of the $i^{\text{th}}$ input frame. To achieve this, we append a learnable score query $q_{\text{score}} \in \mathbb{R}^{1 \times d}$ to the end of all input tokens and feed the concatenation of the visual tokens, text tokens, and the score token to the LLM as input:

$e_1, \cdots, e_n, e^{Q}, e^{q} = \textit{LLM}(x_1, \cdots, x_n, Q, q_{\text{score}})$ (4)

where $e_i$ and $e^{q}$ denote the intermediate outputs of $x_i$ and $q_{\text{score}}$ from the penultimate transformer block. Due to the nature of causal attention, $e^{q}$ aggregates information from all visual and text tokens. We then employ an MLP to extract frame importance information from this intermediate output of the score query:

$s = \textit{MLP}(e^{q}), \quad s \in \mathbb{R}^{n}$ (5)

Figure 2 (b) presents an overview of the proposed architecture of the M-LLM-based frame selector. Instead of generating tokens that represent the frame selection, we append a learnable query vector to the end of the input sequence and learn this query token with supervision. The hidden vector of this query token, $e^{q}$, then serves as the input to generate the $n$-dimensional importance vector $s$.
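The sketch below shows one plausible way to implement this head in PyTorch, assuming a Hugging Face-style decoder-only LLM that accepts `inputs_embeds` and returns hidden states; the module names, MLP depth, and initialization are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FrameSelectorHead(nn.Module):
    """Learnable score query appended to the sequence; an MLP maps its
    penultimate-layer hidden state to n importance scores (Eqs. 3-5)."""
    def __init__(self, llm, hidden_dim: int, n_frames: int = 128):
        super().__init__()
        self.llm = llm                                          # decoder-only LLM backbone
        self.score_query = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.score_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, n_frames),
        )

    def forward(self, visual_tokens, question_tokens):
        # visual_tokens: (B, n*m, d); question_tokens: (B, l, d)
        B = visual_tokens.size(0)
        q = self.score_query.expand(B, -1, -1)                  # q_score in Eq. (4)
        seq = torch.cat([visual_tokens, question_tokens, q], dim=1)
        out = self.llm(inputs_embeds=seq, output_hidden_states=True)
        e_q = out.hidden_states[-2][:, -1]                      # penultimate block, score-query slot
        return self.score_mlp(e_q)                              # (B, n) importance scores s
```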

Algorithm 1 Greedy NMS sampling
1: Input: importance score $s \in \mathbb{R}^{n}$, number of frames to select $k$
2: Initialize the neighbor gap $\delta \leftarrow \text{int}(n / 4k)$
3: Initialize the selected index list $I_s = [\,]$
4: for step in $1, \cdots, k$ do
5:   (Greedy) Select frame index $i \leftarrow \arg\max s$; append $i$ to $I_s$
6:   (NMS) Update the importance score: $s[j] = -1$ if $|i - j| \leq \delta$
7: end for
8: $I_s \leftarrow \text{sort}(I_s)$
9: Return: $I_s$
Figure 3: An illustration of the spatial and temporal pseudo labeling for the importance scores

Select frames from the importance score After obtaining per-frame importance scores, we need to sample $k$ frames for downstream video question answering via a video M-LLM. Naively selecting the top $k$ frames with the highest importance scores is suboptimal because neighboring frames within short time intervals often have similar scores. For example, if frame $i$ has the highest importance score, frames $i+1$ and $i-1$ usually have similarly high scores, yet an adjacent frame adds little additional information once frame $i$ is already selected. To address this issue, we use a greedy algorithm combined with non-maximum suppression (NMS) to select the most informative frames. The "Greedy" step sequentially selects the top $k$ frames, without replacement, by choosing the frame with the highest importance score from the remaining set. With "NMS", once a frame is selected, its neighboring frames are excluded as they contain similar information. The "NMS-Greedy" procedure is detailed in Algorithm 1. In practice, when selecting $k$ frames from $n$ frames, frame $j$ is considered a neighbor of frame $i$ if $|i - j| \leq n / 4k$.
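A straightforward NumPy rendering of Algorithm 1, using the integer neighbor gap $\text{int}(n/4k)$ described above; the function name is ours.

```python
import numpy as np

def greedy_nms_sampling(scores: np.ndarray, k: int) -> list[int]:
    """Select k frame indices from n importance scores (Algorithm 1)."""
    s = scores.astype(float).copy()
    n = len(s)
    delta = n // (4 * k)                          # neighbor gap, int(n / 4k)
    selected = []
    for _ in range(k):
        i = int(np.argmax(s))                     # greedy: highest remaining score
        selected.append(i)
        lo, hi = max(0, i - delta), min(n, i + delta + 1)
        s[lo:hi] = -1.0                           # NMS: suppress frames within delta of i
    return sorted(selected)

# e.g. indices = greedy_nms_sampling(importance_scores, k=8)  # pick 8 of 128 frames
```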

Efficiency of the frame selector  As discussed in Section 3.1, dense uniform sampling is inefficient for a video M-LLM. Nevertheless, we still employ dense uniform sampling at the input of the frame selector to maximally preserve the video information. To overcome the drawback of dense uniform sampling, we apply spatial pooling to reduce the token count per video frame before feeding the frames into the selector. In particular, the spatial pooling reduces the number of visual tokens to a small value, e.g., $m = 3 \times 3$ (9 tokens), which is substantially fewer than the tokens per frame used in existing video LLMs, e.g., $m = 12 \times 12$ (144 tokens). This design follows the assumption that, for video question answering, the model requires a substantial number of tokens per frame to capture visual details, but far fewer tokens are needed to determine whether a frame is important; a rough outline of the frame is sufficient. Our empirical results in Table 6 also justify this assumption.
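A small sketch of this token compression under the shapes reported in Section 4.1 (a 16×16 encoder grid pooled to 3×3); the hidden size used here is illustrative.

```python
import torch
import torch.nn.functional as F

# Projected encoder output for 128 uniformly sampled frames; d = 1536 is illustrative.
n, d = 128, 1536
frame_feats = torch.randn(n, d, 16, 16)                   # 16 x 16 = 256 patches per frame

selector_tokens = F.adaptive_avg_pool2d(frame_feats, 3)   # pooled to 3 x 3 = 9 tokens/frame
print(n * selector_tokens.shape[-1] * selector_tokens.shape[-2])   # 128 * 9 = 1152 tokens

# A conventional 12 x 12 grid (144 tokens/frame) over the same 128 frames would
# instead require 128 * 144 = 18432 visual tokens.
```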

3.3 Pseudo Labels for the Frame Selector

To train the frame selector, we need supervision signals for the output importance scores $s \in \mathbb{R}^{n}$. Since there is no existing dataset that labels frame-level importance scores for video QA, we propose two methods to generate pseudo labels for training the frame selector.

Spatial Pseudo Labels Previous works use M-LLMs to evaluate whether a video frame is relevant to a question [50, 32]. A common practice is to prompt an M-LLM and ask whether the video frame provides useful information to answer the question. The relevance score of the frame is the probability that the M-LLM generates the "Yes" token. In our experiments, we observe that this method does not always provide a reasonable estimation: even if the model agrees that the frame is relevant, the M-LLM may not generate the "Yes" token but some other expression, depending on its text generation style. To address this issue, we apply chain-of-thought (CoT) prompting, asking the M-LLM to explain first and then generate a Boolean evaluation (see the appendix for the detailed prompt). This prompt allows the M-LLM to improve the evaluation quality via extra inference-time reasoning. An ideal M-LLM should adhere to the instruction by generating either "True" or "False". In the few cases where the model fails to follow the instruction, we manually append the text "Evaluation: True" to the end of the generated response. Thus we can always compute the probabilities of generating "True" and "False", denoted as $p_{\text{True}}$ and $p_{\text{False}}$ respectively. The importance score of the input frame is determined by

$s = p_{\text{True}} / (p_{\text{True}} + p_{\text{False}})$ (6)

In our experimental setup, we uniformly sample $n=128$ frames from the video and obtain the spatial pseudo labels for each frame independently. Let $s_i$ denote the score for the $i^{\text{th}}$ frame; we normalize the score vector as $s_i / \max_j s_j$. Figure 3 (a) shows the pipeline.
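One possible way to compute the score in Eq. (6) from the M-LLM's next-token logits after "Evaluation:" is sketched below; it assumes "True" and "False" map to single tokens in the tokenizer, and `logits_for_frame` is a hypothetical helper.

```python
import torch

def spatial_score(next_token_logits: torch.Tensor, tokenizer) -> float:
    """Eq. (6): s = p_True / (p_True + p_False), read from the logits at the
    position right after 'Evaluation:' in the M-LLM response."""
    true_id = tokenizer.convert_tokens_to_ids("True")     # assumes single-token 'True'/'False'
    false_id = tokenizer.convert_tokens_to_ids("False")
    probs = torch.softmax(next_token_logits, dim=-1)
    p_true, p_false = probs[true_id].item(), probs[false_id].item()
    return p_true / (p_true + p_false)

# raw = [spatial_score(logits_for_frame(f, question), tokenizer) for f in frames]
# normalized = [s / max(raw) for s in raw]   # s_i / max_j s_j over the 128 frames
```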

Temporal Pseudo Labels A significant limitation of single-frame evaluation is the lack of temporal reasoning. For example, consider the question: "What did the man do after picking up his hat?" The video content following the action of picking up the hat is crucial for answering this question. However, the spatial pseudo labels consider only one frame at a time, without the temporal context; they therefore cannot capture what occurs after the action.

To address this issue, we propose temporal pseudo labeling. Since most publicly available M-LLMs cannot consume a large number of image tokens, we instead take advantage of frame captions and use an LLM to reason over all captions. Specifically, we first obtain detailed captions of all $n$ frames by prompting an M-LLM. Second, we feed the captions of all frames, together with the question, to a strong text-only LLM. The LLM can then temporally reason about the helpfulness of all frames.

We find it challenging for an LLM to generate floating-point scores for an extensive list of frames. Consequently, we ask the model to produce a list of indices of the most helpful frames (see the appendix for the detailed prompt). Frames included in this list are assigned a score of 1, while those excluded receive a score of 0. Figure 3 (b) shows the pipeline.

While temporal pseudo labels can capture temporal relations in videos, they may suffer from information loss and model hallucination due to the two-stage evaluation process. Therefore, we combine the two methods into the final pseudo labels by averaging the scores obtained from the spatial and temporal pseudo labels.
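A minimal sketch of how the two pseudo-label signals could be combined, following the description above (frames listed by the LLM get 1, others 0, then averaged with the normalized spatial scores); the function names are ours.

```python
import numpy as np

def temporal_labels(helpful_indices: list[int], n: int = 128) -> np.ndarray:
    """Frames in the LLM's list get score 1; all other frames get score 0."""
    t = np.zeros(n)
    t[helpful_indices] = 1.0
    return t

def final_pseudo_labels(spatial: np.ndarray, temporal: np.ndarray) -> np.ndarray:
    """Final pseudo label = average of the spatial and temporal scores."""
    return 0.5 * (spatial + temporal)

# spatial: normalized scores from Eq. (6); helpful: e.g. [3, 17, 41, 88, ...]
# labels = final_pseudo_labels(spatial, temporal_labels(helpful, n=128))
```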

3.4 Training of the Frame Selector

We consider a two-stage instruction-tuning procedure. In stage 1, we freeze the pre-trained vision and LLM backbones and train the parameters of the alignment projector $g_a$, the learnable score query $q_{\text{score}}$, and the score projector $g_s$. Figure 2 (b) shows the trainable modules in red boxes and the frozen modules in blue boxes. Stage 1 training alternates between the following two tasks:

  • Visual instruction tuning Recall that $r$ in Equation 2 is the generated response from the LLM; the objective is the cross-entropy loss between $r$ and the ground-truth response. This task trains the projector $g_a$ to align the visual features with the pre-trained LLM embedding space.

  • Importance score prediction Recall that $s$ in Equation 5 is the output importance score for the $n$ frames; the objective is the binary cross-entropy loss between $s$ and the pseudo labels generated in Section 3.3. This task provides a good initialization for the score query and the score projector.

In stage 2, we only train the model with the importance score prediction task. Besides the alignment projector $g_a$, the learnable score query $q_{\text{score}}$, and the score projector $g_s$, we also include the Low-Rank Adaptation (LoRA) weights of the LLM as trainable parameters to adapt the LLM to the frame selection task.
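A sketch of the importance-score objective shared by both stages, assuming the selector returns raw per-frame scores; optimizer setup, LoRA wiring, and the alternating visual-instruction-tuning loss of stage 1 are omitted.

```python
import torch.nn.functional as F

def importance_score_loss(selector, visual_tokens, question_tokens, pseudo_labels):
    """Binary cross-entropy between predicted scores s (Eq. 5) and pseudo labels in [0, 1]."""
    scores = selector(visual_tokens, question_tokens)        # (B, n) raw scores
    return F.binary_cross_entropy_with_logits(scores, pseudo_labels)

# Stage 1 alternates this objective with the standard visual-instruction-tuning
# cross-entropy loss; stage 2 optimizes it alone, with the alignment projector,
# score query, score MLP, and the LLM's LoRA weights as trainable parameters.
```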

4 Experiments

We begin by outlining our experimental setup in Section 4.1. In Section 4.2, we demonstrate that our frame selector improves the performance of well-trained M-LLMs without changing their parameters, simply by selecting better frames. We also conduct ablation studies to demonstrate the effectiveness and efficiency of our frame selection framework in Section 4.3. Finally, we showcase qualitative examples of frames selected from the video according to the question in Section 4.4.

4.1 Experiment Setup

Training data We compile the training dataset from three sources: 1) 800K data from VideoChat2 [16], 2) 125K data from the TimeIT dataset [33], and 3) 178K data from the LLaVA-Video-178K dataset [56]. For the visual instruction tuning task, we use the entire training dataset. For the importance score prediction task, we use 400K video QA data where the video length exceeds 5 seconds.

Pseudo label generation We utilize Qwen2-VL-7B [41] to generate importance scores for each frame in the spatial pseudo labeling and concise captions for all frames in the temporal pseudo labeling. We use GPT-4o mini to propose the most helpful frames given the concise captions in the multi-frame evaluation.

Evaluation benchmarks Since our framework selects frames for video question answering, we evaluate its performance on benchmarks consisting of relatively long videos, including open-ended video QA on ActivityNet-QA [51] and multi-choice QA on NExT-QA [47], EgoSchema [27], and LongVideoBench [46].

Implementation details We use the pre-trained SigLIP ViT-Large [53] as the visual encoder and Qwen2.5 1.5B [37] as the backbone LLM. We uniformly sample 128 frames from the video as the input to the visual encoder. The output of the visual encoder has a size of $128 \times 16 \times 16$. After the alignment projector and the spatial pooling layer, the size of the visual tokens is $128 \times 3 \times 3$. In Stage 1, we use a batch size of 128, a learning rate of $10^{-3}$, and a warm-up ratio of 0.03 to train the model for two epochs. In Stage 2, a batch size of 128, a learning rate of $10^{-5}$, a warm-up over the first 3% of iterations, and a cosine learning rate scheduler are used to train the model for five epochs.

Model Model Size ActivityNet-QA
Video-ChatGPT [26] 7B 35.2 / 2.8
Chat-UniVi [9] 7B 46.1 / 3.3
LLaMA-VID [17] 7B 47.4 / 3.3
LLaMA-VID [17] 13B 47.5 / 3.3
Video-LLaVA [18] 7B 45.3 / 3.3
MiniGPT4-Video [4] 7B 46.3 / 3.4
SlowFast-LLaVA [49] 7B 55.5 / 3.4
SlowFast-LLaVA [49] 34B 59.2 / 3.5
Tarsier [40] 7B 59.5 / 3.6
Tarsier [40] 34B 61.6 / 3.7
PLLaVA [48] 7B 56.3 / 3.5
PLLaVA [48] 34B 60.9 / 3.7
LLaVA-NeXT-Video [55] 7B 53.5 / 3.2
LLaVA-NeXT-Video [55] 34B 58.8 / 3.4
PLLaVA + Selector 7B + 1.5B 57.6 (1.3\uparrow) / 3.5
PLLaVA + Selector 34B + 1.5B 62.3 (1.4\uparrow) / 3.6
LLaVA-NeXT-Video + Selector 7B + 1.5B 55.1 (1.6\uparrow) / 3.4
LLaVA-NeXT-Video + Selector 34B + 1.5B 60.2 (1.4\uparrow) / 3.5
Table 1: Comparison on open-ended question answering evaluation ActivityNet QA. Results with the “+ Selector” are ours.

4.2 Comparison with SOTA Video-LLMs

We choose two strong video M-LLMs, PLLaVA [48] and LLaVA-NeXT-Video [55], and two (multi-)image-based M-LLMs, Idefics2 [11] and Qwen2-VL [41], as baselines to illustrate how our frame selector enhances the video question answering (QA) performance of these models. For each model, we compare the performance of using uniformly sampled frames as inputs versus using frames selected by our frame selector. The number of frames seen by the video M-LLM is the same in the comparison.

Model Model Size NExT-QA
SlowFast-LLaVA [49] 7B 64.2
SlowFast-LLaVA [49] 34B 72.0
Tarsier [40] 7B 71.6
Tarsier [40] 34B 79.2
LLaVA-NeXT-Video [55] 7B 62.4
LLaVA-NeXT-Video [55] 34B 68.1
Idefics2 [11] 8B 68.0
Qwen2-VL [41] 7B 77.6
LLaVA-NeXT-Video + Selector 7B + 1.5B 63.4 (1.0\uparrow)
LLaVA-NeXT-Video + Selector 34B + 1.5B 69.3 (1.2\uparrow)
Idefics2 + Selector 8B + 1.5B 69.1 (1.1\uparrow)
Qwen2-VL + Selector 7B + 1.5B 78.4 (0.8\uparrow)
Table 2: Comparison on multi-choice question answering evaluation NExT-QA. Results with the “+ Selector” are ours.
Model Model Size EgoSchema
SlowFast-LLaVA [49] 7B 47.2
SlowFast-LLaVA [49] 34B 55.8
Tarsier [40] 7B 56
Tarsier [40] 34B 68.6
LLaVA-NeXT-Video [55] 7B 45.8
LLaVA-NeXT-Video [55] 34B 48.6
Idefics2 [11] 8B 56.6
Qwen2-VL [41] 7B 64.6
LLaVA-NeXT-Video + Selector 7B + 1.5B 47.2 (1.3\uparrow)
LLaVA-NeXT-Video + Selector 34B + 1.5B 50.6 (2.0\uparrow)
Idefics2 + Selector 8B + 1.5B 57.9 (1.3\uparrow)
Qwen2-VL + Selector 7B + 1.5B 65.9 (1.1\uparrow)
Table 3: Comparison on multi-choice question answering evaluation EgoSchema. Results with the “+ Selector” are ours.

Table 1, Table 2 and Table 3 present the comparisons on ActivityNet-QA [51], NExT-QA [47] and EgoSchema [27], respectively. Following existing practice [26], performance on ActivityNet-QA is measured with "accuracy/correctness" metrics, both evaluated using GPT-3.5, with higher values indicating better performance. Performance on NExT-QA and EgoSchema is the accuracy on multi-choice questions, where each question has 5 options. We prefill the word "Option" as the initial token to generate, and then use the next generated token as the prediction result.
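A minimal sketch of this prefilled-answer evaluation with a Hugging Face-style interface; for brevity the visual inputs are omitted (an actual multimodal call would also pass the selected frames), and the prompt formatting is illustrative.

```python
import torch

@torch.no_grad()
def predict_choice(model, tokenizer, prompt: str) -> str:
    """Prefill ' Option' and take the single next generated token as the answer."""
    inputs = tokenizer(prompt + " Option", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    new_token = out[0, inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_token).strip()     # e.g. "A", "B", ...
```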

4.3 Ablation Studies

We perform ablation studies on several components of the frame selection framework. First, we analyze the performance of the following frame selection methods on the video QA task:

  • Uniform sampling: the default uniform sampling

  • Pseudo labels from CLIP similarity: define the importance score of a frame as the image-text similarity between the frame and the text question, computed using a CLIP model. Then sample frames using Algorithm 1.

  • Pseudo labels from SeViLA: define the importance score of a frame as in SeViLA [50] to sample frames.

  • Spatial pseudo labels: define the importance score using the spatial pseudo labels only.

  • Spatial & temporal pseudo: define the importance score as the average of the spatial and temporal pseudo labels.

  • Trained selector: define the importance score of a frame using the output of our trained frame selector.

Selection Method ANet-QA NExT-QA
Uniform sampling 53.5 62.4
Scores from CLIP similarity 53.7 62.2
Pseudo labels from SeViLA 54.0 63.2
Spatial pseudo labels 54.2 63.6
Spatial & temporal pseudo 55.5 63.9
Scores from trained selector 55.1 63.4
Table 4: Performance of LLaVA-NeXT-Video 7B on ActivityNet (ANet) and NExT QA with different frame selection methods

Table 4 presents the video QA performance of LLaVA-NeXT-Video 7B on the two benchmarks. Uniform and CLIP sampling serve as simple baselines for frame selection, while the other methods utilize M-LLM reasoning. Both SeViLA and our spatial pseudo labels are single-frame-based selection methods, with the latter achieving superior performance due to enhanced reasoning during multimodal LLM inference. Spatial & temporal pseudo labels further improve upon the spatial labels, demonstrating the importance of temporal reasoning in frame selection. However, the computational cost of generating such pseudo labels is extremely high, as it requires densely prompting an M-LLM. The lightweight selector's performance matches that of frame selection using the pseudo labels, validating the effectiveness of the selector architecture.

# frames Acc@C Acc@T Acc@D Acc Speed (s)
4 67.2 61.2 73.9 66.4 0.56
8 68.7 62.5 76.9 68.1 0.92
16 69.1 63.6 76.8 68.7 1.71
32 69.5 64.3 78.4 69.3 3.40
128→4 68.5 64.5 75.7 68.5 0.76
128→8 69.3 64.9 77.5 69.3 1.12
128→16 69.4 64.8 78.5 69.5 1.91
128→32 69.2 65.6 78.7 69.6 3.50
Table 5: Performance and inference speed of LLaVA-NeXT-Video 34B on NExT-QA with different number of input frames. First 4 rows: uniform sampling, last 4 rows: sampling using the selector.

We further show that the video QA system can use fewer frames to reason over videos with the help of our frame selector. Table 5 shows the performance and inference speed of LLaVA-NeXT-Video on NExT-QA with different numbers of input frames. The inference speed was measured with float16 precision and a batch size of 1 on a single A100 GPU, using the Hugging Face implementation. When taking the same number of frames as input, our framework incurs additional inference cost to select frames using the M-LLM selector. However, this increase in inference time is not significant thanks to the efficient design of the frame selector. Moreover, the frames selected by the M-LLM selector are more useful for answering the question. Therefore, the model using a selector with $n$-frame input can achieve similar video QA performance to the model without a selector using $2n$-frame input. For example, the $128 \rightarrow 4$ configuration outperforms 8-frame uniform sampling with a faster inference speed.

# tokens / frame ActivityNet-QA NExT-QA EgoSchema
no selector 53.5 62.4 45.8
1 53.2 62.7 46.6
9 55.1 63.4 47.2
25 55.3 63.6 47.3
Table 6: Performance of LLaVA-NeXT-Video 7B with different number of tokens per frame used in the selector.
Backbone size ActivityNet-QA NExT-QA EgoSchema
no selector 53.5 62.4 45.8
0.5 B 53.8 62.8 46.4
1.5 B 55.1 63.4 47.2
7 B 55.5 64.0 47.9
Table 7: Performance of LLaVA-NeXT-Video 7B with different size of the selector’s base LLM.

As discussed in Section 3.2, one advantage of our framework is that the frame selector features a lightweight design. We examine two key hyperparameters that influence the computational efficiency: the number of tokens used to represent a frame and the size of the base LLM. By default, we use Qwen2.5 1.5B as the LLM backbone and 9 tokens per video frame. Table 6 and Table 7 present the ablation studies examining these two factors. "No selector" indicates uniform sampling of the video frames.

We further demonstrate the effectiveness of our frame selector for long video question answering (QA). LongVideoBench [46] is a recently introduced benchmark for long-context video-language understanding, with an average video duration of 473 seconds. In Table 8, we report the performance of LLaVA-NeXT-Video 34B and Qwen2-VL 7B with different numbers of input frames, using uniform sampling and sampling with our frame selector. For both M-LLMs, the performance with $n$ input frames sampled by the selector surpasses that of $2n$ input frames with uniform sampling, demonstrating the effectiveness of the frame selector.

model # frames Uniform Selector
LLaVA-Next Video 34B 4 45.3 49.5
8 46.9 49.9
16 48.1 49.8
32 49.7 50.0
Qwen2-VL 7B 4 48.0 55.0
8 50.9 56.0
16 53.6 56.5
32 53.3 57.0
Table 8: Performance of LLaVA-NeXT-Video 34B and Qwen2-VL 7B on LongVideoBench with different numbers of input frames.

4.4 Visualization of selected frames

Figure 4 presents two examples of frames selected from the video conditioned on the question. Each question involves two events and thus requires temporal reasoning. Estimating the importance of a single frame is challenging without reference to prior or subsequent frames. The frame selector effectively identifies the frames containing the answers to the questions.

Figure 4: Visualization of the frame selection results.

5 Conclusion

In this paper, we propose a lightweight M-LLM-based frame selector to improve both the performance and efficiency of video QA. The selector is question-aware: it takes densely sampled video frames and the question as input and selects the frames most relevant to answering the question. Any multi-image or video M-LLM can then complete the question answering using the selected frames. To train the frame selector, we introduce spatial and temporal pseudo-labeling, given the limited public annotations of frame-level importance. Our experiments on two medium-length video QA benchmarks (ActivityNet-QA and NExT-QA) and two long-video QA benchmarks (EgoSchema and LongVideoBench) demonstrate the effectiveness of our proposed method.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv:2303.08774, 2023.
  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024. Accessed: 2024-09-18.
  • Ataallah et al. [2024] Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.
  • Buch et al. [2022] Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2917–2927, 2022.
  • Chen et al. [2024] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.
  • Cui et al. [2022] Ran Cui, Tianwen Qian, Pai Peng, Elena Daskalaki, Jingjing Chen, Xiaowei Guo, Huyang Sun, and Yu-Gang Jiang. Video moment retrieval from text queries via single frame annotation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1033–1043, 2022.
  • Gao et al. [2023] Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, and Mike Zheng Shou. Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14773–14783, 2023.
  • Jin et al. [2024] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700–13710, 2024.
  • Kim et al. [2023] Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, and Minjoon Seo. Semi-parametric video-grounded text generation. arXiv preprint arXiv:2301.11507, 2023.
  • Laurençon et al. [2024] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
  • Li et al. [2024a] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
  • Li et al. [2023a] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023a.
  • Li et al. [2023b] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
  • Li et al. [2023c] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv:2311.17005, 2023c.
  • Li et al. [2024b] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024b.
  • Li et al. [2025] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2025.
  • Lin et al. [2023] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  • Liu et al. [2023a] Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, and Hongxia Yang. Video-teller: Enhancing cross-modal generation with fusion and decoupling. arXiv preprint arXiv:2310.04991, 2023a.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023b.
  • Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, ocr, and world knowledge, 2024a.
  • Liu et al. [2022] Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, and Qin Jin. Ts2-net: Token shift and selection transformer for text-video retrieval. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIV, pages 319–335. Springer, 2022.
  • Liu et al. [2024b] Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, and Jie Zhou. Points: Improving your vision-language model with affordable strategies. arXiv preprint arXiv:2409.04828, 2024b.
  • Lu et al. [2022] Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, and Zhiwu Lu. LGDN: Language-guided denoising network for video-language modeling. In Advances in Neural Information Processing Systems, 2022.
  • Luo et al. [2023] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
  • Maaz et al. [2024] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
  • Mangalam et al. [2023] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
  • McKinzie et al. [2024] Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, et al. MM1: Methods, analysis & insights from multimodal llm pre-training. arXiv:2403.09611, 2024.
  • OpenAI [2022] TB OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI, 2022.
  • Park et al. [2024] Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, and Michael S Ryoo. Too many frames, not all useful: Efficient strategies for long-form video qa. arXiv preprint arXiv:2406.09396, 2024.
  • Qian et al. [2022] Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, and Yu-Gang Jiang. Locate before answering: Answer guided question localization for video question answering. In IEEE transactions on multimedia, 2022.
  • Ranasinghe et al. [2024] Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos in one multimodal language model pass. arXiv preprint arXiv:2403.16998, 2024.
  • Ren et al. [2024] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.
  • Song et al. [2023] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. MovieChat: From dense token to sparse memory for long video understanding. arXiv:2307.16449, 2023.
  • Tan et al. [2024] Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024.
  • Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023.
  • Team [2024] Qwen Team. Qwen2.5: A party of foundation models, 2024.
  • Tong et al. [2024] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
  • Wang et al. [2024a] Jiawei Wang, Liping Yuan, and Yuchen Zhang. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024a.
  • Wang et al. [2024b] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024b.
  • Wang et al. [2024c] Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming C Lin, and Shan Yang. Vila: Efficient video-language alignment for video question answering. In European Conference on Computer Vision, pages 186–204. Springer, 2024c.
  • Wang et al. [2022] Zixu Wang, Yujie Zhong, Yishu Miao, Lin Ma, and Lucia Specia. Contrastive video-language learning with fine-grained frame sampling. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 694–705. Association for Computational Linguistics, 2022.
  • Wei et al. [2025] Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language model. In European Conference on Computer Vision, pages 408–424. Springer, 2025.
  • Weng et al. [2025] Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision, pages 453–470. Springer, 2025.
  • Wu et al. [2024] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024.
  • Xiao et al. [2021] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.
  • Xu et al. [2024a] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. PLLaVA: Parameter-free llava extension from images to videos for video dense captioning. arXiv:2404.16994, 2024a.
  • Xu et al. [2024b] Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841, 2024b.
  • Yu et al. [2024] Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. [2019a] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019a.
  • Yu et al. [2019b] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019b.
  • Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
  • Zhang et al. [2024a] Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024a.
  • Zhang et al. [2024b] Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024b.
  • Zhang et al. [2024c] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024c.

Supplementary Material

Appendix A Prompt for Pseudo Label Generation

Table 9 provides the prompt template for generating spatial pseudo labels. We uniformly sample $n=128$ frames and use the prompt template to obtain a score for each frame independently. We use the logits for generating the word "True" or "False" after "Evaluation:" to compute the score via Equation 6. In a few cases, the M-LLM response may not follow the instruction and does not contain the text "Evaluation: True" or "Evaluation: False". We then manually append "Evaluation: True" to the end of the response and use the logit for generating the word "True" to compute the score.

The image is a video frame from a video. A question
about the video is:
{question}
Evaluate whether the video frame provides useful
information to answer this question about the video.
First explain your reasoning. Then generate a
Boolean evaluation of the frame’s usefulness. For
example:
Evaluation: True
Table 9: Prompt template for spatial pseudo labels

Table 10 provides the prompt template for generating temporal pseudo labels. We first use the M-LLM to generate a concise caption for each of the $n=128$ uniformly sampled frames. Then we use the prompt in Table 10 to generate a list of frame indices containing the most helpful frames.

I need to answer a question based on a long video. To
do this, I have uniformly sampled 128 frames from
the video, each with a corresponding caption. Below is
the list of frames and their captions:
Frame 1 : {caption1}
Frame 2 : {caption2}
\cdots
Frame 128 : {caption128}
The question I need to answer is:
{question}
Please provide a list of 8 frames that would be most
helpful for answering this question.
Rule: ONLY provide a Python List without extra text.
Table 10: Prompt template for temporal pseudo labels
M-LLM ANet-QA NExT-QA
No pseudo-labels 53.5 62.4
LLaVA-NeXT 7B 53.9 62.8
Idefics2 8B 53.8 63.2
Qwen2 VL 7B 54.2 63.6
Table 11: Performance of LLaVA-NeXT-Video 7B on ActivityNet (ANet) and NExT QA with different spatial pseudo-labels
M-LLM EgoSchema LongVideoBench
Uniform 4 frames 45.8 45.3
16→4 47.8 46.0
32→4 48.2 48.9
128→4 49.0 49.5
Table 12: Performance of selecting 4 frames from different numbers of candidate frames on EgoSchema and LongVideoBench, with LLaVA-NeXT-Video 34B as the downstream video-LLM.

Appendix B Additional Results

Pseudo Label Generation with Different M-LLMs

Qwen2-VL serves as the prompting M-LLM for spatial pseudo-label generation, as detailed in Section 3.3. We investigate the influence of alternative M-LLMs on video QA performance. Table 11 compares the performance of LLaVA-NeXT-Video 7B on ActivityNet and NExT-QA using frames selected based on spatial pseudo-labels generated by different prompting M-LLMs. A stronger M-LLM produces higher-quality pseudo-labels.

Number of frames before selection

Existing frame selection methods [50, 32] typically sample $16 \sim 32$ frames from a video and then perform frame selection on these frames. In contrast, our method samples a significantly larger pool of 128 frames prior to the frame selection process. We posit that a larger number of candidate frames is essential for long video QA. To evaluate this, we assess the video QA performance of LLaVA-NeXT-Video 34B taking 4 frames selected from different numbers of candidate frames. Table 12 summarizes the results on the long-video QA benchmarks EgoSchema and LongVideoBench. The improvement in QA performance from $16 \rightarrow 4$ to $128 \rightarrow 4$ is significant, showing the necessity of having a large frame selection candidate pool.

Appendix C More Visualization Results

Figure 5 and Figure 6 are the zoom-in for Figure 4. Figure 7 and Figure 8 are additional visualization results.

Figure 5: One visualization example of the frame selection results.
Figure 6: One visualization example of the frame selection results.
Figure 7: One visualization example of the frame selection results.
Figure 8: One visualization example of the frame selection results.