Frame-Voyager: Learning to Query Frames for Video Large Language Models

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-LLMs are constrained by strict input-token limits and therefore cannot ingest full-length videos; existing frame-selection strategies, such as uniform sampling or text-frame retrieval, ignore both the variation in information density across a video and the task-specific instructions, leading to suboptimal performance. The paper proposes Frame-Voyager, a learnable frame-querying module supervised directly by a Video-LLM's own prediction loss: for a video of M frames, T-frame combinations are traversed, each is scored by the prediction loss of a pre-trained Video-LLM, and the resulting ranking trains Frame-Voyager to query low-loss combinations. Plugged into two different Video-LLMs and evaluated on four Video Question Answering benchmarks, Frame-Voyager delivers consistent gains, supporting its use as a plug-and-play, query-aware frame-selection mechanism.

📝 Abstract
Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
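The labeling pipeline described in the abstract — traverse the T-frame combinations of an M-frame video, score each with a pre-trained Video-LLM, and rank by prediction loss — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `video_llm_loss` is a hypothetical stand-in for a call into a pre-trained Video-LLM that returns the language-modeling loss of the ground-truth answer given the selected frames and the question.

```python
from itertools import combinations

def rank_frame_combinations(video_frames, question, answer, video_llm_loss, T):
    """Enumerate all T-frame combinations of a video and rank them by the
    Video-LLM's prediction loss on the ground-truth answer (lower = better).

    `video_llm_loss(frames, question, answer)` is a placeholder for the
    pre-trained Video-LLM used as the labeling teacher in the paper.
    """
    scored = []
    for combo in combinations(range(len(video_frames)), T):
        frames = [video_frames[i] for i in combo]
        scored.append((combo, video_llm_loss(frames, question, answer)))
    # Ascending sort: the most informative combinations (lowest loss) first.
    scored.sort(key=lambda pair: pair[1])
    return scored
```

Note that exhaustive traversal costs C(M, T) forward passes, which is why this ranking is computed offline to build training labels rather than at inference time.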
Problem

Research questions and friction points this paper is trying to address.

Overcoming token length limits in Video-LLMs
Improving frame selection for video understanding
Enhancing performance in Video Question Answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns to query informative frame combinations
Uses pre-trained Video-LLM for frame ranking
Plug-and-play solution for Video-LLMs
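The ranking supervision above can be turned into a training signal for the frame-query module with a pairwise objective. The sketch below is an assumption about one plausible formulation (a hinge-style pairwise ranking loss), not the paper's exact loss: the selector's scalar score for a better combination (lower Video-LLM loss) is pushed at least `margin` above its score for a worse one.

```python
def pairwise_ranking_loss(scores, llm_losses, margin=1.0):
    """Hinge-style pairwise ranking loss over candidate frame combinations.

    `scores[i]` is the selector's score for combination i; `llm_losses[i]` is
    the pre-computed Video-LLM prediction loss used as supervision. For every
    pair where i beats j (lower LLM loss), penalize the selector unless it
    scores i at least `margin` above j.
    """
    total, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if llm_losses[i] < llm_losses[j]:  # i should outrank j
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / max(pairs, 1)  # mean over ordered pairs
```

A perfectly ordered selector (better combinations scored higher by more than the margin) incurs zero loss, so gradients concentrate on misranked pairs.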
👥 Authors
Sicheng Yu (Bytedance)
Chengkai Jin (Bytedance)
Huanyu Wang (Bytedance)
Zhenghao Chen (Bytedance)
Sheng Jin (Bytedance)
Zhongrong Zuo (Bytedance)
Xiaolei Xu (Bytedance)
Zhenbang Sun (Bytedance)
Bingni Zhang (Bytedance)
Jiawei Wu (Bytedance)
Hao Zhang (Nanyang Technological University)
Qianru Sun (Singapore Management University)