Adaptive Keyframe Sampling for Long Video Understanding

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long video inputs often degrade multimodal large language model (MLLM) performance: token counts exceed context limits, and existing frame-sampling strategies frequently discard semantically critical information. To address this, we propose an adaptive keyframe sampling method that formulates keyframe selection as a joint optimization balancing prompt relevance and temporal coverage. The lightweight, plug-and-play module selects keyframes efficiently via an adaptive approximation algorithm, requires no fine-tuning of the backbone MLLM, and remains compatible with diverse video-language models. Evaluated on two long-video understanding benchmarks, the approach significantly improves video question answering accuracy over strong baselines. The results demonstrate that intelligent pre-filtering of video content is essential for alleviating MLLM context-length bottlenecks and enhancing downstream comprehension.

📝 Abstract
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as context. However, when the visual input changes from a single image to a long video, this paradigm encounters difficulty because the vast number of video tokens significantly exceeds the maximum capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from the input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information carried by a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy beyond strong baselines by selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
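The relevance-plus-coverage objective described in the abstract can be illustrated with a minimal greedy sketch. This is a hypothetical simplification, not the paper's exact AKS algorithm (see the linked repository for that): it assumes per-frame relevance scores have already been computed (e.g., frame-prompt similarity from a vision-language model), and the `coverage_weight` parameter and the nearest-selected-frame coverage term are illustrative choices.

```python
def select_keyframes(relevance, num_keyframes, coverage_weight=0.5):
    """Greedily pick frames that balance prompt relevance and temporal coverage.

    relevance: list of per-frame relevance scores (assumed precomputed,
               e.g., frame-prompt similarity from a vision-language model).
    num_keyframes: fixed token budget, expressed as a frame count.
    coverage_weight: trade-off between relevance and temporal spread
                     (hypothetical hyperparameter for this sketch).
    """
    n = len(relevance)
    selected = []
    for _ in range(num_keyframes):
        best_idx, best_score = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            # Coverage term: normalized distance to the nearest frame
            # already selected, rewarding frames that fill temporal gaps.
            if selected:
                coverage = min(abs(i - j) for j in selected) / n
            else:
                coverage = 1.0
            score = relevance[i] + coverage_weight * coverage
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return sorted(selected)


# Example: with two slots, the greedy picks the two relevance peaks,
# and the coverage term keeps them temporally separated.
frames = select_keyframes([0.1, 0.9, 0.2, 0.8, 0.1], num_keyframes=2)
print(frames)  # → [1, 3]
```

The key design point the sketch mirrors is that relevance alone would cluster samples around a single salient moment; the coverage term forces the fixed frame budget to be spread across the video, which is what distinguishes adaptive keyframe selection from naive top-k relevance sampling.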
Problem

Research questions and friction points this paper is trying to address.

Overcoming MLLM token capacity limits for long videos
Reducing key information loss in video token sampling
Optimizing keyframe selection for video understanding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Keyframe Sampling for video understanding
Optimizes keyframe relevance and video coverage
Improves accuracy in video question answering