🤖 AI Summary
To address the challenges of limited context windows and semantic incoherence in keyframe selection for long-video understanding, this paper proposes a scene-driven K-frames method. It leverages semantic-aware segment localization to generate query-relevant, temporally contiguous, and multi-scale adjustable keyframe sequences. The approach is the first to achieve scene-aware frame selection without requiring additional annotations, and it supports arbitrary keyframe counts. The authors introduce the PeakClips dataset and design a three-stage progressive training paradigm that combines supervised fine-tuning with reinforcement learning to optimize downstream task performance. Extensive experiments on mainstream long-video understanding benchmarks demonstrate significant improvements over state-of-the-art methods. The proposed framework delivers an efficient, interpretable, and plug-and-play solution for multi-scale keyframe selection.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in image understanding, but long-video understanding is constrained by context windows and computational cost. Uniform frame sampling often leads to substantial information loss. Meanwhile, existing keyframe selection methods, such as text-frame retrieval or RL-based frame optimization, typically yield sparse and temporally disjointed frames, overlooking scene continuity and lacking flexibility for multi-scale frame selection. To address these limitations, we introduce K-frames, a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips, which enables any-k keyframe selection to meet diverse user budgets. To realize this approach, we first introduce PeakClips, a dataset of 200K query-conditioned video highlights. Building on this dataset, K-frames learns clip2frame selection through a three-stage progressive curriculum: two Supervised Fine-Tuning stages for temporal grounding and key-clip perception, followed by a Reinforcement Learning stage that directly optimizes the scene-driven prediction policy for downstream tasks without further annotations. Extensive experiments on major long-video understanding benchmarks demonstrate that K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales. Our dataset and model will be made available.
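To make the "any-k" idea concrete, the following is a minimal, hypothetical sketch of turning predicted query-relevant clips into an arbitrary number of keyframes. It assumes the model has already output clips as `(start_sec, end_sec)` intervals; the function name and the proportional-allocation scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: spread a frame budget k across query-relevant clips,
# proportionally to each clip's duration, then sample uniformly within each
# clip so the selected frames stay temporally contiguous per scene.
# (Illustrative only; not the K-frames implementation.)

def select_k_frames(clips, k):
    """clips: list of (start_sec, end_sec) intervals; returns k timestamps."""
    durations = [end - start for start, end in clips]
    total = sum(durations)
    # Proportional allocation, at least one frame per clip.
    alloc = [max(1, round(k * d / total)) for d in durations]
    # Trim or pad so the allocations sum exactly to k.
    while sum(alloc) > k:
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < k:
        alloc[durations.index(max(durations))] += 1
    timestamps = []
    for (start, end), n in zip(clips, alloc):
        step = (end - start) / (n + 1)  # uniform spacing inside the clip
        timestamps.extend(start + step * (i + 1) for i in range(n))
    return sorted(timestamps)

clips = [(12.0, 20.0), (45.0, 47.0), (90.0, 102.0)]
frames = select_k_frames(clips, k=8)
print(len(frames))  # 8 timestamps, contiguous within each clip
```

Because frames are drawn from within semantically coherent clips rather than scattered across the whole video, the same predicted clips can serve any frame budget `k` by re-running the allocation.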