🤖 AI Summary
This work addresses the high computational cost of video large language models caused by processing dense frames, as well as the limitations of existing keyframe selection methods, which often fall into local optima and introduce noisy frames. To this end, the authors propose a training-free frame selection framework that introduces, for the first time, a "directed diversity" metric to unify relevance and diversity into a single irreplaceability score. Coupled with a budget-aware adaptive iterative mechanism, the method dynamically optimizes temporal context coverage while preserving core semantics. Evaluated on the LLaVA-Video-7B model across long-video benchmarks, the approach achieves a maximum average gain of 12.5%, significantly outperforming baseline strategies such as uniform sampling.
📝 Abstract
Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost of processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
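The abstract only sketches the method at a high level, but the two ideas — an irreplaceability score combining relevance with relevance-conditioned ("directed") diversity, and a budget-aware selection that secures a high-scoring core before expanding temporal context — can be illustrated with a minimal toy sketch. Note this is a hypothetical reconstruction, not the paper's actual algorithm: the function names, the specific choice of cosine similarity, the "distance to more-relevant frames" reading of Directed Diversity, and the core-then-neighbours expansion rule are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def irreplaceability_scores(frames, query):
    """Hypothetical sketch of a unified irreplaceability score.

    Relevance is cosine similarity to the query. 'Directed' diversity is
    read here as each frame's distance to the frames MORE relevant than
    it, so an unusual-but-irrelevant (noisy) frame still scores low.
    """
    rel = [cosine(f, query) for f in frames]
    scores = []
    for i, f in enumerate(frames):
        more_relevant = [frames[j] for j in range(len(frames)) if rel[j] > rel[i]]
        if more_relevant:
            diversity = min(1.0 - cosine(f, g) for g in more_relevant)
        else:
            diversity = 1.0  # the single most relevant frame is maximally irreplaceable
        scores.append(rel[i] * diversity)
    return scores

def select_frames(frames, query, budget):
    """Hypothetical budget-aware refinement: pick a high-irreplaceability
    core first, then spend the remaining budget on temporal neighbours
    of the core to build surrounding context."""
    scores = irreplaceability_scores(frames, query)
    order = sorted(range(len(frames)), key=lambda i: -scores[i])
    core = order[:max(1, budget // 2)]
    chosen = set(core)
    for i in core:  # expand temporal context around each core pick
        for j in (i - 1, i + 1):
            if len(chosen) >= budget:
                break
            if 0 <= j < len(frames):
                chosen.add(j)
    return sorted(chosen)[:budget]
```

With 2-D toy embeddings where frame 0 matches the query exactly, it is scored most irreplaceable and selected first, with its temporal neighbour added as context once the budget allows.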