GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of video large language models caused by processing dense frames, as well as the limitations of existing keyframe selection methods, which often fall into local optima and admit noisy frames. To this end, the authors propose a training-free frame selection framework that introduces, for the first time, a "directed diversity" metric to unify relevance and diversity into a single irreplaceability score. Coupled with a budget-aware adaptive iterative mechanism, the method dynamically optimizes temporal context coverage while preserving core semantics. Evaluated on the LLaVA-Video-7B model across long-video benchmarks, the approach achieves a maximum average performance gain of 12.5%, significantly outperforming baseline strategies such as uniform sampling.

📝 Abstract
Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost of processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose GIFT: Global Irreplaceability Frame Targeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs an adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of 12.5% across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling.
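The paper's exact formulas are not reproduced on this page. As a rough, hypothetical illustration of the core idea, the sketch below scores each frame by query relevance down-weighted by its redundancy with more-relevant frames ("directed diversity"), then keeps the top-scoring frames. The function names, cosine-similarity choice, and the plain top-k selection (standing in for the paper's budget-aware refinement) are all assumptions, not the authors' implementation.

```python
import numpy as np

def irreplaceability(frame_feats, query_feat):
    """Hypothetical sketch: score each frame by its relevance to the
    query, down-weighted by redundancy with any *more relevant* frame
    (directed diversity: diversity conditioned on relevance)."""
    F = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    rel = F @ q          # relevance of each frame to the query, shape (N,)
    sim = F @ F.T        # pairwise cosine similarity between frames, (N, N)
    scores = np.empty(len(rel))
    for i in range(len(rel)):
        stronger = rel > rel[i]  # frames more relevant than frame i
        # directed diversity: dissimilarity to the closest stronger frame
        div = 1.0 - sim[i, stronger].max() if stronger.any() else 1.0
        scores[i] = rel[i] * div  # unified irreplaceability score
    return scores

def select_frames(frame_feats, query_feat, budget):
    """Keep the `budget` highest-scoring frames, in temporal order
    (a plain top-k stand-in for the paper's budget-aware refinement)."""
    scores = irreplaceability(frame_feats, query_feat)
    return np.sort(np.argsort(scores)[-budget:])
```

On toy features, a near-duplicate of the most relevant frame scores close to zero, so the budget goes to a distinct frame rather than a redundant one, which is the intuition behind unifying relevance and diversity in one score.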
Problem

Research questions and friction points this paper is trying to address.

Video Understanding
Frame Selection
Computational Cost
Keyframe Sampling
Local Optima
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global Irreplaceability
Directed Diversity
Budget-Aware Refinement
Frame Selection
Video Large Language Models
Junpeng Ma
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Sashuai Zhou
Zhejiang University
Guanghao Li
Fudan University
Xin Gao
China University of Mining and Technology-Beijing
Yue Cao
Alibaba Group Holding Limited
Hengyu Zeng
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Yuxiang Yan
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University
Zhibin Wang
Zhejiang University
Jun Song
Shenzhen University
Bo Zheng
Researcher, Alibaba Group
Shanghang Zhang
Peking University
Jian Pu
Institute of Science and Technology for Brain-inspired Intelligence, Fudan University