🤖 AI Summary
This work addresses the challenge of efficiently selecting informative, globally representative frames from long videos under constrained visual-token budgets for video-based multimodal large language models (MLLMs). The authors propose a training-free, plug-and-play frame sampling framework that, for the first time, incorporates a linear-complexity Determinantal Point Process (DPP) into video sampling. They introduce a Group DPP importance metric that jointly guides non-redundant keyframe selection and dynamic resolution allocation within a task-conditioned feature space. The method balances global diversity against computational cost and consistently outperforms the next-best baselines across four video benchmarks, by 2.5 points under low token budgets and 1.6 points under high budgets. It is compatible with both open- and closed-source MLLMs.
📝 Abstract
Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
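To make the idea of query-aware DPP selection followed by budget allocation concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-frame feature vectors and query-relevance scores are already available, builds the common quality-diversity kernel L = diag(r) S diag(r) with cosine similarity S, and uses the standard fast greedy MAP procedure with incremental Cholesky updates, which never materializes the full N x N kernel. The function names `greedy_dpp` and `allocate_tokens` are hypothetical, and the marginal gain used to split the token budget is only a stand-in for the Group DPP importance metric, which the abstract does not define.

```python
import numpy as np

def greedy_dpp(features, relevance, k, eps=1e-10):
    """Greedy MAP selection for a DPP with kernel L = diag(r) S diag(r).

    features:  (N, d) frame embeddings (assumed given)
    relevance: (N,) non-negative query-relevance scores (assumed given)
    Returns the selected frame indices and their marginal gains.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    B = feats * relevance[:, None]          # L = B @ B.T, built row by row only
    n = B.shape[0]
    d2 = np.einsum("ij,ij->i", B, B)        # diagonal of L = initial marginal gains
    C = np.zeros((k, n))                    # rows of the incremental Cholesky factor
    selected, gains = [], []
    for t in range(k):
        j = int(np.argmax(d2))
        if d2[j] <= eps:                    # remaining frames add no new information
            break
        selected.append(j)
        gains.append(float(d2[j]))
        row = B @ B[j]                      # column j of L, O(N d)
        e = (row - C[:t].T @ C[:t, j]) / np.sqrt(d2[j])
        C[t] = e
        d2 = d2 - e ** 2                    # gains shrink for frames similar to j
        d2[j] = -np.inf                     # never pick the same frame twice
    return selected, gains

def allocate_tokens(gains, budget):
    """Split a visual-token budget in proportion to each frame's marginal gain.

    A frame that receives 0 tokens would be pruned; this proportional rule is
    an illustrative assumption, not the paper's allocation scheme.
    """
    w = np.asarray(gains, dtype=float)
    return np.floor(w / w.sum() * budget).astype(int)

# Toy example: frames 0 and 1 are duplicates, frame 3 is weakly relevant.
features = np.array([[1.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])
relevance = np.array([1.0, 1.0, 0.9, 0.5])
selected, gains = greedy_dpp(features, relevance, k=4)
tokens = allocate_tokens(gains, budget=100)
print(selected, tokens)
```

In the toy example the duplicate frame is skipped even though its relevance is as high as the first pick, and the weakly relevant frame receives the smallest share of the budget. Each greedy step costs O(N(d + k)) for N frames, so the total cost grows linearly in the number of frames for a fixed selection size.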