Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

📅 2025-03-16

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Long video understanding faces a fundamental trade-off between sampling density and information completeness: low-density sampling risks missing critical frames, while high-density sampling introduces redundancy and reduces efficiency. To address this, we propose Necessary Sampling Density (NSD), a novel metric quantifying the minimal frame sampling rate required for task completion, and introduce LSDBench—the first benchmark explicitly designed for high-NSD long-video evaluation. We further present Reasoning-driven Hierarchical Sampling (RHS), a framework that jointly performs global coarse-grained localization and local fine-grained dense sampling, enhanced by a lightweight semantic-guided frame selector for efficient frame filtering. On LSDBench, RHS achieves performance on par with or surpassing state-of-the-art models using, on average, 42% fewer sampled frames—demonstrating for the first time the feasibility and superiority of efficient, precise sampling under high-NSD conditions.

Technology Category

Application Category

📝 Abstract

The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the ``Sampling Dilemma'': low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain.

Problem

Research questions and friction points this paper is trying to address.

Challenges in processing long videos efficiently

Balancing sampling density to avoid redundancy and information loss

Evaluating and improving Large Vision-Language Models on high-NSD tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

LSDBench evaluates LVLMs on long-video tasks.

RHS framework combines global and local sampling.

Semantic-Guided Frame Selector reduces redundant frames.

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs