🤖 AI Summary
Long-video reasoning suffers from critical-information dilution under static frame sampling, while existing pixel-level interactive agents lack guarantees of evidence purity and temporal scalability. To address this, we propose the "Select Less, Reason More" framework, introducing Evidence-Aware Reinforcement Learning (EARL), the first approach to model Video LLMs as evidence interrogators that jointly optimize high-purity visual evidence selection and localized temporal resampling. Our method integrates dynamic keyframe selection, fine-grained temporal-region resampling, and end-to-end reinforcement training to enable evidence-driven adaptive inference. Evaluated on five mainstream long-video benchmarks, our open-weight 7B model achieves state-of-the-art performance among open-source models: 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME, demonstrating substantial gains in complex temporal reasoning.
📝 Abstract
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal: they lack rigorous reward mechanisms to enforce evidence purity, and they cannot supplement temporal information beyond the pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected keyframes to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves a new state of the art among open-source Video LLMs, while simultaneously learning an effective, high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
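To make the "select then locally re-sample" loop concrete, here is a minimal sketch in plain Python. It is an illustrative assumption of the interface, not the paper's actual implementation: the per-frame relevance scores, the top-k selection, and the `window`/`step` parameters of the localized re-sampler are all hypothetical names introduced for this example.

```python
# Hypothetical sketch of the "Select Less, Reason More" evidence loop:
# (1) pick a few high-relevance keyframes from a uniform pre-sample,
# (2) re-sample the video locally around each keyframe at finer
# temporal resolution. All names/parameters here are assumptions.

def select_keyframes(scores, k):
    """Return sorted indices of the k highest-relevance frames
    (keep only high-purity evidence, discard diluted context)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def localized_resample(keyframe_times, window=2.0, step=0.25, duration=None):
    """Around each selected keyframe timestamp, propose denser sample
    times within +/- `window` seconds at `step`-second resolution."""
    times = set()
    for t in keyframe_times:
        x = t - window
        while x <= t + window:
            if x >= 0.0 and (duration is None or x <= duration):
                times.add(round(x, 3))
            x += step
    return sorted(times)

# Example: 8 uniformly pre-sampled frames (one per 10 s of an 80 s clip)
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.1]  # toy relevance scores
keys = select_keyframes(scores, k=2)                # frames 2 and 5
key_times = [i * 10.0 for i in keys]                # 20.0 s and 50.0 s
dense = localized_resample(key_times, duration=80.0)
```

In a trained system the relevance scores would come from the Video LLM itself and the selection/re-sampling policy would be shaped by the RL reward; this sketch only shows the data flow that the framework describes.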