🤖 AI Summary
Long-video reasoning suffers from critical-information dilution under static frame sampling, while existing pixel-level interactive agents lack guarantees of evidence purity and temporal scalability. To address this, we propose the "Select Less, Reason More" framework, introducing Evidence-Aware Reinforcement Learning (EARL), the first approach to model Video LLMs as evidence interrogators that jointly optimize high-purity visual evidence selection and localized temporal resampling. Our method integrates dynamic keyframe selection, fine-grained temporal-region resampling, and end-to-end reinforcement training to enable evidence-driven adaptive inference. Evaluated on five mainstream long-video benchmarks, our open-weight 7B model achieves state-of-the-art performance among open-source models: 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME, demonstrating substantial gains in complex temporal reasoning.
📝 Abstract
Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal: they lack rigorous reward mechanisms to enforce evidence purity, and they cannot supplement temporal information beyond the pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected keyframes to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves a new state of the art among open-source Video LLMs, while simultaneously learning an effective, high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
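To make the "select then locally re-sample" loop concrete, here is a minimal sketch in plain Python. It is an illustrative assumption of the interface, not the paper's actual implementation: the per-frame relevance scores, the top-k selection, and the `window`/`step` parameters of the localized re-sampler are all hypothetical names introduced for this example.

```python
# Hypothetical sketch of the "Select Less, Reason More" evidence loop:
# (1) pick a few high-relevance keyframes from a uniform pre-sample,
# (2) re-sample the video locally around each keyframe at finer
# temporal resolution. All names/parameters here are assumptions.

def select_keyframes(scores, k):
    """Return sorted indices of the k highest-relevance frames
    (keep only high-purity evidence, discard diluted context)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def localized_resample(keyframe_times, window=2.0, step=0.25, duration=None):
    """Around each selected keyframe timestamp, propose denser sample
    times within +/- `window` seconds at `step`-second resolution."""
    times = set()
    for t in keyframe_times:
        x = t - window
        while x <= t + window:
            if x >= 0.0 and (duration is None or x <= duration):
                times.add(round(x, 3))
            x += step
    return sorted(times)

# Example: 8 uniformly pre-sampled frames (one per 10 s of an 80 s clip)
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.1]  # toy relevance scores
keys = select_keyframes(scores, k=2)                # frames 2 and 5
key_times = [i * 10.0 for i in keys]                # 20.0 s and 50.0 s
dense = localized_resample(key_times, duration=80.0)
```

In a trained system the relevance scores would come from the Video LLM itself and the selection/re-sampling policy would be shaped by the RL reward; this sketch only shows the data flow that the framework describes.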