🤖 AI Summary
This work addresses two key challenges in long-form video understanding: temporal redundancy that obscures critical cues and hallucination-prone reasoning when relying solely on textual inference. To tackle these issues, the authors propose Video-TwG, a novel framework featuring a “think-while-grounding” mechanism that enables video large language models to dynamically attend to question-relevant segments during reasoning. The approach employs a two-stage curriculum reinforcement learning strategy with a newly designed TwG-GRPO algorithm, integrating fine-grained grounding rewards, self-confirming pseudo-rewards, and an accuracy-gated mechanism to suppress redundant localization and enhance reasoning fidelity. Extensive experiments on Video-MME, LongVideoBench, and MLVU benchmarks demonstrate substantial performance gains over existing methods, confirming the framework’s effectiveness and generalization capability.
📝 Abstract
Long video understanding is challenging due to the rich and complicated multimodal clues spread over long temporal ranges. Current methods adopt text-form reasoning to improve the model's ability to analyze complex clues in long videos. However, text-only reasoning under a fixed video context may exacerbate hallucinations, since crucial detailed clues are often ignored under a limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum-reinforced framework built on a novel Think-with-Grounding paradigm that enables video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to general QA data with videos from diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning across varied kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism. Finally, we construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows that TwG-GRPO better leverages diverse unlabeled data, improving grounding quality and reducing redundant groundings without sacrificing QA performance.
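To make the reward design in TwG-GRPO concrete, here is a minimal sketch of how an accuracy-gated combination of a grounding reward and a self-confirmed pseudo reward might look. All function and parameter names (`twg_reward`, `w_ground`, `w_confirm`), the specific gating rule, and the weights are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Optional

def twg_reward(answer_correct: bool,
               grounding_iou: Optional[float],
               self_confirm_score: float,
               w_ground: float = 0.5,
               w_confirm: float = 0.3) -> float:
    """Hypothetical accuracy-gated reward in the spirit of TwG-GRPO.

    answer_correct     - whether the final answer matches the label
    grounding_iou      - temporal IoU of the predicted clip against a grounding
                         label (None when training on unlabeled general QA data)
    self_confirm_score - pseudo reward: how strongly the model confirms its
                         answer when re-reading only the grounded clips
    """
    base = 1.0 if answer_correct else 0.0
    # Accuracy gate: auxiliary grounding rewards count only when the answer is
    # correct, so the model is not paid for redundant or unhelpful groundings.
    if not answer_correct:
        return base
    # Fall back to the self-confirmed pseudo reward when no grounding labels
    # exist (stage two of the curriculum, diverse unlabeled videos).
    ground = grounding_iou if grounding_iou is not None else self_confirm_score
    return base + w_ground * ground + w_confirm * self_confirm_score
```

Under this sketch, a wrong answer yields zero regardless of grounding quality, which is one plausible way to realize the "accuracy-gated mechanism" that suppresses redundant localization.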