🤖 AI Summary
Existing video reasoning datasets lack challenging multi-hop questions and high-quality chain-of-thought (CoT) annotations, hindering the development of large vision-language models (LVLMs) for complex video understanding. To address this, we introduce ReWatch, a large-scale synthetic dataset designed for multi-hop video reasoning. ReWatch leverages a multi-agent ReAct framework that emulates human "re-watching" behavior, enabling the automatic generation of verifiable, video-grounded CoT trajectories. We further propose an Observation & Reasoning (O&R) dual-objective reward function and integrate it into Reinforcement Learning with Verifiable Reward (RLVR) to jointly optimize answer correctness and reasoning fidelity, effectively mitigating hallucination. After supervised fine-tuning and RLVR training, our method achieves state-of-the-art average performance across five challenging video reasoning benchmarks, with substantial improvements in multi-hop reasoning, temporal modeling, and video grounding accuracy.
📝 Abstract
While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to construct its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.
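To make the O&R reward idea concrete, here is a minimal sketch of how a dual-objective reward of this shape could be composed. This is not the paper's implementation: the exact-match answer check, the token-overlap grounding score against reference captions, and the mixing weight `alpha` are all illustrative assumptions.

```python
def answer_reward(pred: str, gold: str) -> float:
    """Verifiable answer reward: 1.0 on a normalized exact match, else 0.0.
    (Illustrative stand-in for whatever answer verifier the paper uses.)"""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def observation_reward(observations: list[str], reference_captions: list[str]) -> float:
    """Grounding reward: fraction of tokens in the model's stated observations
    that also appear in reference captions. Ungrounded (hallucinated) tokens
    drag the score down. Token overlap is an assumed, simplistic proxy."""
    ref_tokens = set(" ".join(reference_captions).lower().split())
    obs_tokens = [t for obs in observations for t in obs.lower().split()]
    if not obs_tokens:
        return 0.0
    return sum(t in ref_tokens for t in obs_tokens) / len(obs_tokens)


def o_and_r_reward(pred: str, gold: str,
                   observations: list[str], reference_captions: list[str],
                   alpha: float = 0.5) -> float:
    """Dual-objective reward: mixes answer correctness with how well the
    reasoning trace's observations are grounded in the video (here, captions).
    alpha is a hypothetical mixing weight."""
    return (alpha * answer_reward(pred, gold)
            + (1.0 - alpha) * observation_reward(observations, reference_captions))
```

A reward of this form would plug into an RLVR loop as the scalar signal per rollout, so a correct answer reached through an ungrounded trace is rewarded less than one whose cited observations match the video.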