Video-R1: Reinforcing Video Reasoning in MLLMs

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak video reasoning capability of multimodal large language models (MLLMs), the lack of temporal modeling in existing rule-based reinforcement learning (R1) methods, and the scarcity of high-quality video reasoning data, this paper proposes Video-R1, the first systematic exploration of the R1 paradigm for video reasoning. The method introduces: (1) T-GRPO, an RL algorithm that explicitly encourages models to exploit temporal information in videos; (2) an image-to-video transfer RL training paradigm that leverages abundant high-quality image reasoning data to alleviate the video annotation bottleneck; and (3) two open-source video reasoning datasets, Video-R1-COT-165k and Video-R1-260k. On VSI-Bench, a spatial reasoning benchmark, Video-R1-7B achieves 35.8% accuracy, surpassing GPT-4o, and it shows consistent gains across major video understanding benchmarks, including VideoMMMU, MVBench, and TempCompass.

📝 Abstract
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first systematic attempt to explore the R1 paradigm for eliciting video reasoning in multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We construct two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released.
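To make the abstract's two ingredients concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods use, plus a hypothetical temporal bonus in the spirit of T-GRPO (rewarding the model when it answers better on temporally ordered frames than on shuffled ones). The function names and the `alpha` hyperparameter are illustrative assumptions, not the paper's actual implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO-style RL: normalize each
    sampled response's reward against the mean/std of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def temporal_bonus(acc_ordered, acc_shuffled, alpha=0.3):
    """Hypothetical sketch of a temporal incentive: grant a bonus only
    when accuracy on temporally ordered frames beats accuracy on
    shuffled frames (alpha is an assumed hyperparameter)."""
    return alpha if acc_ordered > acc_shuffled else 0.0

# Example: rule-based 0/1 rewards for 4 sampled answers to one video question.
rewards = [1.0, 0.0, 1.0, 1.0]
bonus = temporal_bonus(acc_ordered=0.75, acc_shuffled=0.25)
adjusted = [r + (bonus if r > 0 else 0.0) for r in rewards]  # bonus only for correct answers
advantages = grpo_advantages(adjusted)
```

The group normalization removes the need for a learned value baseline, which is what makes rule-based rewards (answer correctness, format checks) sufficient for this family of methods.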
Problem

Research questions and friction points this paper is trying to address.

Enhancing video reasoning in MLLMs using rule-based RL
Addressing temporal modeling gaps in video reasoning tasks
Overcoming scarcity of high-quality video-reasoning training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

T-GRPO algorithm enhances temporal video reasoning
Combines image and video data for training
Achieves 35.8% on VSI-Bench, surpassing GPT-4o, with gains on other video benchmarks