Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited capability in complex vision-language reasoning—such as question-asking and self-reflection—and suffer from a scarcity of high-quality multimodal chain-of-thought (CoT) data. Method: This paper introduces Vision-R1, featuring (i) a 200K multimodal CoT cold-start dataset (Vision-R1-cold) built without human annotations; (ii) Progressive Thinking Suppression Training (PTST) to mitigate redundant reasoning; and (iii) a hard-formatting result reward function optimized via Group Relative Policy Optimization (GRPO) to learn correct, complex reasoning paths. Contribution/Results: Vision-R1 achieves 73.5% accuracy on MathVista—only 0.4 percentage points below OpenAI's O1—and delivers an average ~6% improvement across multimodal mathematical reasoning benchmarks. All data and code are publicly released.

📝 Abstract
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, the Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard-formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released at: https://github.com/Osilly/Vision-R1 .
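The PTST idea described above—suppressing overthinking early and relaxing the constraint as training progresses—can be sketched as a stage-dependent length gate on rollout rewards. This is an illustrative reading, not the paper's exact procedure; the token budgets per stage (`caps`) are hypothetical values chosen for the example.

```python
def ptst_reward(completion_len: int, base_reward: float, stage: int,
                caps: tuple = (4096, 8192, 16384)) -> float:
    """Stage-wise length gating (illustrative sketch of PTST).

    A rollout whose reasoning exceeds the current stage's token budget
    earns no reward, discouraging overthinking in early stages; later
    stages relax the limit so longer, more complex reasoning can emerge.
    The `caps` values here are assumed, not taken from the paper.
    """
    cap = caps[min(stage, len(caps) - 1)]  # later stages allow longer chains
    return base_reward if completion_len <= cap else 0.0
```

For example, a 5,000-token rollout would be zeroed out at stage 0 (budget 4,096) but rewarded normally at stage 1 (budget 8,192).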
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning in multimodal LLMs using Reinforcement Learning.
Addressing lack of high-quality multimodal reasoning data.
Improving complex reasoning like questioning and reflection in MLLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs a high-quality 200K multimodal CoT cold-start dataset without human annotations
Employs Progressive Thinking Suppression Training (PTST) to mitigate overthinking after cold start
Uses Group Relative Policy Optimization (GRPO) with a hard-formatting result reward function
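The hard-formatting result reward mentioned above can be sketched as an all-or-nothing check: the rollout earns reward only if it follows a strict output template and the extracted answer is correct. The `<think>/<answer>` tag template below is an assumption for illustration; the paper's exact format string may differ.

```python
import re

def hard_format_result_reward(completion: str, ground_truth: str) -> float:
    """Illustrative hard-formatting result reward (assumed template).

    Returns 1.0 only when the completion both matches the strict
    <think>...</think><answer>...</answer> template AND the extracted
    answer equals the ground truth; any deviation yields 0.0.
    """
    pattern = r"<think>.*?</think>\s*<answer>(.*?)</answer>"
    match = re.fullmatch(pattern, completion.strip(), flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed output: no partial credit
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Because the reward is binary on both format and correctness, GRPO's group-relative advantages push the policy toward rollouts that are simultaneously well-formed and right, rather than rewarding near-misses.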