🤖 AI Summary
This work investigates how to effectively transfer cognitively sophisticated behaviors—acquired by large language models (LLMs) via reinforcement learning with verifiable rewards—to multimodal large language models (MLLMs) to enhance visual reasoning. We propose a two-stage transfer paradigm: first, large-scale language-only cold-start fine-tuning on Qwen2.5-VL-7B to activate linguistic mental imagery and facilitate early behavioral transfer; second, nearly 1,000 steps of multimodal reinforcement learning guided by verifiable rewards to selectively amplify high-value visual behaviors (e.g., visual reflection) and suppress inefficient patterns. Our method achieves state-of-the-art results on MATH500 (95.3%), MathVision (51.8%), and MathVerse (54.6%), substantially outperforming existing approaches. To foster reproducibility and further research, we fully open-source the model, datasets, and training dynamics.
📝 Abstract
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement learning with verifiable rewards. This work investigates how to transfer this principle to multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early during cold start, driven by linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.