🤖 AI Summary
This work investigates how to effectively transfer cognitively sophisticated behaviors—acquired by large language models (LLMs) via reinforcement learning with verifiable rewards—to multimodal large language models (MLLMs) to enhance visual reasoning. We propose a two-stage transfer paradigm: first, large-scale language-only cold-start fine-tuning on Qwen2.5-VL-7B to activate linguistic mental imagery and facilitate early behavioral transfer; second, nearly 1,000 steps of multimodal reinforcement learning guided by verifiable rewards to selectively amplify high-value visual behaviors (e.g., visual reflection) and suppress inefficient patterns. Our method achieves state-of-the-art results on MATH500 (95.3%), MathVision (51.8%), and MathVerse (54.6%), substantially outperforming existing approaches. To foster reproducibility and further research, we fully open-source the model, datasets, and training dynamics.
📝 Abstract
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement learning with verifiable rewards. This work investigates how to transfer this principle to multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early during cold start, driven by linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.