OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal reasoning research is hindered by opaque data curation and non-reproducible training strategies. To address this, we propose an open, general two-stage training paradigm: (1) supervised fine-tuning (SFT) on a rigorously validated 874K-sample cold-start dataset, followed by (2) policy optimization via reinforcement learning (RL) on 74K cross-domain samples. Our approach integrates heterogeneous, multi-source data while explicitly co-designing data quality assurance and training methodology. Evaluated on nine major multimodal reasoning benchmarks, our method achieves a +11.6% average improvement over the Qwen2.5-VL-7B-Instruct baseline. Crucially, we fully open-source all code, datasets, and end-to-end training pipelines, establishing a systematic empirical foundation and infrastructure for reproducible, scalable multimodal reasoning research.

📝 Abstract
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipelines, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Problem

Research questions and friction points this paper is trying to address.

Developing transparent multimodal reasoning models with reproducible training strategies
Addressing data quality and scalable training for visual reasoning capabilities
Creating an open recipe for multimodal reasoning spanning SFT and RL stages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage recipe with SFT and RL
Cold-start dataset with step-by-step validation
Domain-diverse RL dataset for robust learning
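The RL stage described above performs policy optimization over groups of sampled responses. A minimal sketch of the group-relative advantage computation commonly used in such verifier-scored RL stages (GRPO-style normalization; the function name and toy rewards are illustrative, not taken from the OpenMMReasoner codebase):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-rollout rewards within one prompt's group.

    For each prompt, several responses are sampled and scored by a
    verifier; a response's advantage is its reward relative to the
    group mean, scaled by the group standard deviation. This gives a
    baseline-free advantage estimate used in GRPO-style updates.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for one prompt, binary verifier reward
# (1.0 = answer matched the reference, 0.0 = it did not).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct rollouts receive positive advantage, incorrect ones negative,
# so the policy gradient upweights the verified reasoning traces.
```

Within the recipe, these advantages would weight the log-probabilities of each response's tokens during the policy update; the SFT stage that precedes it is standard next-token supervised training on the validated cold-start traces.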