MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

📅 2026-01-29
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the weak performance of open-source vision-language models on complex multimodal reasoning tasks, such as STEM diagrams and visual puzzles, attributing it primarily to the scarcity of high-quality, long chain-of-thought annotated data. To overcome this, the authors build MMFineReason, a large-scale reasoning dataset comprising 1.8 million samples and 5.1 billion tokens, via a three-stage pipeline of standardized data collection, chain-of-thought generation, and difficulty-aware filtering. Notably, they find that a highest-quality subset of only 7% suffices to match full-dataset performance. Instruction tuning Qwen3-VL-Instruct on this distilled data yields MMFineReason-4B, which outperforms Qwen3-VL-8B-Thinking, and MMFineReason-8B, which surpasses Qwen3-VL-30B-A3B-Thinking and approaches Qwen3-VL-32B-Thinking, achieving state-of-the-art results at comparable scales and demonstrating that reasoning-oriented data can jointly enhance general capabilities.

📝 Abstract
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Finally, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
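The difficulty-aware selection step in the pipeline above can be sketched roughly as follows. This is a minimal illustration, assuming difficulty is estimated from the teacher model's pass rate over repeated solution attempts; the paper's actual filtering criterion and thresholds may differ.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    cot: str          # chain-of-thought rationale distilled from the teacher
    pass_rate: float  # fraction of teacher attempts that reached the correct answer

def difficulty_filter(samples, low=0.1, high=0.7, budget=None):
    """Keep samples the teacher solves sometimes, but not always.

    pass_rate near 1.0 -> trivially easy, little training signal
    pass_rate near 0.0 -> likely unsolvable or mislabeled
    Samples with pass_rate in (low, high] are kept, hardest first.
    """
    kept = [s for s in samples if low < s.pass_rate <= high]
    kept.sort(key=lambda s: s.pass_rate)  # lowest pass rate = hardest
    return kept[:budget] if budget else kept

data = [
    Sample("easy diagram", "...", pass_rate=1.0),
    Sample("hard puzzle", "...", pass_rate=0.25),
    Sample("broken item", "...", pass_rate=0.0),
    Sample("medium STEM", "...", pass_rate=0.6),
]
subset = difficulty_filter(data, budget=2)
print([s.question for s in subset])  # -> ['hard puzzle', 'medium STEM']
```

With a band of (0.1, 0.7], the trivially easy and the unsolvable items are both discarded, which mirrors how a small, well-chosen subset (7% in the paper) can match full-dataset performance.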
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
multimodal reasoning
Chain-of-Thought
reasoning data
STEM diagrams
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reasoning
Chain-of-Thought distillation
difficulty-aware filtering
data-centric AI
vision-language models
👥 Authors

Honglin Lin
SJTU

Zheng Liu
School of Computer Science
Geometry modeling, 3D deep learning, Computer graphics, 3D vision

Yun Zhu
Shanghai Artificial Intelligence Laboratory, OpenDataLab

Chonghan Qin
Shanghai Artificial Intelligence Laboratory, OpenDataLab; The University of Hong Kong

Juekai Lin
Shanghai Artificial Intelligence Laboratory, OpenDataLab

Xiaoran Shang
Shanghai Artificial Intelligence Laboratory, OpenDataLab

Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence

Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, HTSC, time-resolved

Lijun Wu
Shanghai AI Laboratory
ML, LLM, AI4Science