🤖 AI Summary
This work addresses the limited performance of open-source vision-language models on complex multimodal reasoning tasks—such as STEM diagrams and visual puzzles—attributed primarily to the scarcity of high-quality, long-chain-of-thought annotated data. To overcome this, the authors propose MMFineReason, a three-stage pipeline involving standardized data collection, chain-of-thought generation, and difficulty-aware filtering, which yields a large-scale reasoning dataset comprising 1.8 million samples and 5.1 billion tokens. Notably, they find that a carefully filtered subset of just 7% of the data suffices to match full-dataset performance. Leveraging this data for instruction tuning of Qwen3-VL-Instruct, the resulting MMFineReason-4B outperforms Qwen3-VL-8B-Thinking, while MMFineReason-8B surpasses Qwen3-VL-30B-A3B-Thinking and approaches Qwen3-VL-32B-Thinking, achieving state-of-the-art results at comparable scales and demonstrating that reasoning-oriented data can simultaneously enhance general capabilities.
📝 Abstract
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B models. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Finally, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.