🤖 AI Summary
This work addresses three longstanding challenges in multimodal reasoning: the lack of publicly available datasets, non-reproducible synthetic data generation methods, and the scarcity of vision-centric training data. We propose a two-stage visual reasoning data synthesis framework that jointly leverages vision-language models (VLMs) and reasoning-oriented large language models (LLMs) to generate millions of high-quality, multi-skill, multi-difficulty visual reasoning chains, supporting both offline and online reinforcement learning. To our knowledge, this is the first approach to achieve positive cross-modal transfer of purely vision-derived training data to text, audio, and embodied tasks. Fine-tuning Qwen2.5-VL-7B on our synthetic data yields state-of-the-art performance among open-source models, and competitive results against some closed-source models, on major visual understanding benchmarks, while significantly improving generalization across modalities. Our framework establishes a new paradigm and foundational resource for reproducible, scalable, vision-centered reasoning research.
📝 Abstract
Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages, one targeting scale and the other complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the rich and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that a Qwen2.5-VL-7B model fine-tuned on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench, and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating positive cross-modal transfer. Similarly, despite containing no video or embodied visual data, we observe notable gains on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.
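To make the two-stage trace synthesis concrete, here is a minimal Python sketch. The abstract specifies only that a VLM and a reasoning LLM are combined in two stages; the function names (`synthesize_trace`, `vlm`, `reasoner`), the describe-then-reason prompting, and the `Answer:` extraction convention are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Type aliases (assumed interfaces, not the paper's):
# a VLM maps (image path, prompt) -> text; a reasoning LLM maps prompt -> text.
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]

@dataclass
class ReasoningSample:
    question: str
    trace: str   # chain-of-thought produced by the reasoning LLM
    answer: str  # final answer parsed from the trace

def synthesize_trace(image_path: str, question: str,
                     vlm: VLM, reasoner: LLM) -> ReasoningSample:
    """Two-stage synthesis: (1) a VLM grounds the question in the image,
    (2) a text-only reasoning LLM produces the chain-of-thought trace."""
    # Stage 1: the VLM turns the visual evidence into a dense text description.
    description = vlm(
        image_path,
        f"Describe everything in the image relevant to answering: {question}",
    )
    # Stage 2: the reasoning LLM reasons over the description alone,
    # yielding a long-form CoT trace that ends with a final answer.
    trace = reasoner(
        "Using only the description below, think step by step and answer.\n"
        f"Description: {description}\nQuestion: {question}\n"
        "End with 'Answer:' followed by the final answer."
    )
    answer = trace.rsplit("Answer:", 1)[-1].strip()
    return ReasoningSample(question=question, trace=trace, answer=answer)
```

Any VLM/LLM backends can be plugged in as the two callables, which keeps the pipeline logic independent of the specific frontier models used to generate the traces.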
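The abstract's preference data is what drives the offline RL stage. Direct Preference Optimization (DPO) is one standard offline objective such data could support; the abstract does not name the exact algorithm, so the loss below is a hedged illustration rather than the paper's confirmed method.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each *_logps tensor holds the summed token log-probabilities of a full
    response under the policy or the frozen reference model."""
    # Log-ratio of policy to reference for the preferred and dispreferred traces.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and dispreferred responses.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

Because this objective needs only logged preference pairs and no rollouts from the current policy, it is one plausible reading of how a staged offline phase could approach online RL's performance at a fraction of the compute.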