🤖 AI Summary
To address data scarcity, training instability, and poor generalization of multimodal large language models (MLLMs) on long-horizon visual reasoning tasks, this work proposes: (1) a pipeline for automatically generating long, structured reasoning data, combining progressive reasoning-path synthesis with multi-granularity automated assessment; (2) a multi-agent training framework, comprising a reasoning agent and a summary agent, that decouples long-chain reasoning from judging and summarizing its results; and (3) an iterative Direct Preference Optimization (DPO) algorithm that improves the reasoning agent's generation stability and quality. Built on the popular LLaVA-NeXT model and a stronger base MLLM, the method achieves significant gains on challenging multi-modal benchmarks requiring visual reasoning, while maintaining or even improving performance on perception-focused tasks, showing that reasoning and perception capabilities in MLLMs can be enhanced jointly.
📝 Abstract
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data does not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.
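The abstract mentions an iterative DPO stage for stabilizing the reasoning agent but does not spell out the objective. As a rough illustration only, below is the standard DPO loss for a single (chosen, rejected) preference pair; the function name, argument names, and the default `beta` are illustrative assumptions, not details taken from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_w, logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w, ref_logp_l : frozen reference-model log-probs of the same responses
    beta                   : temperature controlling deviation from the reference
    """
    # Implicit reward margin: difference of log-ratios between policy and reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written via log1p/exp for numerical stability.
    return math.log1p(math.exp(-margin))
```

In an iterative scheme, one would repeatedly sample new responses from the current reasoning agent, build fresh preference pairs, and minimize this loss again, so the preference data tracks the improving policy.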