🤖 AI Summary
Current multimodal large language models are limited in long-chain visual reasoning by the scarcity of high-quality reasoning data and by inadequate training mechanisms. This work proposes a unified multi-agent visual reasoning framework that automatically generates multi-granularity, structured reasoning trajectories from images and videos, and establishes an iterative self-optimization loop by coordinating reasoning and summarization agents. The study introduces two novel algorithms, ST-GRPO and J-GRPO, which overcome the off-policy limitations of Direct Preference Optimization (DPO) and thereby enhance spatiotemporal reasoning capabilities. Built upon the LLaVA-NeXT and Qwen2.5-VL base models, the proposed approach significantly outperforms existing methods on both image and video long-chain reasoning benchmarks while maintaining strong general perception performance.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to the critical scarcity of high-quality, long-chain reasoning data and the lack of optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent that executes extensive analytical chains and a summary agent that critically evaluates and distills final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained its reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning-path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models such as LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
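ST-GRPO and J-GRPO build on the GRPO family of on-policy algorithms, whose defining step is normalizing each sampled trajectory's reward against its group's statistics rather than learning a separate value critic. The abstract does not specify the reward design of either variant, so the sketch below shows only the generic group-relative advantage computation that GRPO-style methods share; the function name and example rewards are illustrative, not from the paper.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Generic GRPO-style advantage: normalize each trajectory's reward
    by the mean and std of its sampled group (no learned value critic).

    This is a hedged sketch of the shared GRPO mechanism; the specific
    spatial-temporal rewards of ST-GRPO / J-GRPO are not shown.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group of sampled trajectories.
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against a zero-variance group (all rewards equal).
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four sampled reasoning trajectories:
# trajectories scoring above the group mean get positive advantage.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because advantages are centered within each group, they sum to zero, so the policy gradient pushes probability mass from below-average toward above-average trajectories sampled on-policy, which is the property the abstract contrasts with DPO's off-policy preference pairs.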