InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

📅 2025-12-21
🤖 AI Summary
Current open multimodal agents struggle with real-world tasks that require multi-step reasoning grounded in fine-grained visual details, such as analyzing documents dense with charts and diagrams or navigating maps. To address this, the paper introduces O3-Bench, a benchmark of problems that force an agent to piece together subtle visual information from distinct image regions through multi-step reasoning, and InSight-o3, a multi-agent framework pairing a visual reasoning agent (vReasoner) with a visual search agent (vSearcher). For the vSearcher, the authors define the task of generalized visual search, i.e., locating relational, fuzzy, or conceptual regions described in free-form language, and purpose-train a multimodal LLM for it via reinforcement learning. Used as a plug-and-play component, the vSearcher substantially improves the visual grounding and stepwise reasoning of frontier models serving as vReasoners: on O3-Bench, the method reaches 72.3% accuracy, well above OpenAI o3 (40.8%), with consistent gains on complex document understanding and map-navigation tasks.

📝 Abstract
The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .
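The abstract frames generalized visual search as mapping a free-form language description (relational, fuzzy, or conceptual, not just object names) to one or more image regions. As a rough illustration of what such an interface might look like, here is a minimal sketch; the type names (`Region`, `SearchResult`) and the `clamp_to_image` helper are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """A candidate image region returned for a visual search query."""
    x: int
    y: int
    w: int
    h: int
    label: str

@dataclass
class SearchResult:
    """A free-form query plus the regions grounded for it."""
    query: str
    regions: List[Region]

def clamp_to_image(image_size: Tuple[int, int], r: Region) -> Tuple[int, int, int, int]:
    """Clamp a possibly over-extended region to the image bounds
    before cropping it for downstream reasoning."""
    width, height = image_size
    x = max(0, min(r.x, width))
    y = max(0, min(r.y, height))
    return (x, y, min(r.w, width - x), min(r.h, height - y))

# Example: a relational query whose answer is a region, not an object class.
legend = Region(90, 90, 20, 20, "chart legend")
result = SearchResult("the legend nearest the top-right corner", [legend])
```

The key point the sketch captures is that the query is arbitrary language and the output is a set of regions, which the reasoning agent can then crop and inspect.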
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal AI's reasoning deficiency in complex visual tasks
Introduces benchmark for evaluating interleaved visual attention and reasoning
Proposes framework for generalized visual search beyond simple object recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with visual reasoning and search agents
Generalized visual search for relational and conceptual regions
Multimodal LLM trained via reinforcement learning for plug-and-play use
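The bullets above describe a vReasoner that calls a vSearcher as a plug-and-play tool. A minimal sketch of that control flow follows, with dictionary lookups standing in for the trained multimodal LLMs; the function names, query vocabulary, and return format are all assumptions for illustration, not the paper's actual API.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)

def v_searcher(image, query: str) -> List[Box]:
    """Stub vSearcher: the paper uses an RL-trained multimodal LLM to
    ground free-form queries; here a toy keyword table stands in."""
    toy_index: Dict[str, Box] = {
        "legend": (400, 20, 80, 40),
        "x-axis": (50, 460, 400, 30),
    }
    box = toy_index.get(query)
    return [box] if box is not None else []

def v_reasoner(image, question: str, max_steps: int = 3) -> Dict:
    """Stub vReasoner: decides which regions it needs, calls the
    vSearcher as a plug-in tool, and accumulates visual evidence."""
    evidence: Dict[str, List[Box]] = {}
    for query in ["legend", "x-axis"][:max_steps]:  # a real model plans these queries
        regions = v_searcher(image, query)
        if regions:
            evidence[query] = regions
    # A real vReasoner would condition its final answer on crops of the
    # retrieved regions; here we simply return the gathered evidence.
    return {"question": question, "evidence": evidence}
```

The design choice this mirrors is the separation of concerns: the reasoner plans and interprets, while the searcher specializes in grounding, which is what makes the searcher swappable across frontier models.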