🤖 AI Summary
Current open multimodal agents remain limited on real-world tasks that require multi-step reasoning grounded in fine-grained visual details, such as analyzing documents dense with charts and diagrams or navigating maps. To address this, the authors introduce O3-Bench, a benchmark of challenging problems that demand piecing together subtle visual information from distinct image areas through multi-step reasoning. They further propose InSight-o3, a multi-agent framework pairing a visual reasoning agent (vReasoner) with a visual search agent (vSearcher). For the vSearcher they formalize the task of generalized visual search, which covers locating relational, fuzzy, or conceptual regions described in free-form language, and they train a specialized multimodal LLM for it via reinforcement learning. As a plug-and-play agent, the vSearcher substantially improves visual grounding and sequential reasoning: on O3-Bench the method reaches 72.3% accuracy, well above OpenAI o3's 40.8%, with consistent gains on complex document understanding and map navigation tasks.
📝 Abstract
The ability of AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which obtains only 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher), for which we introduce the task of generalized visual search: locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .
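The vReasoner–vSearcher collaboration described above can be sketched as a simple control loop: the reasoner either answers or emits a free-form visual-search query, and the searcher grounds that query to an image region that is fed back as evidence. This is a minimal illustrative sketch; all class, function, and field names here are assumptions, not the actual InSight-o3 interfaces (see the linked repository for those).

```python
# Hypothetical sketch of the vReasoner-vSearcher loop. All names below
# (Region, solve, reason, search, the "action"/"query" keys) are
# illustrative assumptions, not the real InSight-o3 API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Region:
    """A cropped image region returned by the visual search agent."""
    box: tuple          # (x0, y0, x1, y1) in image coordinates
    description: str    # the free-form query that produced this crop

def solve(question: str,
          image: object,
          reason: Callable[[str, object, list], dict],
          search: Callable[[str, object], Region],
          max_steps: int = 5) -> str:
    """Iterate: the reasoner either answers or issues a visual-search
    query; the searcher grounds it to a region added as evidence."""
    evidence: list[Region] = []
    for _ in range(max_steps):
        step = reason(question, image, evidence)       # vReasoner call
        if step["action"] == "answer":
            return step["text"]
        # "search" action: ground a relational/fuzzy/conceptual query
        evidence.append(search(step["query"], image))  # vSearcher call
    return reason(question, image, evidence)["text"]   # forced answer

# Toy stand-ins to exercise the control flow (no real models involved)
def toy_reason(q, img, ev):
    if not ev:
        return {"action": "search",
                "query": "the legend of the top-left chart"}
    return {"action": "answer", "text": f"found {len(ev)} region(s)"}

def toy_search(query, img):
    return Region(box=(0, 0, 10, 10), description=query)

print(solve("What does the legend say?", None, toy_reason, toy_search))
# -> found 1 region(s)
```

The design point this sketch captures is that the searcher is plug-and-play: any reasoner that can emit search queries in free-form language can use it, which is how the paper reports improving frontier models without retraining them.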