🤖 AI Summary
This work addresses two challenges that multimodal agents face in knowledge-intensive visual reasoning: sparse supervision and unpredictable environments. The authors propose a simulation-to-reality (Sim-to-Real) training paradigm that moves policy learning into a deterministic, static sandbox environment. They introduce the first introspective reward mechanism grounded in cognitive processes, which rewards the agent for triggering multimodal or textual search only when it is visually or factually uncertain. By integrating reinforcement learning, multimodal reasoning, and process-oriented rewards, the method enables efficient agent training without interaction with real-world environments during training. The approach achieves new state-of-the-art performance, outperforming MMSearch-R1 by 5.1%, 6.3%, and 11.3% on the FVQA-test, InfoSeek, and MMSearch benchmarks, respectively.
📝 Abstract
Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search.
We decouple policy learning from the live web by training in a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision: initiating a multimodal or text search only when the agent is visually or factually uncertain. Extensive experiments demonstrate that our locally trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
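To make the introspective process-oriented reward concrete, the following is a minimal sketch of how such a signal could be computed for an agent's first decision. It is an illustration under stated assumptions, not the authors' implementation: the names `KnowledgeProbe`, `Action`, `expected_action`, and `process_reward`, the two boolean probe outcomes, and the reward values are all hypothetical, and the abstract does not specify how the probing or the reward shaping is actually done.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical first-step decisions available to the agent."""
    ANSWER = "answer"              # answer directly from parametric knowledge
    IMAGE_SEARCH = "image_search"  # multimodal search to resolve visual uncertainty
    TEXT_SEARCH = "text_search"    # text search to resolve factual uncertainty


@dataclass(frozen=True)
class KnowledgeProbe:
    """Assumed per-sample metadata from probing the model without tools."""
    recognizes_image: bool  # model identified the visual entity on its own
    knows_fact: bool        # model answered correctly once the entity was given


def expected_action(probe: KnowledgeProbe) -> Action:
    """Map probe outcomes to the decision the reward should favor (assumed rule)."""
    if not probe.recognizes_image:
        return Action.IMAGE_SEARCH
    if not probe.knows_fact:
        return Action.TEXT_SEARCH
    return Action.ANSWER


def process_reward(probe: KnowledgeProbe, action: Action,
                   bonus: float = 1.0, penalty: float = 0.0) -> float:
    """Return `bonus` when the agent's decision matches the probe-derived one."""
    return bonus if action is expected_action(probe) else penalty


# Example: the model recognizes the image but lacks the fact, so a text search is rewarded.
probe = KnowledgeProbe(recognizes_image=True, knows_fact=False)
reward_for_text_search = process_reward(probe, Action.TEXT_SEARCH)
reward_for_direct_answer = process_reward(probe, Action.ANSWER)
```

In this sketch the process reward is dense in the sense that every sample carries a target decision, independent of whether the final answer is correct. In practice it would presumably be combined with an outcome-based reward, but the weighting is not described in the abstract.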