ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

📅 2026-04-22

🤖 AI Summary

This work addresses two obstacles to training multimodal agents with reinforcement learning for knowledge-intensive visual reasoning: sparse outcome-based supervision and the unpredictability of live web environments. The authors propose a simulation-to-reality (Sim-to-Real) training paradigm that moves policy learning into a deterministic, local static sandbox. They introduce an introspective, process-oriented reward that probes the agent's own parametric knowledge boundaries and rewards the correct cognitive decision: initiating a multimodal or text search only when the agent is visually or factually uncertain. Training therefore requires no interaction with the live web, and the locally trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent reports new state-of-the-art results, outperforming MMSearch-R1 by 5.1% on FVQA-test, 6.3% on InfoSeek, and 11.3% on MMSearch.

📝 Abstract

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning from the live web by moving it into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision: initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.
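To make the idea of a process-oriented reward concrete, here is a minimal sketch of how such a signal might be computed. It is not the authors' implementation: the `Probe` fields, the `Action` names, the rule that visual uncertainty takes priority over factual uncertainty, and the reward magnitudes are all assumptions chosen for illustration. The abstract only states that the agent is rewarded for searching when it is visually or factually uncertain.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical first-step decisions available to the agent."""
    ANSWER = "answer"
    IMAGE_SEARCH = "image_search"
    TEXT_SEARCH = "text_search"


@dataclass(frozen=True)
class Probe:
    """Hypothetical result of probing the agent's parametric knowledge."""
    recognizes_entity: bool  # can the model identify what is in the image?
    knows_fact: bool         # can the model answer once the entity is known?


def expected_action(probe: Probe) -> Action:
    """Map a knowledge probe to the decision we would like to reward.

    Assumption: visual uncertainty is resolved first (image search),
    then factual uncertainty (text search); otherwise answer directly.
    """
    if not probe.recognizes_entity:
        return Action.IMAGE_SEARCH
    if not probe.knows_fact:
        return Action.TEXT_SEARCH
    return Action.ANSWER


def process_reward(probe: Probe, action: Action,
                   bonus: float = 0.5, penalty: float = -0.5) -> float:
    """Dense reward for the decision itself, independent of the final answer."""
    return bonus if action is expected_action(probe) else penalty


def total_reward(probe: Probe, action: Action, answer_correct: bool) -> float:
    """Combine the sparse outcome reward with the dense process reward."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + process_reward(probe, action)


if __name__ == "__main__":
    unsure_visually = Probe(recognizes_entity=False, knows_fact=True)
    print(total_reward(unsure_visually, Action.IMAGE_SEARCH, answer_correct=True))
    print(total_reward(unsure_visually, Action.ANSWER, answer_correct=True))
```

In this sketch, an agent that guesses correctly without searching still receives a lower total than one that searched when the probe indicated uncertainty, which is the kind of dense behavioral signal the abstract describes as a complement to sparse outcome-based supervision.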

Problem

Research questions and friction points this paper is trying to address.

multimodal agents

reinforcement learning

knowledge-intensive visual reasoning

sparse supervision

unpredictable web environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal search agent

process-oriented reward

Sim-to-Real training