🤖 AI Summary
This work addresses the limitations of existing autonomous information retrieval agents, which are predominantly confined to textual modalities and struggle to efficiently acquire and integrate information in multimodal environments. Key challenges include the trade-off between generalization and specialization in tool usage and the scarcity of multi-hop multimodal training data. To overcome these issues, the authors propose a modular multimodal retrieval agent that explicitly decouples information acquisition from answer generation. They introduce a retrieval-oriented multi-objective reinforcement learning framework that jointly optimizes factual accuracy, reasoning plausibility, and retrieval fidelity. Additionally, they construct MMSearchVQA, the first dataset designed to support multi-hop multimodal retrieval training. Experimental results demonstrate that the proposed approach significantly outperforms existing models on complex multimodal tasks, exhibiting strong transferability and reasoning capabilities.
📝 Abstract
Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M$^3$Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.
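The abstract describes a retrieval-oriented multi-objective reward combining factual accuracy, reasoning soundness, and retrieval fidelity. A minimal sketch of how such a reward might be composed is below; the weighted-sum form, the weight values, and the function/parameter names (`multi_objective_reward`, `reasoning_score`, `retrieval_recall`) are illustrative assumptions, not the paper's actual formulation.

```python
def multi_objective_reward(answer_correct: bool,
                           reasoning_score: float,
                           retrieval_recall: float,
                           w_acc: float = 0.5,
                           w_reason: float = 0.25,
                           w_retr: float = 0.25) -> float:
    """Hypothetical scalar RL reward combining three objectives.

    answer_correct   -- whether the final answer matches the reference
                        (factual accuracy term)
    reasoning_score  -- judged plausibility of the reasoning chain in [0, 1]
    retrieval_recall -- fraction of gold evidence items retrieved in [0, 1]

    All weights and the weighted-sum combination are assumptions for
    illustration only.
    """
    r_acc = 1.0 if answer_correct else 0.0
    return w_acc * r_acc + w_reason * reasoning_score + w_retr * retrieval_recall


# Example: correct answer, a mostly plausible reasoning chain (0.8),
# and 2 of 3 gold evidence items retrieved.
reward = multi_objective_reward(True, 0.8, 2 / 3)
```

Shaping the reward around retrieval fidelity (rather than answer correctness alone) is what makes the training "retrieval-centric": trajectories that reach the right answer without gathering the supporting evidence are penalized relative to those that do both.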