🤖 AI Summary
This work addresses the limitations of existing autonomous information retrieval agents, which are predominantly confined to textual modalities and struggle to efficiently acquire and integrate information in multimodal environments. Key challenges include the trade-off between generalization and specialization in tool usage and the scarcity of multi-hop multimodal training data. To overcome these issues, the authors propose a modular multimodal retrieval agent that explicitly decouples information acquisition from answer generation. They introduce a retrieval-oriented multi-objective reinforcement learning framework that jointly optimizes factual accuracy, reasoning plausibility, and retrieval fidelity. Additionally, they construct MMSearchVQA, the first dataset designed to support multi-hop multimodal retrieval training. Experimental results demonstrate that the proposed approach significantly outperforms existing models on complex multimodal tasks, exhibiting strong transferability and reasoning capabilities.
📝 Abstract
Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval fidelity. In addition, we develop MMSearchVQA, a multimodal multi-hop dataset to support retrieval-centric RL training. Experimental results demonstrate that M$^3$Searcher outperforms existing approaches and exhibits strong transfer adaptability and effective reasoning in complex multimodal tasks.
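The abstract describes a retrieval-oriented multi-objective reward combining factual accuracy, reasoning soundness, and retrieval fidelity. A minimal sketch of how such a reward might be composed is below; the weighted-sum form, the weight values, and the function/parameter names (`multi_objective_reward`, `reasoning_score`, `retrieval_recall`) are illustrative assumptions, not the paper's actual formulation.

```python
def multi_objective_reward(answer_correct: bool,
                           reasoning_score: float,
                           retrieval_recall: float,
                           w_acc: float = 0.5,
                           w_reason: float = 0.25,
                           w_retr: float = 0.25) -> float:
    """Hypothetical scalar RL reward combining three objectives.

    answer_correct   -- whether the final answer matches the reference
                        (factual accuracy term)
    reasoning_score  -- judged plausibility of the reasoning chain in [0, 1]
    retrieval_recall -- fraction of gold evidence items retrieved in [0, 1]

    All weights and the weighted-sum combination are assumptions for
    illustration only.
    """
    r_acc = 1.0 if answer_correct else 0.0
    return w_acc * r_acc + w_reason * reasoning_score + w_retr * retrieval_recall


# Example: correct answer, a mostly plausible reasoning chain (0.8),
# and 2 of 3 gold evidence items retrieved.
reward = multi_objective_reward(True, 0.8, 2 / 3)
```

Shaping the reward around retrieval fidelity (rather than answer correctness alone) is what makes the training "retrieval-centric": trajectories that reach the right answer without gathering the supporting evidence are penalized relative to those that do both.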