🤖 AI Summary
Existing static multimodal large language models struggle with long-horizon, multi-turn tool interactions in real-world web environments, which limits their ability to acquire dynamic information and handle complex multimodal tasks. This work proposes a training paradigm that combines supervised fine-tuning with reinforcement learning to transform static models into multimodal search agents capable of sustained, multi-round interactions, coordinating tools such as text search, image retrieval, and web browsing. To support this approach, we design an iterative data synthesis pipeline that generates high-quality multimodal question-answering data and introduce MM-SearchExam, the first benchmark specifically tailored to evaluating multimodal search capabilities. Experiments demonstrate that the resulting agent significantly outperforms existing open-source methods across multiple benchmarks and even surpasses several closed-source commercial models, excelling in particular on challenging tasks.
📝 Abstract
Large models are increasingly becoming autonomous agents that interact with real-world environments and use external tools to augment their static capabilities. However, most recent progress has focused on text-only large language models, which are limited to a single modality and therefore have narrower application scenarios. Multimodal large models, on the other hand, offer stronger perceptual capabilities but remain limited to static knowledge and lack the ability to access and leverage up-to-date web information. In this paper, we propose VSearcher, which turns a static multimodal model into a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments, including text search, image search, and web browsing, via reinforcement learning. Specifically, we introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions, which are further filtered with comprehensive metrics to ensure high quality and sufficient difficulty. We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments. In addition, we propose MM-SearchExam, a multimodal search benchmark dedicated to evaluating the search capabilities of multimodal search agents, which proves highly challenging even for recent proprietary models. Extensive evaluations across multiple multimodal search benchmarks demonstrate the effectiveness of our method: VSearcher outperforms recent multimodal search agents and even surpasses several proprietary models on multimodal web search tasks.
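The multi-turn tool-use loop the abstract describes can be pictured as a simple agent cycle: the model inspects the interaction history, either emits a tool call (text search, image search, or browse) or a final answer, and tool results are appended back to the history. The sketch below is illustrative only, assuming stub tools and a scripted stand-in policy; the tool names, policy, and message format are hypothetical, not VSearcher's actual implementation.

```python
from typing import Callable, Dict, List

# Hypothetical stub tools; a real agent would query live web APIs.
TOOLS: Dict[str, Callable[[str], str]] = {
    "text_search": lambda q: f"[text results for '{q}']",
    "image_search": lambda q: f"[image results for '{q}']",
    "browse": lambda url: f"[page content of {url}]",
}

def scripted_policy(history: List[dict]) -> dict:
    """Stand-in for the trained multimodal model: pick the next action.

    A real policy conditions on the full multimodal history; here we just
    walk a fixed tool sequence and then answer, to show the loop shape.
    """
    tool_turns = sum(1 for m in history if m["role"] == "tool")
    plan = [("text_search", "query"),
            ("image_search", "query"),
            ("browse", "http://example.com")]
    if tool_turns < len(plan):
        name, arg = plan[tool_turns]
        return {"type": "tool_call", "tool": name, "arg": arg}
    return {"type": "answer", "content": "answer synthesized from tool results"}

def run_agent(question: str, max_turns: int = 8) -> str:
    """Run the multi-turn loop until the policy answers or turns run out."""
    history: List[dict] = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        action = scripted_policy(history)
        if action["type"] == "answer":
            return action["content"]
        result = TOOLS[action["tool"]](action["arg"])
        history.append({"role": "tool", "content": result})
    return "max turns reached"
```

In this framing, SFT teaches the base model the action format (tool call vs. final answer), while RL optimizes which tools to call and when to stop across long horizons.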