🤖 AI Summary
Problem: Existing RAG and prompt-engineered search agents suffer from rigid workflows, low search efficiency, or excessive API calls in real-world internet environments. Method: We propose an end-to-end reinforcement learning (RL) framework tailored for large multimodal models (LMMs), enabling on-demand, multi-turn, joint text-and-image retrieval. Our approach integrates RL with RAG-style retrieval, multimodal visual question answering (VQA) data construction, and cross-modal search-tool orchestration. Contribution/Results: We introduce, for the first time, an outcome-oriented reward function coupled with an explicit search penalty, empowering the LMM to autonomously decide *whether* and *when* to search and eliminating fixed-step retrieval constraints. Experiments demonstrate substantial gains over same-scale RAG baselines across multiple knowledge-intensive tasks, matching the performance of a significantly larger RAG-based model while reducing search calls by over 30%, thereby markedly improving retrieval efficiency and practical deployability.
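To make the reward design concrete, below is a minimal sketch of an outcome-based reward with a search penalty. The function name, arguments, and penalty value are illustrative assumptions rather than the paper's actual implementation; the key idea is that each search call slightly discounts the reward, so the policy learns to search only when its own knowledge is insufficient.

```python
# Minimal sketch (assumed, not the authors' code): an outcome-based reward
# where each search call applies a small penalty, encouraging on-demand search.
def outcome_reward(answer_correct: bool, num_search_calls: int,
                   search_penalty: float = 0.1) -> float:
    """Scalar reward for one rollout: full credit for a correct final answer,
    minus a small penalty per search call."""
    base = 1.0 if answer_correct else 0.0
    return base - search_penalty * num_search_calls
```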
📝 Abstract
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
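As an illustration of the on-demand, multi-turn search behavior described above, the following sketch shows one possible rollout loop in which the policy either invokes an image-search or text-search tool or commits to a final answer at each turn. All names here (rollout, image_search, text_search, the action tags) are hypothetical stand-ins for whatever interface the framework actually uses, not the released code.

```python
from typing import Callable, Tuple

# Hypothetical rollout loop (illustrative only): at each turn the policy either
# calls a search tool or emits a final answer; search calls are counted so the
# outcome reward can penalize them.
def rollout(policy: Callable[[str], str],
            question: str,
            image_search: Callable[[str], str],
            text_search: Callable[[str], str],
            max_turns: int = 5) -> Tuple[str, int]:
    context, num_calls = question, 0
    for _ in range(max_turns):
        action = policy(context)  # e.g. "<image_search>", "<text_search> query", or "<answer> ..."
        if action.startswith("<answer>"):
            return action[len("<answer>"):].strip(), num_calls
        num_calls += 1
        if action.startswith("<image_search>"):
            context += "\n[image search results] " + image_search(context)
        else:
            query = action[len("<text_search>"):].strip()
            context += "\n[text search results] " + text_search(query)
    # Turn budget exhausted: force a final answer.
    return policy(context + "\n<answer_now>").strip(), num_calls
```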