DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from limited access to external knowledge, rigid search strategies, low-quality queries, and redundant API calls in knowledge-intensive multimodal search. To address these challenges, we propose the first MLLM framework that autonomously decides when to search and performs on-demand, multi-round retrieval over both text and image search tools. The framework introduces region-level visual query generation, iterative query refinement, and a self-reflection mechanism. Training follows a two-stage paradigm: cold-start supervised fine-tuning followed by online reinforcement learning, with both stages trained on our newly constructed DeepMMSearchVQA dataset, which features multi-hop samples that interleave textual and visual reasoning. Extensive experiments on multiple knowledge-intensive multimodal benchmarks show substantial improvements over state-of-the-art RAG systems, search agents, and retrieval-augmented MLLMs, confirming that the framework conducts multimodal web search more dynamically, precisely, and efficiently.
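The summary above describes a tool-calling loop rather than a fixed pipeline. Below is a minimal sketch of that control flow, assuming a model that emits one action per turn (answer, text search, or region-crop image search); the function names, action schema, and stub tools are illustrative assumptions, not the paper's actual interface.

```python
def text_search(query: str) -> str:
    """Stub text web-search tool; a real system would call a search API."""
    return f"[web results for: {query}]"

def image_search(crop) -> str:
    """Stub reverse image-search tool over a cropped region."""
    return "[pages visually similar to the cropped region]"

def crop_region(image, box):
    """Stub region crop; box is (x0, y0, x1, y1) in image coordinates."""
    return (image, box)

def answer_with_search(model, question, image, max_turns=4):
    """Let the model decide, turn by turn, whether to answer directly,
    issue a refined text query, or search with a crop of the input image."""
    context = [{"role": "user", "question": question, "image": image}]
    for _ in range(max_turns):
        step = model(context)                      # assumed to return an action dict
        if step["action"] == "answer":
            return step["content"]                 # answer without (further) search
        if step["action"] == "text_search":
            evidence = text_search(step["query"])  # model-crafted text query
        else:                                      # "image_search"
            evidence = image_search(crop_region(image, step["box"]))
        # Retrieved evidence goes back into the context so the model can
        # self-reflect and refine its next query instead of repeating calls.
        context.append({"role": "tool", "evidence": evidence})
    return model(context).get("content", "")
```

The self-reflection described above corresponds to the model conditioning on the appended tool evidence before choosing its next action.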

📝 Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised finetuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web search.
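To make the dataset description concrete, the sketch below shows one hypothetical DeepMMSearchVQA-style record with an interleaved reason-and-search trace. The field names, bounding box, and example facts are illustrative assumptions, not the released schema.

```python
# Hypothetical layout of one DeepMMSearchVQA-style training sample.
# Field names and values are illustrative only; the released dataset may differ.
sample = {
    "image": "concert_hall.jpg",
    "question": "Which firm designed the building shown and when did it open?",
    # Interleaved trace teaching when to search, what to search for,
    # which tool to use, and how to reason over the retrieved evidence.
    "trace": [
        {"type": "think", "text": "I should first identify the building."},
        {"type": "image_search", "box": [120, 40, 480, 600]},  # region-level visual query
        {"type": "observe", "text": "Results point to the Elbphilharmonie in Hamburg."},
        {"type": "text_search", "query": "Elbphilharmonie architect opening year"},
        {"type": "observe", "text": "Designed by Herzog & de Meuron; opened in 2017."},
        {"type": "answer", "text": "Herzog & de Meuron designed it; it opened in 2017."},
    ],
}
```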
Problem

Research questions and friction points this paper is trying to address.

Addresses rigid pipelines in multimodal search systems
Reduces excessive search calls through dynamic query crafting
Improves multimodal web search effectiveness with iterative adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM performs on-demand multi-turn web searches
Dynamically crafts queries for image and text search tools
Two-stage training with supervised finetuning and reinforcement learning (sketched below)
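The two-stage recipe in the last bullet can be pictured as the control flow below: a cold-start supervised pass over annotated search-and-reason traces, then online policy-gradient updates on live rollouts. The reward shaping (answer correctness minus a search-call penalty), the group-relative baseline, and the model interface are assumptions for illustration, not the paper's specification.

```python
def answer_reward(rollout) -> float:
    """Stub correctness reward on the final answer."""
    return 1.0 if rollout["correct"] else 0.0

def call_cost(rollout) -> float:
    """Stub penalty discouraging redundant search calls."""
    return 0.05 * rollout["num_search_calls"]

def cold_start_sft(model, traces, optimizer):
    """Stage 1: imitate annotated search-and-reason traces (cold start)."""
    for trace in traces:
        model.sft_step(trace, optimizer)           # assumed supervised update

def online_rl(model, prompts, optimizer, group_size=8):
    """Stage 2: reinforce rollouts that answer correctly with few search calls."""
    for prompt in prompts:
        rollouts = [model.rollout(prompt) for _ in range(group_size)]  # live tool use
        rewards = [answer_reward(r) - call_cost(r) for r in rollouts]
        baseline = sum(rewards) / len(rewards)     # group-relative baseline
        for r, rew in zip(rollouts, rewards):
            model.reinforce_step(r, rew - baseline, optimizer)  # assumed policy-gradient update
```

Penalizing each search call in the reward is one simple way to capture the paper's stated goal of avoiding excessive search calls while still rewarding correct, well-grounded answers.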