Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

📅 2026-01-29
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language foundation models are constrained by single-step, coarse-grained image-text retrieval, limiting their ability to effectively integrate multi-source evidence for deep reasoning in high-noise real-world scenarios. This work proposes a novel multi-turn, multi-entity, multi-scale visual-textual joint retrieval paradigm and introduces the first scalable multimodal deep research framework capable of supporting dozens of reasoning steps and hundreds of tool interactions. By combining cold-start supervision with reinforcement learning, the model internalizes deep research capabilities into a multimodal large language model, enabling end-to-end complex question answering. The proposed approach significantly outperforms current state-of-the-art methods, achieving superior performance on challenging multimodal tasks compared to workflows built upon closed-source models such as GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet.

📝 Abstract
Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, because their internal world knowledge is limited, prior work has augmented MLLMs with a ``reasoning-then-tool-call'' scheme over visual and textual search engines, obtaining substantial gains on tasks requiring extensive factual information. Yet these approaches typically define multimodal search in a naive setting, assuming that a single full-image or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address this, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly query real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
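The multi-turn, multi-entity retrieval loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the engine functions, the `Evidence` type, and the fixed query schedule are all assumptions, and a real agent would let the MLLM decide when to stop and how to refine queries between turns.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str
    content: str

def mock_image_search(query: str) -> list[Evidence]:
    # Stand-in for a visual search engine (full-image or entity-level query).
    return [Evidence(source="image_engine", content=f"visual match for '{query}'")]

def mock_text_search(query: str) -> list[Evidence]:
    # Stand-in for a textual search engine.
    return [Evidence(source="text_engine", content=f"passage about '{query}'")]

def deep_research(question: str, entities: list[str], max_turns: int = 3) -> list[Evidence]:
    """Multi-turn, multi-entity retrieval: each turn issues one visual and
    one textual query per entity and aggregates all returned evidence."""
    evidence: list[Evidence] = []
    for turn in range(max_turns):
        for entity in entities:
            evidence += mock_image_search(f"{entity} (refined, turn {turn})")
            evidence += mock_text_search(f"{entity} context for: {question}")
        # A real deep-research agent would reason over `evidence` here and
        # decide whether to answer, stop early, or reformulate queries.
    return evidence

pool = deep_research("Where was this building photographed?", ["clock tower", "bridge"])
print(len(pool))  # 3 turns x 2 entities x 2 engines = 12 pieces of evidence
```

The point of the sketch is the scaling: breadth (entities per turn) and depth (turns) multiply, which is how the paradigm reaches hundreds of engine interactions across dozens of reasoning steps.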
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
deep research
visual noise
evidence aggregation
complex question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal deep research
multi-turn search
multi-entity retrieval
reinforcement learning
vision-language reasoning