Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

📅 2026-02-02
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating multimodal large language models on joint vision-text retrieval tasks suffer from answer leakage and overly idealized scenarios, limiting their ability to reflect real-world performance. To address this, the work proposes VDR-Bench, a visual question answering benchmark of 2,000 carefully curated samples that underwent multi-stage human filtering and expert review, and introduces the first vision-centric evaluation framework. By designing questions that mitigate textual-cue leakage and interference from prior knowledge, and by integrating a multi-round, image-cropping-based retrieval mechanism, the framework more faithfully simulates complex vision-language retrieval scenarios. Experiments show that VDR-Bench effectively exposes the limitations of current models on practical visual retrieval tasks, and that the proposed multi-round cropping strategy significantly improves model performance in complex visual information acquisition.
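
The multi-round, image-cropping-based retrieval mechanism can be pictured as a simple agent loop: the model repeatedly selects a sub-region of the query image, searches with that crop, and decides whether it has gathered enough evidence to answer. The sketch below is only an illustration of that idea under assumed interfaces; the callables `propose_crop`, `image_search`, and `try_answer` are hypothetical placeholders for an MLLM call and a visual search backend, not the paper's released API.

```python
# Minimal sketch of a multi-round cropped-search loop (an illustration, not the
# authors' released code). The callables are hypothetical placeholders and are
# passed in so the loop itself stays self-contained and runnable.
from typing import Callable
from PIL import Image

def cropped_search(
    image: Image.Image,
    question: str,
    propose_crop: Callable,   # (image, question, evidence) -> (left, top, right, bottom)
    image_search: Callable,   # (crop) -> list of retrieved text snippets
    try_answer: Callable,     # (question, evidence, force=...) -> answer str or None
    max_rounds: int = 4,
) -> str:
    evidence: list[str] = []  # retrieved snippets accumulated across rounds
    for _ in range(max_rounds):
        # Ask the model which sub-region to search next, given what it has seen so far.
        box = propose_crop(image, question, evidence)
        crop = image.crop(box)

        # Search with the cropped region rather than the full image, so the
        # answer cannot come from a near-exact whole-image match.
        evidence.extend(image_search(crop))

        # Stop as soon as the model judges the evidence sufficient.
        answer = try_answer(question, evidence)
        if answer is not None:
            return answer
    # Out of rounds: answer with whatever evidence was collected.
    return try_answer(question, evidence, force=True)
```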

📝 Abstract
Multimodal Large Language Models (MLLMs) have advanced visual question answering (VQA) and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through textual cues in the questions or can be inferred from the prior world knowledge already encoded in current MLLMs. Second, they rely on overly idealized evaluation scenarios: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic, real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
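
For concreteness, scoring a system on a benchmark of this kind usually reduces to iterating over the 2,000 question-image-answer instances and comparing predictions against references. The snippet below is a hedged sketch of such a harness; the file name `vdr_bench.jsonl`, the field names, and the exact-match metric are assumptions made for illustration, not the released data format.

```python
# Hypothetical evaluation loop for a VDR-Bench-style VQA benchmark.
# File name, field names ("image", "question", "answer"), and the
# exact-match metric are illustrative assumptions.
import json

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())

def evaluate(pred_fn, benchmark_path: str = "vdr_bench.jsonl") -> float:
    correct, total = 0, 0
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # pred_fn is any system under test: it maps (image, question) to an answer string.
            pred = pred_fn(sample["image"], sample["question"])
            correct += normalize(pred) == normalize(sample["answer"])
            total += 1
    return correct / max(total, 1)  # accuracy over the benchmark instances
```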
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
visual search
textual search
benchmark evaluation
vision-deepresearch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-DeepResearch
multimodal retrieval
visual search
benchmark design
cropped-search workflow