🤖 AI Summary
Existing multimodal browsing benchmarks over-rely on shallow image retrieval and adjacent text matching, failing to assess fine-grained visual reasoning, provenance verification, and long-horizon tool orchestration. To address this, we introduce MMSearch-Plus, a challenging benchmark grounded in realistic browsing behavior, comprising 311 tasks that require iterative multimodal search, localized visual reasoning (e.g., micro-text, layout, temporal cues), and cross-modal provenance tracing. We propose Spatial-Temporal Extrapolation, a curation method that generates questions whose answers lie outside the image itself, increasing demands on long-horizon planning and cross-source validation. We design a model-agnostic agent framework integrating image search, text masking, bounding-box localization, and cropped-image retrieval, enabling evaluation of both closed- and open-weight multimodal LLMs. Experiments show the strongest agent (o3) improves from 15.1% accuracy without search to 36.0% with search, while Qwen-2.5-VL-72B-Instruct achieves only 6.9% after 20 search steps, revealing systematic bottlenecks in source verification, part-level reasoning, and long-horizon planning.
📝 Abstract
Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text, masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that place strong demands on multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% accuracy without search and 36.0% with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.
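The model-agnostic agent loop described above (an MLLM iteratively invoking browsing tools such as image search and cropped-image retrieval under a step budget) can be sketched in miniature. This is an illustrative outline only: the tool names, `AgentState` fields, and `policy` interface below are assumptions for exposition, not the authors' actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Minimal rollout state: the question plus accumulated tool observations."""
    question: str
    evidence: list = field(default_factory=list)
    steps: int = 0

def run_agent(question, tools, policy, max_steps=20):
    """Iterate tool calls until the policy emits an answer or the budget runs out.

    `tools` maps tool names (e.g. "image_search", "crop_search") to callables;
    `policy` stands in for the MLLM choosing the next action from the state.
    """
    state = AgentState(question=question)
    while state.steps < max_steps:
        action, arg = policy(state)
        if action == "answer":
            return arg
        # Append the tool's observation so the policy can condition on it next turn.
        state.evidence.append(tools[action](arg))
        state.steps += 1
    return None  # step budget exhausted without an answer

# Toy run with stub tools and a scripted policy (a real agent queries an MLLM here).
tools = {
    "image_search": lambda q: f"search results for: {q}",
    "crop_search": lambda box: f"reverse-image results for crop {box}",
}

def policy(state):
    if not state.evidence:
        return "image_search", "stadium signage"
    return "answer", "2023 final"

print(run_agent("Which event is shown in the photo?", tools, policy))  # → 2023 final
```

The step cap mirrors the 20-round search budget reported for open-weight models; swapping `policy` for different MLLM backends is what makes the framework model-agnostic.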