🤖 AI Summary
Existing multimodal browsing benchmarks over-rely on shallow image retrieval and adjacent text matching, failing to assess fine-grained visual reasoning, provenance verification, and long-horizon tool orchestration. To address this, we introduce MMSearch-Plus, a challenging benchmark grounded in realistic browsing behavior, comprising 311 tasks that require iterative multimodal search, localized visual reasoning (e.g., micro-text, layout, temporal cues), and cross-modal provenance tracing. We propose Spatial-Temporal Extrapolation, a curation method that generates questions whose answers lie outside the image itself, increasing demands on long-horizon planning and cross-source validation. We design a model-agnostic agent framework integrating image search, text masking, bounding-box localization, and cropped-image retrieval, enabling evaluation of both closed- and open-weight multimodal LLMs. Experiments show the strongest agent (o3) improves from 15.1% accuracy without search to 36.0% with search, while Qwen-2.5-VL-72B-Instruct achieves only 6.9% after 20 search steps, revealing systematic bottlenecks in source verification, part-level reasoning, and long-horizon planning.
📝 Abstract
Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text, masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that place strong demands on multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% accuracy without search and 36.0% with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.
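The model-agnostic agent loop described above (an MLLM iteratively invoking browsing tools such as image search and cropped-image retrieval under a step budget) can be sketched in miniature. This is an illustrative outline only: the tool names, `AgentState` fields, and `policy` interface below are assumptions for exposition, not the authors' actual API.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Minimal rollout state: the question plus accumulated tool observations."""
    question: str
    evidence: list = field(default_factory=list)
    steps: int = 0

def run_agent(question, tools, policy, max_steps=20):
    """Iterate tool calls until the policy emits an answer or the budget runs out.

    `tools` maps tool names (e.g. "image_search", "crop_search") to callables;
    `policy` stands in for the MLLM choosing the next action from the state.
    """
    state = AgentState(question=question)
    while state.steps < max_steps:
        action, arg = policy(state)
        if action == "answer":
            return arg
        # Append the tool's observation so the policy can condition on it next turn.
        state.evidence.append(tools[action](arg))
        state.steps += 1
    return None  # step budget exhausted without an answer

# Toy run with stub tools and a scripted policy (a real agent queries an MLLM here).
tools = {
    "image_search": lambda q: f"search results for: {q}",
    "crop_search": lambda box: f"reverse-image results for crop {box}",
}

def policy(state):
    if not state.evidence:
        return "image_search", "stadium signage"
    return "answer", "2023 final"

print(run_agent("Which event is shown in the photo?", tools, policy))  # → 2023 final
```

The step cap mirrors the 20-round search budget reported for open-weight models; swapping `policy` for different MLLM backends is what makes the framework model-agnostic.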