AI Summary
Existing evaluation methods struggle to effectively assess the correctness, efficiency, and procedural rationality of multimodal large language models when jointly invoking visual tools and web search in real-world tasks. To address this gap, this work introduces Agentic-MME, a process-verifiable benchmark for multimodal agents comprising 418 realistic tasks and over 2,000 human-annotated step-level checkpoints. The benchmark supports sandboxed code and API execution and proposes a dual-axis (S-axis and V-axis) reference trajectory framework to enable fine-grained process auditing. It is the first to support automatic, process-oriented evaluation of intermediate states and introduces an "overthinking" metric to quantify reasoning efficiency. Experiments reveal that even the best-performing model, Gemini3-pro, achieves only 56.3% overall accuracy, which drops sharply to 23.0% on the most challenging tasks, highlighting significant limitations in complex real-world scenarios.
Abstract
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verifiable benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints, with manual annotation averaging 10+ person-hours per task. Each task comes with a unified evaluation framework supporting sandboxed code and API execution, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: an S-axis and a V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which drops sharply to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
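To make the process-level evaluation idea concrete, the sketch below shows one way step-level checkpoint auditing and an overthinking ratio could be computed. The `Checkpoint` fields, the keyword-matching audit, and the step-count ratio are illustrative assumptions for this sketch, not the benchmark's actual scoring code or metric definitions.

```python
# Minimal sketch of process auditing over an agent trajectory.
# Data structures, field names, and formulas here are assumptions for
# illustration; they are NOT Agentic-MME's actual implementation.

from dataclasses import dataclass


@dataclass
class Checkpoint:
    """A human-annotated step-level checkpoint on the reference trajectory."""
    axis: str            # "S" or "V" (assumed to mark the search vs. visual axis)
    description: str     # what the intermediate state should contain
    keywords: list[str]  # evidence expected in the agent's intermediate output


def audit_trajectory(agent_steps: list[str], checkpoints: list[Checkpoint]) -> float:
    """Fraction of reference checkpoints matched by any intermediate agent step.

    A checkpoint counts as hit if all of its expected evidence keywords appear
    in at least one step of the agent's trajectory -- a crude stand-in for the
    benchmark's fine-grained intermediate-state verification.
    """
    if not checkpoints:
        return 0.0
    hits = 0
    for cp in checkpoints:
        if any(all(k.lower() in step.lower() for k in cp.keywords) for step in agent_steps):
            hits += 1
    return hits / len(checkpoints)


def overthinking_ratio(agent_steps: list[str], human_reference_steps: list[str]) -> float:
    """One plausible reading of an overthinking metric: how many more steps the
    agent takes than the annotated human trajectory (1.0 = no extra steps)."""
    return len(agent_steps) / max(len(human_reference_steps), 1)


if __name__ == "__main__":
    # Toy example: one checkpoint per axis.
    cps = [
        Checkpoint(axis="V", description="crop the region of interest", keywords=["crop"]),
        Checkpoint(axis="S", description="search for the release year", keywords=["search", "year"]),
    ]
    steps = ["crop the poster region", "search the web for the release year", "re-check the crop"]
    print(audit_trajectory(steps, cps))                          # 1.0 -> both checkpoints covered
    print(overthinking_ratio(steps, ["crop", "search year"]))    # 1.5 -> one extra step vs. human
```

In this reading, checkpoint coverage captures whether the right intermediate states were actually reached, while the step-count ratio penalizes trajectories that wander well beyond the human reference; the benchmark's real auditors may of course use richer state matching than keyword overlap.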