AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

πŸ“… 2026-02-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing benchmarks struggle to evaluate multimodal agents' capacity for long-horizon, multi-tool collaborative reasoning and manipulation in complex, real-world visual environments. To address this gap, this work introduces a comprehensive multimodal evaluation benchmark spanning seven major categories and 25 subdomains, featuring highly challenging, realistic, and extended tool-use tasks that emphasize fine-grained visual understanding and naturalistic tool composition. The framework integrates diverse multimodal interaction capabilities, including web search, image retrieval, page navigation, and code-driven image processing alongside general-purpose programming. Evaluation of state-of-the-art models reveals an overall accuracy of at most 27.3%, with some tasks requiring more than 25 tool invocations, underscoring a significant performance deficit in current multimodal agents when confronted with complex, realistic scenarios.

πŸ“ Abstract
Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
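As a rough illustration (not taken from the paper), the sketch below shows the kind of long-horizon tool-calling loop such an agent might run during evaluation. The tool names (web_search, image_search, navigate_page, run_code), the agent interface, and the turn cap are hypothetical stand-ins for the interaction style AgentVista describes.

```python
# Minimal sketch of a long-horizon multimodal tool-use loop (illustrative only;
# tool names and the agent interface are assumptions, not the benchmark's API).
from dataclasses import dataclass, field
from typing import Callable, Union

@dataclass
class ToolCall:
    name: str   # which tool the agent wants to invoke
    args: dict  # keyword arguments for that tool

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # (tool_name, result) pairs

def run_episode(agent_step: Callable[[AgentState], Union[ToolCall, str]],
                tools: dict,
                task: str,
                max_turns: int = 30) -> str:
    """Run one task: keep calling tools until the agent returns a final answer."""
    state = AgentState(task=task)
    for _ in range(max_turns):          # hard instances may need 25+ tool-calling turns
        action = agent_step(state)
        if isinstance(action, str):     # the agent produced a final answer
            return action
        result = tools[action.name](**action.args)
        state.history.append((action.name, result))
    return "max turns exceeded"

# Stub tools standing in for the benchmark's real environments.
tools = {
    "web_search":    lambda query: f"results for {query!r}",
    "image_search":  lambda image: f"visually similar pages for {image!r}",
    "navigate_page": lambda url: f"content of {url}",
    "run_code":      lambda code: "stdout of executed code",  # image processing or general programming
}

# Trivial scripted agent: one search, then an answer grounded in the last result.
def demo_agent(state: AgentState):
    if not state.history:
        return ToolCall("web_search", {"query": state.task})
    return "final answer based on " + state.history[-1][1]

print(run_episode(demo_agent, tools, "identify the landmark in photo.jpg"))
```

In practice the agent would be a multimodal model whose tool calls are parsed from its output, and the stubbed tools would be real search, browsing, and code-execution backends; only the final answer string would be scored against the task's reference.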
Problem

Research questions and friction points this paper is trying to address.

multimodal agents
realistic visual scenarios
long-horizon tool use
benchmark evaluation
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agents
long-horizon tool use
realistic visual scenarios
AgentVista
cross-modal reasoning