Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks primarily evaluate single-turn, synthetic, unimodal tasks, failing to assess multimodal agents' multi-step, deep visual reasoning capabilities in real-world scenarios. Method: We introduce Agent-X, a real-world-oriented, vision-centric agent benchmark for multi-step reasoning, comprising 828 multimodal tasks (images, multi-image comparisons, videos, and instructional text) across six environments, including web browsing and autonomous driving. We propose a fine-grained, step-level evaluation framework that quantifies reasoning quality along two dimensions: the correctness and logical coherence of each reasoning step, and the effectiveness of tool invocation throughout the task. Contribution/Results: Experiments reveal that even the best-performing large multimodal models, including the GPT, Gemini, and Qwen families, achieve less than 50% full-chain success on end-to-end tasks, exposing fundamental limitations in long-horizon visual reasoning and tool-augmented collaboration.

📝 Abstract
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X
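
The step-level evaluation described above can be read as scoring each reasoning step on correctness, logical coherence, and tool effectiveness, with full-chain success requiring every step to hold. Below is a minimal Python sketch of that idea; the dataclass fields, pass criterion, and aggregation are assumptions for illustration, not the paper's actual rubric or implementation:

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    correct: bool          # is the step's intermediate result right?
    coherent: bool         # does it follow logically from prior steps?
    tool_effective: bool   # did the chosen tool call actually help?

def step_passes(s: StepScore) -> bool:
    # A step counts only if all three step-level criteria hold.
    return s.correct and s.coherent and s.tool_effective

def full_chain_success(steps: list[StepScore]) -> bool:
    # End-to-end success requires every step in the chain to pass --
    # the metric under which top models reportedly score below 50%.
    return all(step_passes(s) for s in steps)

def step_accuracy(steps: list[StepScore]) -> float:
    # A softer, per-step metric: the fraction of individual steps that pass.
    return sum(step_passes(s) for s in steps) / max(len(steps), 1)
```

Under this reading, sub-50% full-chain numbers are plausible even for strong models: a single flawed reasoning step or ineffective tool call fails the whole chain.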
Problem

Research questions and friction points this paper is trying to address.

Evaluating deep multimodal reasoning in vision-centric agentic tasks
Addressing lack of benchmarks for multi-step reasoning in real-world settings
Assessing reasoning quality and tool use effectiveness in diverse environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark for vision-centric agents
Fine-grained step-level evaluation framework
Integration of tool use with explicit, stepwise decision-making (see the illustrative task schema below)
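
For intuition about how a benchmark task might couple authentic visual context with an annotated, tool-augmented reasoning chain, here is a purely illustrative record schema in Python; all class and field names are assumptions, not Agent-X's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class ToolStep:
    thought: str       # the agent's explicit reasoning before acting
    tool: str          # tool invoked at this step, e.g. a detector or OCR
    tool_input: dict   # arguments passed to the tool
    observation: str   # what the tool returned to the agent

@dataclass
class AgentTask:
    environment: str   # one of the six settings, e.g. "web browsing"
    media: list[str]   # image, multi-image, or video inputs
    instruction: str   # natural-language task description
    reference_chain: list[ToolStep] = field(default_factory=list)  # annotated gold steps
    final_answer: str = ""  # expected end result

# Hypothetical example: a driving task with a two-step gold chain.
task = AgentTask(
    environment="autonomous driving",
    media=["dashcam_clip.mp4"],
    instruction="Is it safe for the ego vehicle to change lanes? Justify each step.",
    reference_chain=[
        ToolStep("Locate nearby vehicles.", "object_detector",
                 {"frames": "all"}, "car in left lane, 8 m behind"),
        ToolStep("Estimate closing speed.", "calculator",
                 {"expr": "8 / 0.5"}, "16 m/s"),
    ],
    final_answer="No; the left-lane vehicle is closing too fast.",
)
```

A schema like this makes the step-level scoring above concrete: each predicted step can be matched against a `reference_chain` entry for correctness, coherence, and tool effectiveness.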