VisualActBench: Can VLMs See and Act like a Human?

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) show limited ability to reason proactively and make autonomous decisions from visual input alone, without textual prompts. Method: We introduce VisualActBench, a large-scale benchmark for the newly defined Visual Action Reasoning task, comprising 1,074 real-world videos and 3,733 human-annotated actions across four scenarios. Each action carries a dual-dimensional label: an Action Prioritization Level (APL) and a proactive-reactive type. A zero-/few-shot visual action generation framework evaluates models against these annotations. Contribution/Results: Evaluation of 29 state-of-the-art VLMs, including GPT-4o, reveals a significant gap versus human annotators, particularly in generating proactive, high-priority actions, pointing to limitations in context interpretation, outcome anticipation, and alignment with human decision-making. The benchmark, task formulation, and evaluation paradigm provide a foundation for advancing VLMs toward human-like proactive intelligence.

📝 Abstract
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT-4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
Problem

Research questions and friction points this paper is trying to address.

Assess VLMs' ability to reason and act from visual inputs without text prompts
Evaluate models' human-aligned reasoning and value sensitivity in real-world scenarios
Identify gaps in VLMs' context interpretation, outcome anticipation, and human decision alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Visual Action Reasoning task
Proposes VisualActBench, a benchmark of 1,074 videos and 3,733 human-annotated actions across four real-world scenarios
Evaluates 29 VLMs on human-aligned action prioritization and proactive-reactive reasoning (a sketch of such an evaluation loop follows this list)