From Perception to Action: An Interactive Benchmark for Vision Reasoning

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of vision-language models (VLMs) are largely confined to static, single-turn tasks and fail to assess their capacity for causal reasoning and long-horizon action planning grounded in physical structure (geometric configurations, contact, and support relationships) in dynamic environments. To address this gap, this work introduces CHAIN, the first interactive, 3D, physics-driven benchmark centered on causal hierarchical reasoning. CHAIN evaluates models' ability to understand and execute structured action sequences under physical constraints through tasks such as mechanical assembly and block stacking within a unified simulation environment. Experiments reveal that even state-of-the-art VLMs struggle to internalize physical causal constraints, exhibiting pronounced deficiencies in long-horizon planning and in translating perception into actionable decisions, thereby underscoring the necessity of such a benchmark for embodied-intelligence evaluation.

📝 Abstract
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
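The abstract describes this interactive protocol only at a high level, and CHAIN's concrete interface is not given on this page. As a rough illustration of the multi-turn perception-plan-act loop such a benchmark implies, the Python sketch below runs a toy stacking episode; StackingEnv, propose_action, and the support rule are all hypothetical stand-ins, with a scripted heuristic in place of a real VLM call.

```python
from dataclasses import dataclass, field

@dataclass
class StackingEnv:
    """Toy stand-in for a physics-driven stacking task (not CHAIN's API)."""
    goal_height: int = 3
    tower: list = field(default_factory=list)  # widths of stacked blocks

    def observe(self) -> dict:
        # A real benchmark would render images; this toy returns raw state.
        return {"tower": list(self.tower), "goal_height": self.goal_height}

    def step(self, action: str) -> tuple[dict, bool]:
        # Crude support rule: a block wider than the one beneath it topples
        # the tower, modelling the kind of physical constraint CHAIN tests.
        width = {"wide": 3, "medium": 2, "narrow": 1}[action]
        if self.tower and width > self.tower[-1]:
            self.tower.clear()
        else:
            self.tower.append(width)
        done = len(self.tower) >= self.goal_height
        return self.observe(), done

def propose_action(obs: dict) -> str:
    # Stand-in for a VLM call: pick the widest block the top can support.
    top_width = obs["tower"][-1] if obs["tower"] else 3
    return {3: "wide", 2: "medium", 1: "narrow"}[top_width]

env = StackingEnv()
obs, done = env.observe(), False
for _ in range(10):  # bounded number of interaction turns
    obs, done = env.step(propose_action(obs))
    if done:
        break
print("solved" if done else "failed", obs["tower"])
```

The contrast with single-turn VQA is the feedback: each action's physical consequences change the next observation, so a model must keep its plan consistent with the accumulating structural constraints across turns.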
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
physical reasoning
interactive evaluation
action planning
3D environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive benchmark
physical reasoning
vision-language models
3D action planning
causal constraints