Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language-action (VLA) models achieve high success rates on standard benchmarks, yet these scores mask fundamental deficiencies in embodied reasoning, in part because standard evaluations do not separate reasoning failures from execution limitations. This work introduces BeTTER, a diagnostic benchmark that evaluates VLA reasoning in dynamic environments by applying causal interventions (such as spatial layout shifts and temporal extrapolation) under kinematic isolation, so that high-level reasoning failures are decoupled from low-level execution limits. The study finds lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse in VLA models, and traces them to architectural bottlenecks such as capacity compression and myopic downsampling. Experiments show that state-of-the-art VLA models fail catastrophically in dynamic scenarios; real-world robotic validation indicates that this is not merely a simulation artifact, which points to the need to rethink VLA architectures.

📝 Abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks, such as capacity compression and myopic downsampling, which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.
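
As a rough illustration of what a spatial-layout intervention measures, the sketch below compares a policy's success rate on a fixed layout with its success rate after two objects swap positions while the instruction stays the same. It is a toy under stated assumptions, not BeTTER's implementation: `Episode`, `swap_layout`, `intervention_gap`, and both policies are hypothetical names invented for this example, and a scene is reduced to a mapping from object names to slot indices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode: where each object sits, and which one the instruction names."""
    object_positions: Dict[str, int]  # object name -> slot index
    target: str


# A policy maps an episode to the slot index it reaches for (hypothetical interface).
Policy = Callable[[Episode], int]


def swap_layout(ep: Episode, a: str, b: str) -> Episode:
    """Spatial intervention: exchange the positions of two objects, keep the instruction."""
    positions = dict(ep.object_positions)
    positions[a], positions[b] = positions[b], positions[a]
    return Episode(positions, ep.target)


def success_rate(policy: Policy, episodes: List[Episode]) -> float:
    """Fraction of episodes in which the policy reaches the slot holding the target."""
    hits = sum(policy(ep) == ep.object_positions[ep.target] for ep in episodes)
    return hits / len(episodes)


def intervention_gap(policy: Policy, episodes: List[Episode], a: str, b: str) -> float:
    """Success on the original layouts minus success on the swapped layouts."""
    baseline = success_rate(policy, episodes)
    shifted = success_rate(policy, [swap_layout(ep, a, b) for ep in episodes])
    return baseline - shifted


def memorized_policy(ep: Episode) -> int:
    """Toy shortcut policy: always reaches for slot 0, ignoring the scene."""
    return 0


def grounded_policy(ep: Episode) -> int:
    """Toy grounded policy: looks up where the named target actually is."""
    return ep.object_positions[ep.target]


# Static "training-like" layout: the cup is always in slot 0.
episodes = [Episode({"cup": 0, "bowl": 1}, target="cup") for _ in range(4)]

memorized_gap = intervention_gap(memorized_policy, episodes, "cup", "bowl")
grounded_gap = intervention_gap(grounded_policy, episodes, "cup", "bowl")
```

Both toy policies score perfectly on the static layout, so a static protocol cannot tell them apart; only the swapped layout exposes the shortcut policy, which is the kind of masking the abstract attributes to highly static evaluation.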

Problem

Research questions and friction points this paper is trying to address.

embodied reasoning

Vision-Language-Action models

benchmarking

semantic representation

robotic intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

embodied reasoning

Vision-Language-Action models

causal intervention