RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

📅 2026-02-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current evaluations of vision-language-action (VLA) models are largely confined to simulated or highly constrained environments, limiting their ability to assess generalization in real-world physical settings. To address this gap, this work proposes RADAR, a novel benchmark that integrates realistic physical dynamics, explicit spatial-physical reasoning tasks, and a fully automated 3D evaluation pipeline, thereby addressing deficiencies in environmental realism, cognitive depth, and scalability of existing benchmarks. RADAR employs a physically simulated environment featuring dynamic object configurations and sensor noise, and implements an automatic evaluation protocol based on metrics such as 3D IoU. Experiments reveal a significant performance drop in state-of-the-art VLA models under realistic dynamic conditions (3D IoU declines from 0.261 to 0.068) and expose their limited spatial reasoning capabilities, highlighting the fragility of current approaches for real-world deployment.
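The summary above does not specify how RADAR injects sensor noise; as a purely illustrative sketch (not the paper's actual protocol), one common way to perturb an RGB-D observation is additive Gaussian noise on the image plus range noise and random dropout on the depth map. The function and parameter names below are hypothetical.

```python
import numpy as np

def perturb_observation(rgb, depth, rgb_sigma=0.02, depth_sigma=0.005,
                        dropout_p=0.01, rng=None):
    """Inject simple sensor noise into an RGB-D observation.

    rgb:   float array in [0, 1], shape (H, W, 3)
    depth: float array in meters, shape (H, W)
    """
    rng = np.random.default_rng() if rng is None else rng

    # Additive Gaussian pixel noise on the RGB image, clamped back to [0, 1].
    noisy_rgb = np.clip(rgb + rng.normal(0.0, rgb_sigma, rgb.shape), 0.0, 1.0)

    # Gaussian range noise plus random dropout (zeroed pixels) on the depth map.
    noisy_depth = depth + rng.normal(0.0, depth_sigma, depth.shape)
    dropout_mask = rng.random(depth.shape) < dropout_p
    noisy_depth[dropout_mask] = 0.0

    return noisy_rgb, noisy_depth
```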

📝 Abstract
Vision-language-action (VLA) models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark toward reliable and generalizable real-world evaluation of VLA models.
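The abstract states that the autonomous evaluation pipeline relies on 3D metrics such as 3D IoU but does not give the exact formulation. As a minimal sketch, assuming axis-aligned bounding boxes (oriented boxes would require a different intersection routine), 3D IoU can be computed as follows; the function name and box format are illustrative assumptions.

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """3D IoU between two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)

    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    overlap = np.clip(hi - lo, a_min=0.0, a_max=None)

    inter = overlap.prod()
    vol_a = (box_a[3:] - box_a[:3]).prod()
    vol_b = (box_b[3:] - box_b[:3]).prod()
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

# Example: predicted vs. target placement of a manipulated object (10 cm cubes, 2 cm offset).
pred   = (0.00, 0.00, 0.00, 0.10, 0.10, 0.10)
target = (0.02, 0.00, 0.00, 0.12, 0.10, 0.10)
print(aabb_iou_3d(pred, target))  # ~0.667
```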
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
real-world generalization
benchmarking
spatial-physical intelligence
autonomous evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
real-world dynamics
spatial-physical intelligence
autonomous evaluation
3D metrics
Yuhao Chen
Sun Yat-sen University
Zhihao Zhan
TopXGun Robotics
SLAM, Spatial AI, Robotics
Xiaoxin Lin
Sun Yat-sen University
Zijian Song
Sun Yat-sen University
Hao Liu
Sun Yat-sen University
Qinhan Lyu
Sun Yat-sen University
Yubo Zu
Sun Yat-sen University
Xiao Chen
Sun Yat-sen University
Zhiyuan Liu
Sun Yat-sen University
Tao Pu
Sun Yat-sen University
Visual Understanding, Embodied Intelligence
Tianshui Chen
X-Era AI Lab; Guangdong University of Technology
Keze Wang
Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Guangrun Wang
University of Oxford; AI Research Team at Aistetic
Machine Learning, General Intelligence Theory and Application