RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

📅 2026-02-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current evaluations of vision-language-action (VLA) models are largely confined to simulated or highly constrained environments, limiting their ability to assess generalization in real-world physical settings. To address this gap, this work proposes RADAR, a novel benchmark that integrates realistic physical dynamics, explicit spatial-physical reasoning tasks, and a fully automated 3D evaluation pipeline, thereby addressing deficiencies in environmental realism, cognitive depth, and scalability of existing benchmarks. RADAR employs a physically simulated environment featuring dynamic object configurations and sensor noise, and implements an automatic evaluation protocol based on metrics such as 3D IoU. Experiments reveal a significant performance drop in state-of-the-art VLA models under realistic dynamic conditions (3D IoU declines from 0.261 to 0.068) and expose their limited spatial reasoning capabilities, highlighting the fragility of current approaches for real-world deployment.
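The summary above does not specify how RADAR injects sensor noise; as a purely illustrative sketch (not the paper's actual protocol), one common way to perturb an RGB-D observation is additive Gaussian noise on the image plus range noise and random dropout on the depth map. The function and parameter names below are hypothetical.

```python
import numpy as np

def perturb_observation(rgb, depth, rgb_sigma=0.02, depth_sigma=0.005,
                        dropout_p=0.01, rng=None):
    """Inject simple sensor noise into an RGB-D observation.

    rgb:   float array in [0, 1], shape (H, W, 3)
    depth: float array in meters, shape (H, W)
    """
    rng = np.random.default_rng() if rng is None else rng

    # Additive Gaussian pixel noise on the RGB image, clamped back to [0, 1].
    noisy_rgb = np.clip(rgb + rng.normal(0.0, rgb_sigma, rgb.shape), 0.0, 1.0)

    # Gaussian range noise plus random dropout (zeroed pixels) on the depth map.
    noisy_depth = depth + rng.normal(0.0, depth_sigma, depth.shape)
    dropout_mask = rng.random(depth.shape) < dropout_p
    noisy_depth[dropout_mask] = 0.0

    return noisy_rgb, noisy_depth
```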

📝 Abstract
Vision-language-action (VLA) models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark toward reliable and generalizable real-world evaluation of VLA models.
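The abstract states that the autonomous evaluation pipeline relies on 3D metrics such as 3D IoU but does not give the exact formulation. As a minimal sketch, assuming axis-aligned bounding boxes (oriented boxes would require a different intersection routine), 3D IoU can be computed as follows; the function name and box format are illustrative assumptions.

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """3D IoU between two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)

    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    overlap = np.clip(hi - lo, a_min=0.0, a_max=None)

    inter = overlap.prod()
    vol_a = (box_a[3:] - box_a[:3]).prod()
    vol_b = (box_b[3:] - box_b[:3]).prod()
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

# Example: predicted vs. target placement of a manipulated object (10 cm cubes, 2 cm offset).
pred   = (0.00, 0.00, 0.00, 0.10, 0.10, 0.10)
target = (0.02, 0.00, 0.00, 0.12, 0.10, 0.10)
print(aabb_iou_3d(pred, target))  # ~0.667
```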
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
real-world generalization
benchmarking
spatial-physical intelligence
autonomous evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
real-world dynamics
spatial-physical intelligence
autonomous evaluation
3D metrics
Yuhao Chen
Sun Yat-sen University
Zhihao Zhan
TopXGun Robotics
SLAM, Spatial AI, Robotics
Xiaoxin Lin
Sun Yat-sen University
Zijian Song
Sun Yat-sen University
Hao Liu
Sun Yat-sen University
Qinhan Lyu
Sun Yat-sen University
Yubo Zu
Sun Yat-sen University
Xiao Chen
Sun Yat-sen University
Zhiyuan Liu
Sun Yat-sen University
Tao Pu
Sun Yat-sen University
Visual Understanding, Embodied Intelligence
Tianshui Chen
X-Era AI Lab; Guangdong University of Technology
Keze Wang
Sun Yat-sen University; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Guangrun Wang
University of Oxford; AI Research Team at Aistetic
Machine Learning, General Intelligence Theory and Application