🤖 AI Summary
Current benchmarks for vision-language-action (VLA) models struggle to faithfully evaluate generalization under distribution shifts, robustness, and the alignment between instructions and perception. To address this gap, this work proposes LIBERO-X, a three-tiered evaluation framework that systematically assesses spatial generalization, object recognition, and task instruction comprehension. By integrating highly diverse human teleoperation data with multi-granular manipulation goals, LIBERO-X bridges the distributional gap between training and evaluation. Through a hierarchical evaluation protocol and cumulative perturbation testing, the framework reveals significant performance degradation in state-of-the-art VLA models under compound disturbances, exposing critical deficiencies in scene understanding and instruction execution. LIBERO-X thus establishes a reliable, fine-grained benchmark to guide future research in embodied AI.
📝 Abstract
Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments because their evaluation protocols fail to capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, in which each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.