LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

📅 2025-10-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing VLA evaluation benchmarks, notably LIBERO, suffer from flawed train-evaluation protocols that inadvertently encourage models to memorize associations between action sequences and fixed environment layouts rather than to acquire genuine task understanding or generalization, leading to inflated performance estimates. To address this, we propose LIBERO-PRO, the first multi-dimensional perturbation benchmark for vision-language-action (VLA) models. It systematically introduces controlled perturbations across four dimensions (objects, states, instructions, and environments), including object substitution, instruction corruption, and environment reconfiguration, to rigorously assess robustness and zero-shot generalization. Experiments reveal a dramatic performance collapse: state-of-the-art VLA models achieve over 90% accuracy on standard LIBERO but drop to 0.0% on LIBERO-PRO, exposing their reliance on spurious memorization rather than compositional reasoning. This work establishes a more rigorous, fair, and challenging evaluation paradigm for VLA models.

📝 Abstract
LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or garbled tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.
Problem

Research questions and friction points this paper is trying to address.

Addresses inflated performance in VLA model evaluations
Systematically tests robustness across object and environment perturbations
Exposes models' reliance on memorization over genuine understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LIBERO-PRO benchmark with systematic perturbations
Evaluates models across objects, states, instructions, environments
Exposes reliance on memorization rather than genuine understanding
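To make the four-axis perturbation idea concrete, here is a minimal sketch of how such an evaluation loop might be organized. This is a hypothetical illustration, not the actual LIBERO-PRO API: the task fields (`target_object`, `init_pose`, `instruction`, `environment`), the perturbation functions, and the `evaluate` helper are all invented for this example.

```python
import random

# Hypothetical perturbation suite in the spirit of LIBERO-PRO's four
# dimensions; the real benchmark's task format and API differ.

def perturb_object(task, rng):
    # Object perturbation: substitute the target with a scene distractor.
    distractors = [o for o in task["scene_objects"] if o != task["target_object"]]
    return {**task, "target_object": rng.choice(distractors)}

def perturb_state(task, rng):
    # State perturbation: jitter the initial object pose.
    x, y = task["init_pose"]
    return {**task, "init_pose": (x + rng.uniform(-0.1, 0.1),
                                  y + rng.uniform(-0.1, 0.1))}

def perturb_instruction(task, rng):
    # Instruction perturbation: corrupt the language command (word shuffle).
    words = task["instruction"].split()
    rng.shuffle(words)
    return {**task, "instruction": " ".join(words)}

def perturb_environment(task, rng):
    # Environment perturbation: swap the scene layout.
    return {**task, "environment": rng.choice(["kitchen", "office", "lab"])}

PERTURBATIONS = {
    "object": perturb_object,
    "state": perturb_state,
    "instruction": perturb_instruction,
    "environment": perturb_environment,
}

def evaluate(policy, task, dims, trials=10, seed=0):
    # Success rate of `policy` on `task` with perturbations along `dims`.
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        t = dict(task)
        for d in dims:
            t = PERTURBATIONS[d](t, rng)
        successes += policy(t)
    return successes / trials
```

A policy that has merely memorized the training layout scores perfectly on the unperturbed task but collapses under object substitution, mirroring the >90% vs. 0.0% gap the paper reports.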
Xueyang Zhou
Huazhong University of Science and Technology
Yangming Xu
Huazhong University of Science and Technology
Guiyao Tie
Huazhong University of Science and Technology
Yongchao Chen
Harvard University, Massachusetts Institute of Technology
Robot Planning, Foundation Models, Formal Methods, Mechanics, AI for Science
Guowen Zhang
The Hong Kong Polytechnic University
Computer Vision, 3D Vision, Autonomous Driving
Duanfeng Chu
Wuhan University of Technology
Pan Zhou
Huazhong University of Science and Technology
Lichao Sun
Lehigh University