VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

📅 2026-05-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the lack of a low-cost, reproducible, and task-diverse real-world evaluation benchmark for vision-language-action (VLA) models. The authors propose a standardized testing platform built from off-the-shelf robotic hardware that laboratories can assemble independently, enabling consistent VLA evaluation under both in-distribution and out-of-distribution scenarios. Because it relies on inexpensive components and local replication, the benchmark requires neither expensive equipment nor centralized evaluation. It provides a diverse suite of manipulation tasks, a real-world evaluation protocol, and a small-scale demonstration dataset for target-domain adaptation. Experiments with imitation learning and state-of-the-art VLA models reveal the strengths and limitations of these models, and consistent results across independently constructed setups support the benchmark's reproducibility.
📝 Abstract
Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
real-world evaluation
benchmark
reproducibility
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models
real-world benchmark
low-cost robotics
reproducible evaluation
off-the-shelf components