VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a low-cost, reproducible, and task-diverse real-world evaluation benchmark for vision-language-action (VLA) models. The authors propose a standardized testing platform built from off-the-shelf robotic hardware that can be replicated across laboratories, removing the need for expensive equipment or centralized evaluation. The benchmark provides a suite of manipulation tasks, a small-scale demonstration dataset for target-domain adaptation, and real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal the strengths and limitations of these models, and consistent results across independently constructed setups support the benchmark's reproducibility.

📝 Abstract

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware or centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

real-world evaluation

benchmark

reproducibility

robotic manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models

real-world benchmark

low-cost robotics