🤖 AI Summary
To address critical challenges in evaluating Vision-Language-Action (VLA) models on real robots for embodied intelligence, including poor scalability, irreproducibility, and the lack of standardized benchmarks, this paper introduces the first standardized online evaluation framework enabling large-scale parallel testing on physical robots. Methodologically, we construct a distributed robot cluster that integrates containerized model deployment, automated task scheduling, structured evaluation protocols, and real-time performance monitoring, establishing an end-to-end closed-loop experimental pipeline. Our contributions are threefold: (1) an open, reproducible real-robot evaluation benchmark; (2) a tenfold increase in test throughput, substantially improving cross-model comparability; and (3) systematic empirical validation of the generalization capability and robustness of multiple state-of-the-art VLA models across diverse physical tasks.
📝 Abstract
Testing on real machines is indispensable for robotic control algorithms. In the context of learning-based algorithms, especially VLA models, the demand for large-scale evaluation, i.e., testing a large number of models on a large number of tasks, is becoming increasingly urgent. However, doing this right is highly non-trivial, especially when scalability and reproducibility are taken into account. In this report, we describe our methodology for constructing RoboChallenge, an online evaluation system for testing robotic control algorithms, and our survey of recent state-of-the-art VLA models using our initial benchmark, Table30.
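To make the closed-loop pipeline described above concrete, the sketch below shows one way such an evaluation matrix could be scheduled: a queue of (model, task) jobs dispatched in parallel to a pool of robots, with per-job outcomes recorded for later comparison. This is a minimal, hypothetical illustration only; the class names (`EvalJob`, `robot_worker`), endpoints, and randomized scoring are assumptions for exposition and do not reflect the actual RoboChallenge implementation or API.

```python
# Hypothetical sketch: schedule (model, task) evaluation jobs across a pool of
# robots and record outcomes. All names and endpoints are illustrative.
from dataclasses import dataclass
from queue import Queue
from threading import Thread
import random
import time


@dataclass
class EvalJob:
    model_endpoint: str  # URL of a containerized policy server (assumed deployment style)
    task_id: str         # benchmark task name, e.g. a Table30 task
    episodes: int = 5    # repeated trials per (model, task) pair


@dataclass
class EvalResult:
    job: EvalJob
    successes: int
    duration_s: float


def run_episode(model_endpoint: str, task_id: str) -> bool:
    """Placeholder for one physical rollout: query the policy server for actions,
    execute them on the robot, and score task completion. Randomized here."""
    time.sleep(0.01)
    return random.random() < 0.5


def robot_worker(robot_name: str, jobs: Queue, results: list) -> None:
    """One robot: repeatedly pull a job, run its episodes, and log the result."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            return
        start = time.time()
        successes = sum(run_episode(job.model_endpoint, job.task_id)
                        for _ in range(job.episodes))
        results.append(EvalResult(job, successes, time.time() - start))
        print(f"[{robot_name}] {job.task_id} x{job.episodes}: {successes} successes")


if __name__ == "__main__":
    models = ["http://model-a:8000", "http://model-b:8000"]  # assumed endpoints
    tasks = [f"task_{i:02d}" for i in range(4)]              # stand-in for benchmark tasks
    robots = ["robot_0", "robot_1", "robot_2"]               # parallel physical robots

    jobs: Queue = Queue()
    results: list = []
    for m in models:
        for t in tasks:
            jobs.put(EvalJob(model_endpoint=m, task_id=t))

    threads = [Thread(target=robot_worker, args=(r, jobs, results)) for r in robots]
    for th in threads:
        th.start()
    for _ in threads:
        jobs.put(None)  # one sentinel per worker, queued after all real jobs
    for th in threads:
        th.join()

    print(f"Completed {len(results)} (model, task) evaluations on {len(robots)} robots.")
```

In this toy setup, throughput scales with the number of robot workers pulling from the shared job queue, which is the intuition behind the parallel, large-scale testing the framework targets; the real system would additionally handle container lifecycle, task resets, and monitoring.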