RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
How can we conduct unbiased, scalable, cross-task, cross-environment real-world evaluation of general-purpose robot policies? This paper introduces RoboArena, a decentralized evaluation framework that leverages a distributed, multi-institutional network of evaluators and a double-blind pairwise comparison protocol, allowing participants to freely define their own tasks and environments. Its core innovation is the integration of preference learning with a distributed experimental design, enabling fair aggregation and ranking of heterogeneous policies across diverse real-world settings. Built on the DROID platform, RoboArena runs over 600 real-robot pairwise evaluations across seven state-of-the-art generalist policies. The results show that RoboArena ranks policies more accurately than conventional centralized benchmarks while being more scalable, resilient, and robust, establishing a more credible and sustainably evolving open evaluation infrastructure for general-purpose robot policies.

📝 Abstract
Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
Problem

Research questions and friction points this paper is trying to address.

Scalable real-world evaluation of generalist robot policies
Crowd-sourced diverse task and environment assessments
Accurate policy ranking via distributed pairwise comparisons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed crowd-sourced real-world evaluations
Double-blind pairwise policy comparisons
Aggregated preference feedback for policy ranking
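The aggregation step described above, turning double-blind pairwise preferences into a global policy ranking, can be sketched with a Bradley-Terry model fit by minorization-maximization. This is a minimal illustration of the general technique; the paper's exact aggregation procedure may differ, and the policy names and comparison data below are hypothetical.

```python
def bradley_terry(policies, comparisons, iters=200):
    """Fit Bradley-Terry skill scores via minorization-maximization.

    policies: list of policy identifiers.
    comparisons: list of (winner, loser) pairs, one per double-blind
        pairwise evaluation episode.
    Returns a dict mapping each policy to a non-negative score;
    higher means more often preferred.
    """
    score = {p: 1.0 for p in policies}
    wins = {p: 0 for p in policies}
    for winner, _ in comparisons:
        wins[winner] += 1

    for _ in range(iters):
        new = {}
        for p in policies:
            # Sum of 1 / (s_p + s_opponent) over all matches involving p.
            denom = 0.0
            for w, l in comparisons:
                if p in (w, l):
                    other = l if p == w else w
                    denom += 1.0 / (score[p] + score[other])
            new[p] = wins[p] / denom if denom > 0 else score[p]
        # Normalize so scores sum to len(policies) (fixes scale ambiguity).
        total = sum(new.values())
        score = {p: len(policies) * s / total for p, s in new.items()}
    return score

# Hypothetical data: A beats B twice, B beats C once, A beats C once.
scores = bradley_terry(
    ["A", "B", "C"],
    [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")],
)
ranking = sorted(scores, key=scores.get, reverse=True)  # → ["A", "B", "C"]
```

A key property of this family of models is that policies need not be evaluated on the same tasks: as long as the comparison graph is connected, relative scores remain identifiable across heterogeneous tasks and environments.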
🔎 Similar Papers
2024-03-19 · Robotics: Science and Systems · Citations: 151
Pranav Atreya
UC Berkeley
Robotics, Reinforcement Learning, Self-supervised Learning, Natural Language Processing
Karl Pertsch
UC Berkeley, Stanford University
Artificial Intelligence, Machine Learning, Robotics
Tony Lee
Stanford University
Moo Jin Kim
Stanford University
machine learning, robotics, reinforcement learning
Arhan Jain
University of Washington
Artur Kuramshin
MSc Student, Université de Montréal
robotics, reinforcement learning, deep learning, computer vision
Clemens Eppner
NVIDIA Research
Robotics
Cyrus Neary
The University of British Columbia
artificial intelligence, reinforcement learning, machine learning, multiagent systems, control
Edward Hu
OpenAI
Deep Learning, Generative Models, Reasoning
Fabio Ramos
University of Sydney and NVIDIA
robotics, machine learning
Jonathan Tremblay
NVIDIA
artificial intelligence, robotics
Kanav Arora
University of Washington
Kirsty Ellis
University of Montreal
Luca Macesanu
University of Pennsylvania
Matthew Leonard
University of Pennsylvania
Meedeum Cho
Yonsei University
Ozgur Aslan
University of Montreal
Shivin Dass
PhD Student, UT Austin
Artificial Intelligence, Machine Learning, Robot Learning
Jie Wang
University of Pennsylvania
Xingfang Yuan
University of Pennsylvania
Xuning Yang
NVIDIA, Carnegie Mellon University
robotics
Abhishek Gupta
University of Washington
Dinesh Jayaraman
Assistant Professor, University of Pennsylvania
robot learning, computer vision, robotics, machine learning
Glen Berseth
Assistant Professor, Université de Montréal
Reinforcement Learning, Robotics, Deep Learning, Machine Learning
Kostas Daniilidis
Ruth Yalom Stone Professor of Computer and Information Science, University of Pennsylvania
Computer Vision, Robotics