🤖 AI Summary
Current dual-arm robot evaluation relies almost exclusively on binary success rates, which obscure latent deficiencies such as poor inter-arm coordination, grasp slippage, and asymmetric arm usage. To address this, we propose the first structured evaluation framework tailored to dual-arm manipulation: (1) a hierarchical task taxonomy decomposed into skill stages to ensure fine-grained behavioral coverage; (2) behavior-aware diagnostic metrics that quantitatively assess critical dimensions, including coordination, grasp stability, and arm symmetry; and (3) 3,000+ human demonstrations to support imitation learning and strategy-interpretability analysis. Experiments demonstrate strong discriminative power even in success-rate-saturated regimes: behavioral metrics show statistically significant correlations with success in over 50% of task-metric pairs, effectively exposing failure modes. The framework establishes a reproducible, attributable benchmark for the iterative development and rigorous comparison of dual-arm control algorithms.
📝 Abstract
We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior, such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3,000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how they execute tasks: some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation and highlights the need for evaluation tools that go beyond success alone.
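The correlation between a behavioral metric and binary success described above can be sketched as follows. This is a minimal illustration, not RoboEval's actual analysis pipeline: the episode data and the `grasp_stability` metric name are hypothetical, and Pearson correlation against a 0/1 success variable (equivalently, point-biserial correlation) stands in for whatever statistic the benchmark computes.

```python
# Illustrative only: correlating a hypothetical per-episode behavioral
# metric with binary task success across episodes.
import math

def pearson(xs, ys):
    """Pearson correlation; with one binary variable this equals
    the point-biserial correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical episode logs: metric value and success flag per episode.
grasp_stability = [0.91, 0.42, 0.88, 0.35, 0.79, 0.50]
success         = [1,    0,    1,    0,    1,    0]

r = pearson(grasp_stability, success)
print(f"point-biserial r = {r:.2f}")
```

A high correlation like this indicates the metric tracks success and so stays informative for ranking policies even once raw success rates saturate.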