🤖 AI Summary
Current dual-arm robot evaluation relies almost exclusively on binary success rates, which obscure latent deficiencies such as poor inter-arm coordination, grasp slippage, and asymmetric arm usage. To address this, we propose the first structured evaluation framework tailored to dual-arm manipulation: (1) a hierarchical task taxonomy decomposed into skill stages to ensure fine-grained behavioral coverage; (2) behavior-aware diagnostic metrics that quantitatively assess critical dimensions, including coordination, grasp stability, and arm symmetry; and (3) 3,000+ human demonstrations to support imitation learning and strategy-interpretability analysis. Experiments demonstrate strong discriminative power even in success-rate-saturated regimes: behavioral metrics show statistically significant correlations with success in over 50% of task-metric pairs, effectively exposing failure modes. The framework establishes a reproducible, attributable benchmark for the iterative development and rigorous comparison of dual-arm control algorithms.
📝 Abstract
We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior, such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3,000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how they execute tasks: some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation and highlights the need for evaluation tools that go beyond success alone.
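The correlation between a behavioral metric and binary success described above can be sketched as follows. This is a minimal illustration, not RoboEval's actual analysis pipeline: the episode data and the `grasp_stability` metric name are hypothetical, and Pearson correlation against a 0/1 success variable (equivalently, point-biserial correlation) stands in for whatever statistic the benchmark computes.

```python
# Illustrative only: correlating a hypothetical per-episode behavioral
# metric with binary task success across episodes.
import math

def pearson(xs, ys):
    """Pearson correlation; with one binary variable this equals
    the point-biserial correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical episode logs: metric value and success flag per episode.
grasp_stability = [0.91, 0.42, 0.88, 0.35, 0.79, 0.50]
success         = [1,    0,    1,    0,    1,    0]

r = pearson(grasp_stability, success)
print(f"point-biserial r = {r:.2f}")
```

A high correlation like this indicates the metric tracks success and so stays informative for ranking policies even once raw success rates saturate.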