AI Summary
This work addresses a critical gap in existing robotic benchmarks, which predominantly emphasize skill execution while neglecting systematic evaluation of cognitive reasoning, creative tool use, and adaptability to unexpected scenarios. To bridge this gap, the authors introduce RoboWits, the first dual-arm robotic benchmark specifically designed for assessing cognitive and creative problem-solving capabilities. Built upon a multi-agent collaboration framework, RoboWits automatically generates diverse out-of-distribution tasks involving geometric, material, and assembly reasoning, with support for difficulty scaling and extensible task generation. Experimental results demonstrate that current vision-language-action (VLA) models, despite achieving baseline proficiency after single-task fine-tuning, exhibit substantial performance degradation in perturbed or novel contexts, thereby revealing significant limitations in their reasoning flexibility and environmental robustness.
Abstract
The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bimanual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality, reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curate 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs achieve preliminary success on seed tasks after single-task fine-tuning, they struggle on mutated tasks, indicating brittleness in manipulation tasks that require reasoning, strategy adaptation, and robustness to deceptive or constrained environments. The project page is available at https://umass-embodied-agi.github.io/RoboWits.