🤖 AI Summary
Existing robotic benchmarks suffer from a disconnect between high-level instruction following and low-level control evaluation: the former assumes perfect execution, while the latter supports only simple, single-step commands, hindering assessment of joint task planning and physical execution. This work introduces Kitchen-R, a digital-twin kitchen simulation benchmark built on Isaac Sim that enables end-to-end, language-driven evaluation of a mobile manipulation robot. It features more than 500 complex, multi-step natural-language instructions and supports three evaluation modes: independent assessment of the planning module, independent assessment of the control policy, and integrated evaluation of the whole system. Kitchen-R provides baselines that pair a vision-language model for task planning with a diffusion policy for low-level control, along with a trajectory collection system. By bridging the evaluation gap between linguistic understanding and embodied execution, Kitchen-R advances standardized benchmarking for embodied AI.
📝 Abstract
Benchmarks are crucial for evaluating progress in robotics and embodied AI. However, a significant gap exists between benchmarks designed for high-level language instruction following, which often assume perfect low-level execution, and those for low-level robot control, which rely on simple, one-step commands. This disconnect prevents a comprehensive evaluation of integrated systems where both task planning and physical execution are critical. To address this, we propose Kitchen-R, a novel benchmark that unifies the evaluation of task planning and low-level control within a simulated kitchen environment. Built as a digital twin using the Isaac Sim simulator and featuring more than 500 complex language instructions, Kitchen-R supports a mobile manipulator robot. We provide baseline methods for our benchmark, including a task-planning strategy based on a vision-language model and a low-level control policy based on a diffusion policy, as well as a trajectory collection system. Our benchmark offers a flexible framework for three evaluation modes: independent assessment of the planning module, independent assessment of the control policy, and, crucially, an integrated evaluation of the whole system. Kitchen-R bridges a key gap in embodied AI research, enabling more holistic and realistic benchmarking of language-guided robotic agents.