Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation

📅 2025-08-21
🤖 AI Summary
Existing robotic benchmarks separate high-level instruction following from low-level control evaluation: the former typically assume perfect execution, while the latter rely on simple, single-step commands. This split prevents joint assessment of task planning and physical execution. This work introduces Kitchen-R, a kitchen simulation benchmark built as a digital twin in Isaac Sim for language-driven evaluation of a mobile manipulator. It features more than 500 complex natural-language instructions and supports three evaluation modes: planning only, control policy only, and integrated planning and control. Kitchen-R ships with baselines (a vision-language model for task planning and a diffusion policy for low-level control) along with a trajectory collection system. By unifying the evaluation of language-level planning and embodied execution, Kitchen-R supports more holistic benchmarking of language-guided robotic agents.

📝 Abstract
Benchmarks are crucial for evaluating progress in robotics and embodied AI. However, a significant gap exists between benchmarks designed for high-level language instruction following, which often assume perfect low-level execution, and those for low-level robot control, which rely on simple, one-step commands. This disconnect prevents a comprehensive evaluation of integrated systems where both task planning and physical execution are critical. To address this, we propose Kitchen-R, a novel benchmark that unifies the evaluation of task planning and low-level control within a simulated kitchen environment. Built as a digital twin using the Isaac Sim simulator and featuring more than 500 complex language instructions, Kitchen-R supports a mobile manipulator robot. We provide baseline methods for our benchmark, including a task-planning strategy based on a vision-language model and a low-level control policy based on diffusion policy. We also provide a trajectory collection system. Our benchmark offers a flexible framework for three evaluation modes: independent assessment of the planning module, independent assessment of the control policy, and, crucially, an integrated evaluation of the whole system. Kitchen-R bridges a key gap in embodied AI research, enabling more holistic and realistic benchmarking of language-guided robotic agents.
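The three evaluation modes named in the abstract can be pictured with a minimal Python sketch. Everything here is a hypothetical illustration rather than the Kitchen-R API: the names `EvalMode`, `evaluate_episode`, `Planner`, and `Policy` are assumptions, the planner and policy are plain callables standing in for a vision-language model and a diffusion policy, and exact plan matching is a simplified stand-in for whatever success check the benchmark actually uses.

```python
from enum import Enum
from typing import Callable, List

# Hypothetical types: a planner maps an instruction to a list of subtasks,
# and a policy reports whether it executed a single subtask successfully.
Planner = Callable[[str], List[str]]
Policy = Callable[[str], bool]


class EvalMode(Enum):
    PLANNING_ONLY = "planning_only"  # score the plan, assume perfect execution
    CONTROL_ONLY = "control_only"    # execute the reference plan with the policy
    INTEGRATED = "integrated"        # execute the planner's own plan with the policy


def evaluate_episode(
    mode: EvalMode,
    instruction: str,
    reference_plan: List[str],
    planner: Planner,
    policy: Policy,
) -> bool:
    """Return True if the episode counts as a success under the given mode."""
    if mode is EvalMode.PLANNING_ONLY:
        # Assumption: exact match against a reference plan defines success.
        return planner(instruction) == reference_plan

    if mode is EvalMode.CONTROL_ONLY:
        # The planner is bypassed; only low-level execution is tested.
        return all(policy(step) for step in reference_plan)

    # INTEGRATED: planning errors and execution errors both cause failure.
    plan = planner(instruction)
    return plan == reference_plan and all(policy(step) for step in plan)


if __name__ == "__main__":
    # Toy stand-ins: the planner is correct, the policy fails at grasping.
    reference = ["go to fridge", "pick milk"]
    toy_planner: Planner = lambda instr: ["go to fridge", "pick milk"]
    toy_policy: Policy = lambda step: not step.startswith("pick")

    for mode in EvalMode:
        result = evaluate_episode(
            mode, "bring the milk", reference, toy_planner, toy_policy
        )
        print(mode.value, result)
```

In this toy setup the planner alone succeeds while the control-only and integrated runs fail, which is the kind of discrepancy that planning-only benchmarks with assumed perfect execution cannot reveal.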
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between task planning and low-level robot control
Evaluating integrated systems with both language instructions and physical execution
Providing comprehensive benchmarking for mobile manipulation in simulated environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kitchen-R benchmark that unifies the evaluation of task planning and low-level control
Task-planning baseline based on a vision-language model
Low-level control baseline based on diffusion policy