EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

📅 2025-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Natural-language-driven autonomous home robots lack a unified benchmark for complex tasks, adequate evaluation methods and metrics, and data compatibility between LLMs and mobile manipulation trajectories. Method: This paper introduces EMMOE, a benchmark for embodied mobile manipulation in open home environments. It comprises a unified framework that integrates high-level and low-level embodied tasks for long-horizon execution in continuous space; three new evaluation metrics; the EMMOE-100 dataset (100 tasks) with varied task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training; and HomieBot, an agent system that combines a DPO-trained LLM, lightweight navigation and manipulation models, and multiple error detection mechanisms. Contribution/Results: Experiments report HomieBot's performance on EMMOE alongside an evaluation of different models and policies.

📝 Abstract

Developing autonomous home robots controlled by natural language has long been a human pursuit. While advances in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, and data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we introduce Embodied Mobile Manipulation in Open Environments (EMMOE), which requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. Additionally, we collect EMMOE-100, which features various task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM trained with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms. Finally, we demonstrate HomieBot's performance and evaluate different models and policies.

Problem

Research questions and friction points this paper is trying to address.

Lack of unified benchmark for complex robot tasks

Limited evaluation methods and metrics for mobile manipulation

Data incompatibility between LLMs and manipulation trajectories

Innovation

Methods, ideas, or system contributions that make the work stand out.

EMMOE integrates high-level and low-level tasks.

HomieBot uses an LLM trained with Direct Preference Optimization (DPO).

EMMOE-100 includes detailed annotations and re-plans.
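
For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It illustrates the general objective only and is not HomieBot's training code: the function name `dpo_loss`, the argument names, the default `beta`, and the example log-probabilities are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities, purely for illustration.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy identical to reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)  # policy prefers the chosen response
print(neutral, improved)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy assigns relatively more probability to the preferred response than the reference does.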

🔎 Similar Papers

M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes