Embodied Instruction Following in Unknown Environments

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
📄 PDF
🤖 AI Summary
Embodied agents struggle to comprehend and execute complex natural-language instructions (e.g., “make breakfast”, “tidy the room”) in previously unseen domestic environments due to insufficient scene understanding and lack of grounded planning. Method: This paper introduces the first hierarchical embodied instruction-following framework designed explicitly for unknown environments. It unifies task planning with goal-directed active exploration, constructing a dynamic region-wise attentional semantic map that ensures only physically present objects inform feasible action plans. The framework employs a multimodal large language model–driven hierarchical architecture—comprising a high-level task planner and a low-level exploration controller—that integrates vision-language joint reasoning with closed-loop navigation. Contribution/Results: Evaluated on large-scale, house-level scenes, the method successfully completes 204 complex instructions with a 45.09% success rate—significantly outperforming conventional approaches, which fail catastrophically in unseen environments due to absent scene awareness.

Technology Category

Application Category

📝 Abstract
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.
Problem

Research questions and friction points this paper is trying to address.

Enabling agents to follow complex human instructions in unknown environments
Generating feasible plans with existing objects in unexplored settings
Aligning task planning and exploration for abstract instruction completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework with multimodal language models
Semantic map with dynamic region attention
Task planner and exploration controller alignment
🔎 Similar Papers
No similar papers found.
Z
Zhenyu Wu
Beijing University of Posts and Telecommunications
Z
Ziwei Wang
Carnegie Mellon University
Xiuwei Xu
Xiuwei Xu
Tsinghua University
computer visionembodied AI
J
Jiwen Lu
Tsinghua University
Haibin Yan
Haibin Yan
Beijing University of Posts and Telecommunications
Computer VisionPattern RecognitionRobotics