🤖 AI Summary
Embodied agents struggle to comprehend and execute complex natural-language instructions (e.g., “make breakfast”, “tidy the room”) in previously unseen domestic environments due to insufficient scene understanding and lack of grounded planning.
Method: This paper introduces the first hierarchical embodied instruction-following framework designed explicitly for unknown environments. It unifies task planning with goal-directed active exploration, constructing a dynamic region-wise attentional semantic map that ensures only physically present objects inform feasible action plans. The framework employs a multimodal large language model–driven hierarchical architecture—comprising a high-level task planner and a low-level exploration controller—that integrates vision-language joint reasoning with closed-loop navigation.
Contribution/Results: Evaluated on large-scale, house-level scenes, the method successfully completes 204 complex instructions with a 45.09% success rate—significantly outperforming conventional approaches, which fail catastrophically in unseen environments due to absent scene awareness.
📝 Abstract
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.