📝 Abstract
Embodied agents tasked with complex scenarios, whether in real or simulated environments, rely heavily on robust planning capabilities. When instructions are given in natural language, large language models (LLMs), with their extensive linguistic knowledge, can fill this role. However, effectively exploiting their ability to resolve linguistic ambiguity, to retrieve information from the environment, and to ground plans in the agent's available skills requires an appropriate architecture. We propose a Hierarchical Embodied Language Planner, called HELP, consisting of a set of LLM-based agents, each dedicated to solving a different subtask. We evaluate the proposed approach on a household task and perform real-world experiments with an embodied agent. We also focus on open-source LLMs with a relatively small number of parameters, to enable autonomous deployment.
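The hierarchical structure described above, in which separate LLM-based agents handle instruction parsing and skill scheduling, can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the agent names, the `Step` type, and the rule-based stubs standing in for LLM calls are all assumptions made so the example runs without a model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    skill: str      # name of a primitive skill the agent can execute
    argument: str   # object or location the skill acts on

def parse_instruction(instruction: str) -> List[str]:
    """Semantic-parsing agent (stub): split a compound instruction
    into subtasks; an LLM would resolve ambiguity here."""
    return [part.strip() for part in instruction.split(" and ")]

def ground_subtask(subtask: str, skills: List[str]) -> Step:
    """Skill-scheduling agent (stub): map a subtask onto one of the
    agent's available skills, keeping the rest as the argument."""
    for skill in skills:
        if skill in subtask:
            return Step(skill=skill, argument=subtask.replace(skill, "").strip())
    raise ValueError(f"no available skill matches subtask: {subtask!r}")

def plan(instruction: str, skills: List[str]) -> List[Step]:
    """Top-level planner: chain the sub-agents into an action sequence."""
    return [ground_subtask(s, skills) for s in parse_instruction(instruction)]

plan("pick up the cup and bring it to the kitchen", ["pick up", "bring"])
```

Decoupling the stages this way is what lets each one be served by a small, specialized model rather than a single monolithic LLM.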