🤖 AI Summary
Problem: Large language models (LLMs) reason poorly over long horizons, so using them directly for embodied task planning on real-world robots yields low success rates.
Method: We propose STEP, a closed-loop hierarchical subgoal planning framework that builds a coarse-to-fine subgoal tree. A foundation LLM decomposes complex goals into subgoals, while a leaf-node termination model uses the current environment state to decide whether a subgoal can be directly converted into a primitive action or must be decomposed further, thereby closing the perception–planning–execution loop.
Contribution/Results: The key idea is to decouple task decomposition from the termination decision, which enables adaptive, verifiable hierarchical planning. Evaluated on the VirtualHome WAH-NL benchmark and on a physical robot platform, STEP achieves success rates of up to 34% and 25%, respectively, outperforming prior state-of-the-art methods.
📝 Abstract
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby expanding the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to stop expanding the tree and ensuring that each leaf node can be directly converted into a primitive action. Experiments conducted both on the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates of up to 34% (WAH-NL) and 25% (real robot), outperforming state-of-the-art methods.
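The interplay between the two models described above can be illustrated with a minimal sketch. It is not the STEP implementation: the rule-based `toy_decompose` stands in for the LLM-based decomposition model, `toy_is_leaf` stands in for the learned termination model, and all names, the state layout, and the `max_depth` guard are assumptions made for illustration. The sketch only shows the control flow: a node is expanded until the termination check says it maps to a primitive action, and each executed action updates the state seen by later decisions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]  # hypothetical environment state


@dataclass
class SubgoalNode:
    goal: str
    children: List["SubgoalNode"] = field(default_factory=list)


def expand(
    goal: str,
    state: State,
    decompose: Callable[[str, State], List[str]],
    is_leaf: Callable[[str, State], bool],
    execute: Callable[[str, State], None],
    depth: int = 0,
    max_depth: int = 5,
) -> SubgoalNode:
    """Depth-first expansion of the subgoal tree with state feedback."""
    node = SubgoalNode(goal)
    if is_leaf(goal, state):
        # Termination check passed: treat the goal as a primitive action.
        execute(goal, state)
        return node
    if depth >= max_depth:
        # Safety guard added for this sketch, not part of the source text.
        raise RuntimeError(f"no primitive action reached for: {goal!r}")
    for sub in decompose(goal, state):
        # Later siblings see the state left behind by earlier ones.
        child = expand(sub, state, decompose, is_leaf, execute,
                       depth + 1, max_depth)
        node.children.append(child)
    return node


# Toy stand-ins (assumptions) for the two models and the executor.
def toy_decompose(goal: str, state: State) -> List[str]:
    if goal == "set the table":
        # Skip items the environment state reports as already placed.
        return [f"put {item} on table" for item in ("plate", "fork")
                if item not in state["on_table"]]
    item = goal.split()[1]  # goal has the form "put <item> on table"
    return [f"grab {item}", f"place {item} on table"]


def toy_is_leaf(goal: str, state: State) -> bool:
    return goal.startswith(("grab ", "place "))


def toy_execute(action: str, state: State) -> None:
    if action.startswith("place "):
        state["on_table"].add(action.split()[1])
    state["log"].append(action)


state: State = {"on_table": {"fork"}, "log": []}
tree = expand("set the table", state, toy_decompose, toy_is_leaf, toy_execute)
print(state["log"])  # ['grab plate', 'place plate on table']
```

Because the fork is already on the table in the initial state, the toy decomposition plans only for the plate, which is the kind of state-dependent adaptation that a closed-loop planner is meant to provide.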