STEP Planner: Constructing cross-hierarchical subgoal tree as an embodied long-horizon task planner

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) used directly as action-sequence generators have limited reasoning ability for long-horizon embodied tasks, which leads to low success rates in real-world robot task planning. Method: We propose STEP, a closed-loop hierarchical planning framework that builds a cross-hierarchical subgoal tree from coarse to fine resolution. A foundation LLM (the subgoal decomposition model) breaks complex goals into subgoals. A leaf node termination model uses the current environment state to decide when a subgoal can be converted directly into a primitive action, at which point expansion of that branch stops. Contribution/Results: The key idea is to pair subgoal decomposition with a separate, state-driven termination model, so that the depth of decomposition adapts to the environment. On the VirtualHome WAH-NL benchmark and on a physical robot, STEP reaches success rates of up to 34% and 25%, respectively, outperforming state-of-the-art methods.


📝 Abstract
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby expanding the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to stop expanding the tree and ensuring that each leaf node can be directly converted into a primitive action. Experiments conducted on both the VirtualHome WAH-NL benchmark and real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates of up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
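The tree construction described in the abstract can be sketched as follows. This is a minimal, runnable illustration under stated assumptions, not the authors' implementation. `decompose`, `terminate`, and `execute` are hypothetical plain-Python stand-ins for the LLM-based subgoal decomposition model, the leaf node termination model, and the robot or simulator. The depth-first order, the `max_depth` cap, and the choice to execute each leaf as soon as it is reached (so that later decisions see the updated environment state) are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = Dict[str, str]
# Hypothetical interfaces standing in for STEP's two models and the environment.
DecomposeFn = Callable[[str, State], List[str]]        # goal -> subgoals
TerminateFn = Callable[[str, State], Optional[str]]    # goal -> primitive action or None
ExecuteFn = Callable[[str, State], State]              # action -> new environment state


@dataclass
class SubgoalNode:
    goal: str
    depth: int = 0                      # 0 = coarsest resolution (the task itself)
    children: List["SubgoalNode"] = field(default_factory=list)
    action: Optional[str] = None        # set only on leaf nodes


def expand(node: SubgoalNode, state: State, decompose: DecomposeFn,
           terminate: TerminateFn, execute: ExecuteFn,
           trace: List[str], max_depth: int = 5) -> State:
    """Expand the tree depth-first; return the environment state after this subtree."""
    action = terminate(node.goal, state)
    if action is not None:
        # Leaf: the termination check maps this subgoal to a primitive action.
        node.action = action
        trace.append(action)
        return execute(action, state)   # closed loop: feed the new state back
    if node.depth >= max_depth:
        raise RuntimeError(f"no primitive action found for {node.goal!r}")
    for sub in decompose(node.goal, state):
        child = SubgoalNode(sub, node.depth + 1)
        node.children.append(child)
        state = expand(child, state, decompose, terminate, execute, trace, max_depth)
    return state


# Toy stand-ins (hypothetical): object -> location is the whole environment state.
def toy_decompose(goal: str, state: State) -> List[str]:
    if goal == "set the table":
        return ["place tableware", "place food"]
    if goal == "place tableware":
        # State-aware: skip objects that are already on the table.
        return [f"put {o} on table" for o in ("plate", "cup") if state.get(o) != "table"]
    if goal == "place food":
        return ["put bread on table"]
    return []


def toy_terminate(goal: str, state: State) -> Optional[str]:
    words = goal.split()
    return f"put({words[1]}, table)" if words[0] == "put" else None


def toy_execute(action: str, state: State) -> State:
    obj = action[len("put("):action.index(",")]
    return {**state, obj: "table"}


root = SubgoalNode("set the table")
trace: List[str] = []
final = expand(root, {"cup": "table"}, toy_decompose, toy_terminate, toy_execute, trace)
print(trace)  # ['put(plate, table)', 'put(bread, table)']
```

In the toy run the cup starts on the table, so the state-aware decomposition skips it and only two primitive actions reach the leaves.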
Problem

Research questions and friction points this paper is trying to address.

Improving long-horizon task planning for robots using LLMs
Enhancing subgoal decomposition for complex embodied tasks
Increasing success rates in real-world robotic task completion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subgoal tree construction for long-horizon planning
Hierarchical tree from coarse to fine resolutions
Closed-loop subgoal decomposition and termination models
Authors
Tianxing Zhou
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Zhirui Wang
Aerospace Information Research Institute, Chinese Academy of Sciences
Remote sensing image interpretation, target detection, target recognition
Haojia Ao
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Guangyan Chen
Beijing Institute of Technology
Boyang Xing
Humanoid Robot (Shanghai) Co., Ltd., Shanghai, 200093, China
Jingwen Cheng
Tsinghua University
Deep Learning, Complex System
Yi Yang
School of Automation, Beijing Institute of Technology, Beijing, 100081, China
Yufeng Yue
School of Automation, Beijing Institute of Technology, Beijing, 100081, China