🤖 AI Summary
This preliminary study addresses the lack of systematic integration of large language models (LLMs) into hierarchical planning (HP), a subfield of automated planning that leverages hierarchical knowledge to improve planning performance. The authors propose a roadmap for harnessing LLMs in HP, organized around a taxonomy of integration methods that spans the HP life cycle. Key contributions include: (1) a taxonomy of ways LLMs can be integrated into HP; (2) a benchmark with a standardized dataset for evaluating future LLM-based HP approaches; and (3) initial baseline results comparing a state-of-the-art HP planner with an LLM planner, where the LLM planner produces only 3% correct plans and none with a correct hierarchical decomposition. Together, these contributions fill a gap in the systematic application of LLMs to hierarchical planning and establish a reference point for future work.
📝 Abstract
Recent advances in Large Language Models (LLMs) are fostering their integration into several reasoning-related fields, including Automated Planning (AP). However, their integration into Hierarchical Planning (HP), a subfield of AP that leverages hierarchical knowledge to enhance planning performance, remains largely unexplored. In this preliminary work, we propose a roadmap to address this gap and harness the potential of LLMs for HP. To this end, we present a taxonomy of integration methods, exploring how LLMs can be utilized within the HP life cycle. Additionally, we provide a benchmark with a standardized dataset for evaluating the performance of future LLM-based HP approaches, and present initial results for a state-of-the-art HP planner and an LLM planner. As expected, the latter exhibits limited performance (3% correct plans, and none with a correct hierarchical decomposition) but serves as a valuable baseline for future approaches.