π€ AI Summary
Current LLM-based planning research lacks the methodological rigor accumulated over six decades of classical automated planning, leading to recurring issues such as modeling bias and unreliable evaluation. To address this, we proposeβ for the first timeβa principled integration of classical planning paradigms (PDDL modeling, heuristic search, and standardized benchmark suites) with LLM-based reasoning, establishing a tripartite rigor framework encompassing problem modeling, benchmarking, and reproducible evaluation. We design a hybrid evaluation protocol and cross-paradigm analysis tools to foster community-wide consensus on evaluation standards. Our approach significantly reduces methodological error rates and provides both theoretical foundations and practical guidelines for developing trustworthy, interpretable, and verifiable LLM-based planners.
π Abstract
In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.