🤖 AI Summary
This work systematically evaluates the feasibility of replacing classical planners (e.g., Fast Downward) with large language models (LLMs) for robot task planning. Methodologically, it introduces zero-shot PDDL prompting—feeding PDDL domain and problem files directly to LLMs across multiple benchmarks—and quantifies plan executability via execution fidelity. Results show that while LLMs achieve moderate success on simple tasks, their performance degrades substantially on complex scenarios, revealing fundamental limitations in maintaining state consistency, modeling resource constraints, and performing precise logical reasoning. The key contributions are: (1) the first cross-benchmark evaluation framework comparing LLMs and classical planners specifically for robot planning; (2) empirical identification of critical weaknesses in LLMs' structured reasoning capabilities; and (3) the proposal of an "LLM + classical planner" hybrid paradigm as a practical path toward robust, scalable robotic planning systems.
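The evaluation pipeline described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual code: the prompt wording, the toy single-ball simulator, and the function names (`build_zero_shot_prompt`, `execution_fidelity`, `apply_action`) are all assumptions made for clarity.

```python
def build_zero_shot_prompt(domain_pddl: str, problem_pddl: str) -> str:
    """Assemble a zero-shot prompt that feeds raw PDDL files to an LLM.

    Hypothetical prompt format; the paper's exact wording may differ.
    """
    return (
        "You are a planner. Output a valid plan, one action per line.\n\n"
        f";; Domain\n{domain_pddl}\n\n"
        f";; Problem\n{problem_pddl}\n\nPlan:"
    )


def execution_fidelity(plan, apply_action, state):
    """Fraction of the plan that executes in order before the first
    action whose preconditions fail (one plausible fidelity metric)."""
    executed = 0
    for action in plan:
        next_state = apply_action(state, action)
        if next_state is None:  # precondition violated: stop executing
            break
        state = next_state
        executed += 1
    return executed / len(plan) if plan else 0.0


def apply_action(state, action):
    """Toy gripper-style simulator: a single ball can be picked up
    when on the table and dropped when held; anything else fails."""
    if action == "(pick ball)" and state == "on-table":
        return "holding"
    if action == "(drop ball)" and state == "holding":
        return "on-table"
    return None
```

For example, a generated plan whose third action repeats `(drop ball)` while nothing is held executes only its first two actions, giving a fidelity of 2/3.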
📝 Abstract
Recent advancements in Large Language Models have sparked interest in their potential for robotic task planning. While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain. This paper presents a systematic evaluation of a broad spectrum of state-of-the-art language models, each directly prompted with Planning Domain Definition Language (PDDL) domain and problem files, and compares their planning performance with the Fast Downward planner across a variety of benchmarks. In addition to measuring success rates, we assess how faithfully the generated plans translate into sequences of actions that can actually be executed, identifying both strengths and limitations of using these models in this setting. Our findings show that while the models perform well on simpler planning tasks, they continue to struggle with more complex scenarios that require precise resource management, consistent state tracking, and strict constraint compliance. These results underscore fundamental challenges in applying language models to robotic planning in real-world environments. By outlining the gaps that emerge during execution, we aim to guide future research toward combined approaches that integrate language models with classical planners in order to enhance the reliability and scalability of planning in autonomous robotics.