🤖 AI Summary
This work investigates whether large language models (LLMs) can perform zero-shot goal recognition, inferring an agent's goal from its observed behavior without task-specific training examples. It presents the first systematic zero-shot evaluation of frontier LLMs as goal recognizers on classical PDDL planning benchmarks, and proposes goal recognition as a benchmark for LLMs' foundational planning knowledge. Using zero-shot prompting, the study analyzes how several state-of-the-art LLMs integrate world knowledge with observed evidence. The results show that some models become more accurate as more observations become available, approaching the accuracy of landmark-based methods at full observations, while others rely too heavily on prior knowledge and fail to incorporate new evidence. This divergence points to a fundamental difference in how the models integrate evidence.
📝 Abstract
Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.