🤖 AI Summary
This study investigates whether large language models (LLMs) possess transferable planning capabilities for PDDL-based tasks or merely rely on domain-specific memorization. We fine-tune a 1.7B-parameter LLM on ten IPC 2023 domains and evaluate its generalization in both in-domain and cross-domain settings. To diagnose the causes of generalization failure, we introduce three interventions: symbolic anonymization, compact plan serialization, and a verifier-reward reinforcement learning scheme that uses the VAL validator as a reward signal, a novel contribution of this work. Experimental results show an in-domain planning success rate of 82.9%, yet performance collapses to 0% in cross-domain scenarios. Although the verifier-based reward accelerates convergence, it fails to enhance cross-domain generalization, revealing that LLMs remain highly sensitive to surface-level syntactic forms and lack genuine planning abstraction.
📝 Abstract
Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches an 82.9% valid plan rate under in-domain conditions, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions: (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.
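To make the first intervention concrete, the sketch below illustrates one plausible form of instance-wise symbol anonymization. This is an assumption about the general technique, not the paper's exact procedure: every non-reserved PDDL symbol is consistently renamed to an opaque token within a single instance, so plan semantics are preserved while surface names carry no domain information.

```python
import re

# Hypothetical sketch of instance-wise symbol anonymization (an illustrative
# assumption, not the paper's exact implementation). Reserved PDDL words are
# kept; every other symbol is mapped to an opaque token (sym0, sym1, ...),
# consistently within one instance.

PDDL_KEYWORDS = {
    "define", "domain", "problem", "requirements", "types", "predicates",
    "action", "parameters", "precondition", "effect", "and", "or", "not",
    "objects", "init", "goal", "strips", "typing",
}

def anonymize(pddl_text: str) -> str:
    mapping: dict[str, str] = {}

    def rename(m: re.Match) -> str:
        tok = m.group(0)
        # Strip the '?' marker on variables so ?crate and crate share a map.
        word = tok[1:] if tok.startswith("?") else tok
        if word.lower() in PDDL_KEYWORDS:
            return tok
        if word not in mapping:
            mapping[word] = f"sym{len(mapping)}"
        # Preserve the variable marker so the result stays syntactically valid.
        return ("?" if tok.startswith("?") else "") + mapping[word]

    # Match identifiers and ?variables; parentheses, colons, and the
    # hyphen used for typing are left untouched.
    return re.sub(r"\??[A-Za-z][\w-]*", rename, pddl_text)
```

Because the mapping is rebuilt per instance, two problems that share object names receive unrelated tokens, which is what makes the intervention a test of surface-form sensitivity rather than a harmless renaming.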