On the Generalization Gap in LLM Planning: Tests and Verifier-Reward RL

📅 2026-01-20
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) possess transferable planning capabilities for PDDL-based tasks or merely rely on domain-specific memorization. The authors fine-tune a 1.7B-parameter LLM on ten IPC 2023 domains and evaluate its generalization in both in-domain and cross-domain settings. To diagnose the causes of generalization failure, they introduce three interventions: symbolic anonymization, compact plan serialization, and a verifier-reward reinforcement learning scheme that uses the VAL validator as a reward signal, which the paper presents as a novel contribution. Experimental results show an in-domain planning success rate of 82.9%, yet performance collapses to 0% in cross-domain scenarios. Although the verifier-based reward accelerates convergence, it fails to improve cross-domain generalization, revealing that the model remains highly sensitive to surface-level syntactic forms and lacks genuine planning abstraction.
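The verifier-reward scheme described in the summary amounts to a binary success signal from an external plan validator. A minimal sketch, assuming the KCL-Planning VAL `Validate` binary is on `PATH` and reports success with a line containing "Plan valid" (the binary name and output format are assumptions here, not details taken from the paper):

```python
import subprocess


def reward_from_val_output(stdout: str) -> float:
    """Success-focused signal: 1.0 for a valid plan, 0.0 otherwise.

    Assumes VAL prints a line containing "Plan valid" on success.
    """
    return 1.0 if "Plan valid" in stdout else 0.0


def val_reward(domain_file: str, problem_file: str, plan_file: str,
               val_bin: str = "Validate") -> float:
    """Run the VAL validator on a candidate plan and return a binary reward.

    `val_bin` is a hypothetical path to VAL's Validate executable.
    Any failure to run the validator is treated as an invalid plan.
    """
    try:
        out = subprocess.run(
            [val_bin, domain_file, problem_file, plan_file],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 0.0
    return reward_from_val_output(out.stdout)
```

In an RL fine-tuning loop, `val_reward` would score each sampled plan; the sparse 0/1 signal matches the "success-focused" reward the abstract describes.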

πŸ“ Abstract
Recent work shows that fine-tuned Large Language Models (LLMs) can achieve high valid plan rates on PDDL planning tasks. However, it remains unclear whether this reflects transferable planning competence or domain-specific memorization. In this work, we fine-tune a 1.7B-parameter LLM on 40,000 domain-problem-plan tuples from 10 IPC 2023 domains, and evaluate both in-domain and cross-domain generalization. While the model reaches an 82.9% valid plan rate in the in-domain setting, it achieves 0% on two unseen domains. To analyze this failure, we introduce three diagnostic interventions, namely (i) instance-wise symbol anonymization, (ii) compact plan serialization, and (iii) verifier-reward fine-tuning using the VAL validator as a success-focused reinforcement signal. Symbol anonymization and compact serialization cause significant performance drops despite preserving plan semantics, revealing strong sensitivity to surface representations. Verifier-reward fine-tuning reaches performance saturation in half the supervised training epochs, but does not improve cross-domain generalization. For the explored configurations, in-domain performance plateaus around 80%, while cross-domain performance collapses, suggesting that our fine-tuned model relies heavily on domain-specific patterns rather than transferable planning competence in this setting. Our results highlight a persistent generalization gap in LLM-based planning and provide diagnostic tools for studying its causes.
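The instance-wise symbol anonymization intervention can be illustrated with a small sketch: rename every object symbol in a plan to a canonical token in order of first appearance, so the plan's semantics are preserved while domain-specific surface names are erased. The token scheme (`obj0`, `obj1`, ...) and the choice to leave action names untouched are our assumptions for illustration, not the paper's exact procedure:

```python
def anonymize_plan(plan: str) -> str:
    """Rename objects in a PDDL-style plan to canonical tokens.

    Each distinct object symbol is mapped to obj0, obj1, ... in order
    of first appearance; action names are kept as-is. Illustrative
    sketch only; the paper's anonymization details may differ.
    """
    mapping: dict[str, str] = {}
    anonymized = []
    for step in plan.strip().splitlines():
        tokens = step.strip("() \t").split()
        action, args = tokens[0], tokens[1:]
        # setdefault assigns the next fresh token on first sight of a symbol
        anon_args = [mapping.setdefault(a, f"obj{len(mapping)}") for a in args]
        anonymized.append("(" + " ".join([action] + anon_args) + ")")
    return "\n".join(anonymized)
```

For example, `(pick-up block-a)` followed by `(stack block-a block-b)` becomes `(pick-up obj0)` and `(stack obj0 obj1)`: a model that truly abstracts over objects should be unaffected by this renaming, which is what the intervention tests.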
Problem

Research questions and friction points this paper is trying to address.

generalization gap
LLM planning
cross-domain generalization
domain-specific memorization
planning competence
Innovation

Methods, ideas, or system contributions that make the work stand out.

generalization gap
LLM planning
verifier-reward RL
symbol anonymization
cross-domain generalization
Valerio Belcamino
Department of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genoa, Viale Causa 13, 16145 Genoa, Italy
Nicholas Attolino
Department of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genoa, Viale Causa 13, 16145 Genoa, Italy
Alessio Capitanelli
AIKO S.r.l., Via dei Mille 22, 10123, Torino, Italy
Fulvio Mastrogiovanni
University of Genoa, Istituto Italiano di Tecnologia
Cognitive Systems · Cognitive Robotics · Embodied Cognition · Embodied AI · Physical AI