Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL)–enhanced planning in large language models (LLMs) rests on weak theoretical foundations and is vulnerable to reward design failures such as reward hacking and biased policy convergence. Method: We propose a graph-structured abstraction framework for rigorous theoretical analysis, and empirically compare policy gradient (PG) and Q-learning on the Blocksworld planning benchmark. Contribution/Results: We show that while PG improves accuracy, it collapses output diversity; by contrast, Q-learning—leveraging off-policy learning and explicit value estimation—better preserves behavioral diversity and mitigates reward hacking in offline settings. Our work is the first to systematically establish the critical role of exploration in planning generalization, revealing that supervised fine-tuning introduces spurious correlations, whereas RL-driven exploration enables causal path calibration. Integrating formal analysis with empirical validation, this study establishes a robust paradigm for RL-based LLM planning.

📝 Abstract
Recent reinforcement learning (RL) methods have substantially enhanced the planning capabilities of Large Language Models (LLMs), yet the theoretical basis for their effectiveness remains elusive. In this work, we investigate RL's benefits and limitations through a tractable graph-based abstraction, focusing on policy gradient (PG) and Q-learning methods. Our theoretical analyses reveal that supervised fine-tuning (SFT) may introduce co-occurrence-based spurious solutions, whereas RL achieves correct planning primarily through exploration, underscoring exploration's role in enabling better generalization. However, we also show that PG suffers from diversity collapse, where output diversity decreases during training and persists even after perfect accuracy is attained. By contrast, Q-learning provides two key advantages: off-policy learning and diversity preservation at convergence. We further demonstrate that careful reward design is necessary to prevent reward hacking in Q-learning. Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
Problem

Research questions and friction points this paper is trying to address.

Analyzing RL's theoretical benefits and limitations for LLM planning
Investigating exploration-driven generalization versus policy-gradient diversity collapse
Demonstrating the importance of reward design for preventing reward hacking in Q-learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses graph-based abstraction to analyze RL methods
Employs policy gradient and Q-learning for planning
Demonstrates exploration enables better generalization
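The diversity-collapse contrast above can be illustrated with a minimal toy sketch (not the paper's actual framework or benchmark, and the graph, learning rates, and step counts here are illustrative assumptions): two distinct paths from start to goal earn the same reward, so an ideal planner should keep probability on both. REINFORCE reinforces whichever path it happens to sample, and this rich-get-richer dynamic collapses the policy onto one path; tabular Q-learning with uniform off-policy exploration drives both Q-values to the same reward, so a softmax policy over Q stays spread across the optimal paths.

```python
import math
import random

random.seed(0)

# Toy planning task: from the start node, actions 0 and 1 follow two distinct
# paths that both reach the goal and earn the same reward of 1.0.
REWARD = [1.0, 1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# --- Policy gradient (REINFORCE, no baseline) ---
# Each update raises the logit of the sampled path; with equal rewards the
# policy drifts toward one path and quasi-absorbs there (diversity collapse).
logits = [0.0, 0.0]
lr = 0.5
for _ in range(5000):
    probs = softmax(logits)
    a = 0 if random.random() < probs[0] else 1
    r = REWARD[a]
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]  # d log pi / d logit_i
        logits[i] += lr * r * grad
pg_probs = softmax(logits)

# --- Tabular Q-learning (one-step episodes, off-policy) ---
# Actions are explored uniformly regardless of the current values, so both
# Q-values converge to 1.0 and a softmax over Q keeps both paths likely.
q = [0.0, 0.0]
alpha = 0.1
for _ in range(5000):
    a = random.randrange(2)
    q[a] += alpha * (REWARD[a] - q[a])
q_probs = softmax(q)

print("PG policy:        ", [round(p, 3) for p in pg_probs])
print("Q-learning policy:", [round(p, 3) for p in q_probs])
```

Running the sketch, the PG policy concentrates nearly all mass on a single path while the Q-derived policy remains close to uniform over the two equally good paths, mirroring the diversity-collapse versus diversity-preservation contrast the paper analyzes.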