🤖 AI Summary
Small-scale open-source large language models (LLMs) lack explicit planning capabilities, hindering their reasoning performance and generalization on complex problem-solving tasks.
Method: We propose a unified post-training framework that distills synthetic planning trajectories—i.e., task decomposition paths—generated by stronger LMs. The framework jointly employs supervised learning to imitate stepwise decomposition and reinforcement learning to optimize final answer correctness, thereby inducing step-by-step planning behavior in smaller models without architectural modifications or inference-time overhead.
Contribution/Results: Our approach significantly enhances complex reasoning: it outperforms strong baselines by an average of 7% on GSM8K and MATH, and achieves ~10% and ~12% gains on OlympiadBench and AIME 2024, respectively. These results demonstrate the effectiveness and cross-domain generalizability of planning-structure distillation for boosting small-model reasoning.
📝 Abstract
Recently, decomposing a complex problem into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed "planning trajectories") from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On the GSM8K and MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization on out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving the task-specific performance of smaller LLMs.
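The two training objectives can be sketched formally as follows. This is our own hedged notation, not taken from the paper: for an input problem $x$, let $z$ denote the distilled planning trajectory (task decomposition), $a$ the final solution, and $y = (z, a)$ their concatenation generated by the student model $p_\theta$.

```latex
% Supervised objective: imitate the teacher's planning trajectory and solution
% token by token (standard next-token cross-entropy over the concatenation y).
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right)

% RL objective: maximize expected final-answer correctness under the
% student's own sampled outputs (a plausible 0/1 outcome reward).
J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\, r(y) \,\right],
\qquad
r(y) = \mathbb{1}\{\text{final answer in } y \text{ is correct}\}
```

The supervised term induces the stepwise decomposition behavior, while the reward term anchors that behavior to answer correctness; the exact loss weighting and RL algorithm used by PLAN-TUNING are not specified in this summary.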