🤖 AI Summary
Large language models (LLMs) exhibit insufficient high-level planning capabilities for complex mathematical reasoning and code generation, and relying solely on few-shot prompting fails to reliably elicit high-quality explicit planning. Method: We propose targeted training, rather than prompting alone, to improve explicit plan generation. To this end, we introduce CRISP, the first dataset specifically designed for high-level planning, featuring automated high-level plan generation coupled with dual validation (LLM-based adjudication and downstream task evaluation), and adopt a small-model fine-tuning strategy. Contribution/Results: Experiments demonstrate that fine-tuned small models significantly outperform large models using few-shot chain-of-thought prompting. This work provides the first systematic empirical validation that explicit planning requires dedicated training. Moreover, it reveals strong cross-domain transferability of planning capabilities, establishing a new paradigm for interpretable and generalizable complex reasoning.
📝 Abstract
Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated, both intrinsically (using an LLM as a judge) and extrinsically (by evaluating their impact on downstream task performance). We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.
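The dual-validation idea in the abstract can be sketched as a simple filtering step. This is a hypothetical illustration, not the paper's actual pipeline: the scoring functions below are stubs standing in for an LLM judge and a downstream solver run, and all names (`judge_plan`, `downstream_success`, `validate`) are invented for this example.

```python
# Hypothetical sketch of CRISP-style dual validation of candidate plans.
# A real pipeline would call an LLM-as-judge for the intrinsic score and
# actually solve the task conditioned on the plan for the extrinsic check;
# both are stubbed here with toy heuristics for illustration only.

def judge_plan(plan: str) -> float:
    """Intrinsic check (stub): an LLM judge would score plan quality in [0, 1]."""
    # Stand-in heuristic: reward plans that spell out multiple explicit steps.
    steps = [line for line in plan.splitlines() if line.strip()]
    return min(len(steps) / 4, 1.0)

def downstream_success(plan: str, problem: str) -> bool:
    """Extrinsic check (stub): does conditioning a solver on the plan succeed?"""
    # Stand-in: a real pipeline would prompt a solver model with the plan and
    # verify the final answer (math) or run generated code against tests.
    return "step" in plan.lower()

def validate(plans: list[str], problem: str, judge_threshold: float = 0.75) -> list[str]:
    """Keep only plans that pass BOTH the intrinsic and extrinsic checks."""
    return [
        plan for plan in plans
        if judge_plan(plan) >= judge_threshold
        and downstream_success(plan, problem)
    ]
```

Filtering on both signals jointly means a plan that merely "sounds good" to the judge but does not help solve the task is discarded, and vice versa, which is the rationale for pairing the two checks.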