🤖 AI Summary
Large language models (LLMs) exhibit insufficient high-level planning capabilities for complex mathematical reasoning and code generation, and relying solely on few-shot prompting fails to reliably elicit high-quality explicit planning. Method: We propose targeted training, rather than prompting alone, to improve explicit plan generation. To this end, we introduce CRISP, the first dataset specifically designed for high-level planning, featuring automated high-level plan generation coupled with dual validation (LLM-based adjudication and downstream task evaluation), and adopt a small-model fine-tuning strategy. Contribution/Results: Experiments demonstrate that fine-tuned small models significantly outperform large models using few-shot chain-of-thought prompting. This work provides the first systematic empirical validation that explicit planning requires dedicated training. Moreover, it reveals strong cross-domain transferability of planning capabilities, establishing a new paradigm for interpretable and generalizable complex reasoning.
📝 Abstract
Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated, both intrinsically (using an LLM as a judge) and extrinsically (by evaluating their impact on downstream task performance). We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.
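The dual-validation idea in the abstract can be sketched as a simple filtering step. This is a hypothetical illustration, not the paper's actual pipeline: the scoring functions below are stubs standing in for an LLM judge and a downstream solver run, and all names (`judge_plan`, `downstream_success`, `validate`) are invented for this example.

```python
# Hypothetical sketch of CRISP-style dual validation of candidate plans.
# A real pipeline would call an LLM-as-judge for the intrinsic score and
# actually solve the task conditioned on the plan for the extrinsic check;
# both are stubbed here with toy heuristics for illustration only.

def judge_plan(plan: str) -> float:
    """Intrinsic check (stub): an LLM judge would score plan quality in [0, 1]."""
    # Stand-in heuristic: reward plans that spell out multiple explicit steps.
    steps = [line for line in plan.splitlines() if line.strip()]
    return min(len(steps) / 4, 1.0)

def downstream_success(plan: str, problem: str) -> bool:
    """Extrinsic check (stub): does conditioning a solver on the plan succeed?"""
    # Stand-in: a real pipeline would prompt a solver model with the plan and
    # verify the final answer (math) or run generated code against tests.
    return "step" in plan.lower()

def validate(plans: list[str], problem: str, judge_threshold: float = 0.75) -> list[str]:
    """Keep only plans that pass BOTH the intrinsic and extrinsic checks."""
    return [
        plan for plan in plans
        if judge_plan(plan) >= judge_threshold
        and downstream_success(plan, problem)
    ]
```

Filtering on both signals jointly means a plan that merely "sounds good" to the judge but does not help solve the task is discarded, and vice versa, which is the rationale for pairing the two checks.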