🤖 AI Summary
This work investigates whether long chain-of-thought (CoT) reasoning capabilities can be effectively elicited in base language models using only a small number of high-quality CoT examples and lightweight fine-tuning—without resorting to reinforcement learning or large-model distillation.
Method: We propose a synergistic approach integrating prompt engineering, multi-round structured editing, and parameter-efficient fine-tuning, leveraging merely 20 high-precision CoT samples—generated by advanced reasoning models and rigorously validated by human experts—to optimize Qwen2.5-32B.
Contribution/Results: Our method yields substantial improvements in mathematical and logical reasoning; the fine-tuned model surpasses the larger Qwen2.5-Math-72B-Instruct across multiple benchmarks. Crucially, we empirically demonstrate that carefully curated, human-validated CoT data transfers reasoning capability with exceptional efficiency, establishing a cost-effective paradigm for unlocking latent reasoning potential in base language models.
📝 Abstract
Reasoning-capable language models achieve state-of-the-art performance on diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, prior work demonstrates that even short CoT prompting without fine-tuning can improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model `QwQ-32B-Preview`, we lightly fine-tune the base model `Qwen2.5-32B`. The resulting model outperforms the much larger `Qwen2.5-Math-72B-Instruct`, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning-model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
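The abstract highlights that problem difficulty, topical diversity, and trace length shape how well a small CoT set distills reasoning. As a minimal sketch of what curating such a 20-example set could look like, the snippet below greedily selects long, hard, topic-diverse traces from a candidate pool. The `CoTExample` fields, the difficulty scale, and the selection heuristic are all hypothetical illustrations, not the paper's actual procedure.

```python
from dataclasses import dataclass

@dataclass
class CoTExample:
    problem: str
    trace: str          # long chain-of-thought text
    difficulty: int     # hypothetical 1 (easy) .. 5 (hard) rating
    topic: str          # e.g. "algebra", "logic"

def select_cot_examples(pool, k=20, min_trace_chars=2000, min_difficulty=3):
    """Pick up to k examples that are long, hard, and topically diverse.
    A heuristic sketch: the paper reports these properties matter but
    does not prescribe this exact algorithm."""
    # Keep only sufficiently long and sufficiently hard traces.
    candidates = [e for e in pool
                  if len(e.trace) >= min_trace_chars
                  and e.difficulty >= min_difficulty]
    # Prefer harder, then longer, traces.
    candidates.sort(key=lambda e: (-e.difficulty, -len(e.trace)))
    selected, seen_topics = [], set()
    # First pass: one example per topic, to maximize diversity.
    for e in candidates:
        if e.topic not in seen_topics:
            selected.append(e)
            seen_topics.add(e.topic)
        if len(selected) == k:
            return selected
    # Second pass: fill remaining slots with the best leftovers.
    for e in candidates:
        if e not in selected:
            selected.append(e)
        if len(selected) == k:
            break
    return selected
```

The two-pass structure ensures diversity is satisfied before the remaining budget is spent on the strongest leftover candidates; with a budget as small as 20, such ordering decisions dominate the final mix.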