Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation

📅 2025-07-13
🤖 AI Summary
This work investigates whether long chain-of-thought (CoT) reasoning can be elicited in base language models using only a small number of high-quality CoT examples and lightweight fine-tuning, without reinforcement learning or distillation from larger models. Method: The approach combines prompt engineering, multi-round structured editing, and parameter-efficient fine-tuning, using just 20 high-precision CoT samples (generated by a strong reasoning model and validated by human experts) to fine-tune Qwen2.5-32B. Contribution/Results: The fine-tuned model shows substantial gains in mathematical and logical reasoning and surpasses the larger Qwen2.5-Math-72B-Instruct across multiple benchmarks. Crucially, the results show that carefully curated, human-validated CoT data transfers reasoning capability with exceptional efficiency, establishing a cost-effective paradigm for unlocking latent reasoning in base language models.
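The summary above centers on fine-tuning with roughly 20 curated CoT samples. As a rough, stdlib-only illustration of what such a tiny supervised dataset might look like, the sketch below packages curated samples into chat-style SFT records; the field names, system prompt, and `Final answer:` convention are assumptions for illustration, not the paper's actual data format.

```python
import json

# Illustrative system prompt; the paper's actual template is not specified here.
COT_SYSTEM_PROMPT = (
    "Solve the problem step by step. Show your full reasoning "
    "before stating the final answer."
)

def to_sft_record(problem: str, cot_trace: str, answer: str) -> dict:
    """Turn one curated CoT sample into a chat-style training record."""
    return {
        "messages": [
            {"role": "system", "content": COT_SYSTEM_PROMPT},
            {"role": "user", "content": problem},
            # The target completion is the full reasoning trace followed by
            # the final answer, so the model learns to emit long CoT first.
            {"role": "assistant",
             "content": f"{cot_trace}\n\nFinal answer: {answer}"},
        ]
    }

# With only ~20 samples, the whole dataset fits in a single JSONL file.
samples = [
    ("What is 12 * 13?",
     "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156.",
     "156"),
]
records = [to_sft_record(p, t, a) for p, t, a in samples]
print(json.dumps(records[0], indent=2))
```

A dataset in this shape can then be fed to any standard supervised fine-tuning pipeline (e.g. a parameter-efficient method such as LoRA, which the summary's "parameter-efficient fine-tuning" plausibly refers to).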

📝 Abstract
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model QwQ-32B-Preview, we lightly fine-tune the base model Qwen2.5-32B. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
Problem

Research questions and friction points this paper is trying to address.

Can long Chain-of-Thought reasoning be induced in base models without extensive tuning?
Does minimal fine-tuning with few high-quality examples unlock strong reasoning capabilities?
Can human-written or non-expert CoT data match reasoning model performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Light fine-tuning with few expert CoT examples
Prompt engineering and multi-pass editing
Analyzing key properties of reasoning data
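The last innovation, analyzing properties of reasoning data such as problem difficulty, diversity, and answer length, can be illustrated with simple dataset diagnostics. The stdlib-only sketch below computes two crude stand-in metrics over a set of CoT traces (whitespace-token length and distinct n-gram ratio); these metric definitions are illustrative assumptions, not the paper's actual analysis.

```python
from collections import Counter

def trace_stats(traces: list[str], n: int = 2) -> dict:
    """Crude diagnostics over a set of CoT traces.

    Length is measured in whitespace tokens; diversity as the ratio of
    distinct n-grams to total n-grams across all traces (higher means
    more varied reasoning text). Both are rough proxies for the richer
    properties (difficulty, topic diversity) studied in the paper.
    """
    lengths = [len(t.split()) for t in traces]
    ngrams = Counter()
    for t in traces:
        toks = t.split()
        ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(ngrams.values())
    return {
        "mean_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
        "distinct_ngram_ratio": len(ngrams) / total if total else 0.0,
    }
```

Running diagnostics like these before fine-tuning makes it easy to compare candidate CoT datasets (e.g. reasoning-model traces vs. human-written ones) on length and surface diversity.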
Wei Du
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Branislav Kisacanin
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA; Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia; Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovica 3, 21000 Novi Sad, Serbia; AwesomeMath, P.O. Box 261490, Plano, TX 75026, USA
George Armstrong
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Shubham Toshniwal
Senior Research Scientist, NVIDIA
Reasoning, Memory, NLP
Ivan Moshkov
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Alexan Ayrapetyan
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Sadegh Mahdavi
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Dan Zhao
Nvidia Corporation, 2788 San Tomas Expy, Santa Clara, CA 95051, USA
Shizhe Diao
NVIDIA Research
Large Language Models, Natural Language Processing
Dragan Masulovic
Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Trg Dositeja Obradovica 3, 21000 Novi Sad, Serbia
Marius Stanean
AwesomeMath, P.O. Box 261490, Plano, TX 75026, USA
Advaith Avadhanam
Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
Max Wang
Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, 209 South 33rd Street, Philadelphia, PA 19104, USA
Ashmit Dutta
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
Shitij Govil
Georgia Institute of Technology
Reinforcement Learning, AI for Science
Sri Yanamandara
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
Mihir Tandon
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
Sriram Ananthakrishnan
University of Chicago, 5801 S Ellis Ave, Chicago, IL 60637, USA
Vedant Rathi
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
David Zhang
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
Joonseok Kang
University of Illinois Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801
Leon Luo
University of Chicago, 5801 S Ellis Ave, Chicago, IL 60637, USA
Titu Andreescu
AwesomeMath, P.O. Box 261490, Plano, TX 75026, USA
Boris Ginsburg
NVIDIA
Deep Learning, Speech Recognition, Speech Synthesis
Igor Gitman
Applied Scientist, NVIDIA
Large Language Models, Math Reasoning, Deep Learning