🤖 AI Summary
Despite its computational efficiency, knowledge distillation's superiority over zero-shot reinforcement learning (zero-RL) in enhancing large language models' (LLMs) reasoning capabilities remains poorly understood. Method: The authors conduct token-level frequency analysis and cognitive behavioral pattern identification on models trained via supervised knowledge distillation (using only 920 high-quality examples) versus zero-RL. Contribution/Results: They demonstrate that distillation, unlike zero-RL, specifically amplifies two higher-order cognitive behaviors critical for flexible reasoning: multi-perspective thinking and metacognitive awareness. Distilled models outperform zero-RL counterparts on complex reasoning tasks and exhibit significantly higher usage of logical connectives and anthropomorphic expressions, empirically validating a strong association between these cognitive behaviors and reasoning ability. This work uncovers the intrinsic cognitive mechanisms underlying distillation-driven reasoning improvement and establishes a novel, resource-efficient paradigm for LLM alignment.
📝 Abstract
Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method applied to the base model can clearly outperform zero-RL, which typically requires far more data and compute. By analyzing token frequencies in model outputs, we find that the distilled model exhibits more flexible reasoning: it uses anthropomorphic tokens and logical connectives much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost their frequency.
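The token-frequency comparison described above can be sketched as a simple lexicon-counting routine. This is a minimal illustrative sketch, not the paper's actual analysis code; the lexicons `LOGICAL_CONNECTIVES` and `ANTHROPOMORPHIC_TOKENS` below are hypothetical placeholders, since the paper's exact token lists are not given here.

```python
import re
from collections import Counter

# Hypothetical lexicons for illustration only; the paper's exact token
# lists for logical connectives and anthropomorphic tokens may differ.
LOGICAL_CONNECTIVES = {"however", "therefore", "thus", "because", "alternatively"}
ANTHROPOMORPHIC_TOKENS = {"wait", "hmm", "maybe", "actually"}

def token_frequencies(outputs):
    """Return per-1000-token frequency of each lexicon across model outputs.

    `outputs` is a list of generated reasoning traces (strings).
    """
    # Crude whitespace/punctuation tokenization; a real analysis would
    # likely use the model's own tokenizer.
    tokens = [t for text in outputs for t in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(tokens)
    total = max(len(tokens), 1)

    def per_thousand(lexicon):
        return 1000.0 * sum(counts[t] for t in lexicon) / total

    return {
        "logical_connectives": per_thousand(LOGICAL_CONNECTIVES),
        "anthropomorphic": per_thousand(ANTHROPOMORPHIC_TOKENS),
    }
```

Comparing `token_frequencies(distilled_outputs)` against `token_frequencies(zero_rl_outputs)` would then surface the frequency gap the abstract reports.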