🤖 AI Summary
This work addresses the challenge of distilling large reasoning models into smaller ones when constrained by the “teacher ceiling”—the limitation that teacher models often fail to solve complex problems independently. To overcome this, the authors propose an efficient, reinforcement learning–free distillation framework grounded in the Zone of Proximal Development theory. The framework integrates three key innovations: Guided Entropy-Assisted Repair (GEAR), a Perplexity-Uncertainty Ratio Estimator (PURE), and Progressive Answer-Guided Curriculum Evolution (PACE). Notably, it introduces an active intervention mechanism that combines dynamic entropy analysis with hindsight prompting, moving beyond conventional static sample filtering. This approach effectively leverages teacher failure cases to enhance student model performance. Experiments demonstrate that the method significantly outperforms standard supervised fine-tuning and other baselines across multiple reasoning benchmarks, confirming its effectiveness and generalizability.
📝 Abstract
Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex "corner-case" problems where the teacher fails to explore valid solutions independently, thereby creating an artificial "Teacher Ceiling" for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development (ZPD), HEAL synergizes three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.
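To make the "detects critical reasoning breakpoints via entropy dynamics" idea concrete, here is a minimal sketch of one plausible detector: compute the Shannon entropy of the teacher's next-token distribution at each step and flag positions where entropy spikes well above the trajectory's running statistics. The function names (`token_entropy`, `find_breakpoints`) and the z-score threshold are illustrative assumptions, not the paper's actual GEAR implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_breakpoints(step_distributions, z_thresh=1.5):
    """Flag positions whose entropy exceeds the trajectory mean by more than
    z_thresh standard deviations -- a hypothetical proxy for the 'critical
    reasoning breakpoints' that GEAR-style entropy analysis might target.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    mean = sum(entropies) / len(entropies)
    var = sum((e - mean) ** 2 for e in entropies) / len(entropies)
    std = math.sqrt(var)
    if std < 1e-8:
        return []  # essentially flat entropy profile: no spikes to repair
    return [i for i, e in enumerate(entropies) if (e - mean) / std > z_thresh]
```

In this toy view, a confident model produces peaked (low-entropy) distributions, while a near-uniform distribution at some step marks uncertainty where a hindsight hint could be injected to repair the trajectory.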