HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of distilling large reasoning models into smaller ones when constrained by the “teacher ceiling”—the limitation that teacher models often fail to solve complex problems independently. To overcome this, the authors propose an efficient, reinforcement learning–free distillation framework grounded in the Zone of Proximal Development theory. The framework integrates three key innovations: Guided Entropy-Assisted Repair (GEAR), a Perplexity-Uncertainty Ratio Estimator (PURE), and Progressive Answer-Guided Curriculum Evolution (PACE). Notably, it introduces an active intervention mechanism that combines dynamic entropy analysis with hindsight prompting, moving beyond conventional static sample filtering. This approach effectively leverages teacher failure cases to enhance student model performance. Experiments demonstrate that the method significantly outperforms standard supervised fine-tuning and other baselines across multiple reasoning benchmarks, confirming its effectiveness and generalizability.

📝 Abstract
Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex "corner-case" problems where the teacher fails to explore valid solutions independently, thereby creating an artificial "Teacher Ceiling" for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development (ZPD), HEAL synergizes three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.
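The abstract describes GEAR as detecting "critical reasoning breakpoints via entropy dynamics" before injecting hindsight hints. The paper's exact criterion is not given here, so the following is only a minimal illustrative sketch under assumed details: per-step Shannon entropy over the model's next-token distribution, with a breakpoint flagged when entropy spikes well above a trailing-window baseline. The function names, `window`, and `spike_factor` are all hypothetical, not from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def find_breakpoints(step_entropies, window=3, spike_factor=2.0):
    """Flag positions where entropy jumps above spike_factor times the mean
    of the preceding `window` steps -- a crude stand-in for GEAR's
    entropy-dynamics breakpoint detection (heuristic, not the paper's rule)."""
    breakpoints = []
    for i in range(window, len(step_entropies)):
        baseline = sum(step_entropies[i - window:i]) / window
        if baseline > 0 and step_entropies[i] > spike_factor * baseline:
            breakpoints.append(i)
    return breakpoints

# A flagged position would be where a hindsight hint is injected to repair
# the trajectory, per the framework's description.
print(find_breakpoints([0.5, 0.5, 0.5, 2.0, 0.5]))  # → [3]
```

In practice the per-step entropies would come from the student or teacher model's logits during generation; this sketch only shows the shape of the detection step, not the hint-injection or PURE filtering stages.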
Problem

Research questions and friction points this paper is trying to address.

reasoning distillation
teacher ceiling
corner-case problems
Large Reasoning Models
rejection sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning distillation
entropy-assisted learning
Zone of Proximal Development
hindsight intervention
curriculum evolution
Wenjing Zhang
Data Science & AI Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
Jiangze Yan
Data Science & AI Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
Jieyun Huang
Data Science & AI Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
Yi Shen
Data Science & AI Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
Shuming Shi
Tencent AI Lab
NLP, text understanding, knowledge mining, text generation, web search
Ping Chen
Data Science & AI Research Institute, China Unicom; Unicom Data Intelligence, China Unicom
Ning Wang
Huawei Inc., University of Science and Technology of China (USTC)
Computer Vision, Visual Tracking, Cross-modal
Zhaoxiang Liu
China Unicom
Computer Vision, Deep Learning, Robotics, Human-Computer Interaction
Kai Wang
China Unicom Digital Technology
3D Vision, Robotics, Augmented/Virtual Reality, Artificial Intelligence, Computer Graphics
Shiguo Lian
CloudMinds