🤖 AI Summary
This work addresses a limitation of existing on-policy self-distillation methods: they weight the teacher's token-level supervision uniformly and so ignore how much the entropy of the teacher's predictive distribution varies across a chain-of-thought sequence. The authors propose Entropy-Guided Reinforced Self-Distillation (EGRSD), which sets each token's weight from three signals: a reward-grounded direction, the magnitude of the teacher-student likelihood ratio, and a confidence gate driven by the teacher's predictive entropy that down-weights high-entropy positions while keeping every token weight above a nonzero lower bound. A causal-lookahead variant, CL-EGRSD, separates sustained high-entropy spans from transient high-entropy positions whose following context quickly becomes low entropy. In experiments with Qwen3-4B and Qwen3-8B in thinking mode, EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
📝 Abstract
On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
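The abstract names the three signals but gives no formulas, so the sketch below only illustrates how such a weighting could be assembled. Everything specific in it is an assumption made for illustration and not the authors' definition: the linear gate shape, the `floor` and `clip` parameters, using the absolute log-likelihood gap as the magnitude, and the mean-over-window rule for separating sustained from transient high-entropy positions. Here the nonzero floor is applied to the gate factor.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_gate(h, h_max, floor=0.1):
    """Map entropy h in [0, h_max] to a gate in [floor, 1].

    Assumed linear shape: higher teacher entropy gives a smaller gate,
    but the gate never drops below the nonzero floor.
    """
    normalized = min(max(h / h_max, 0.0), 1.0)
    return floor + (1.0 - floor) * (1.0 - normalized)

def token_weights(teacher_dists, teacher_logp, student_logp,
                  reward_sign, floor=0.1, clip=2.0):
    """Combine direction, magnitude, and gate into one weight per token.

    teacher_dists: per-position teacher distributions over the vocabulary.
    teacher_logp / student_logp: log-probabilities of the sampled tokens.
    reward_sign: +1.0 or -1.0, a hypothetical reward-grounded direction.
    """
    h_max = math.log(len(teacher_dists[0]))  # entropy of a uniform distribution
    weights = []
    for dist, lt, ls in zip(teacher_dists, teacher_logp, student_logp):
        magnitude = min(abs(lt - ls), clip)  # clipped log-likelihood-ratio size
        gate = confidence_gate(entropy(dist), h_max, floor)
        weights.append(reward_sign * magnitude * gate)
    return weights

def classify_high_entropy(entropies, threshold, lookahead):
    """Label each position 'low', 'transient', or 'sustained'.

    Assumed rule: a high-entropy position is 'sustained' when the mean
    entropy of the next `lookahead` positions also exceeds the threshold.
    With no following context it is labelled 'transient'.
    """
    labels = []
    for t, h in enumerate(entropies):
        if h <= threshold:
            labels.append("low")
            continue
        window = entropies[t + 1 : t + 1 + lookahead]
        if window and sum(window) / len(window) > threshold:
            labels.append("sustained")
        else:
            labels.append("transient")
    return labels
```

In this sketch a position where the teacher is certain keeps its full likelihood-ratio magnitude, while a maximally uncertain position is scaled down to `floor` times that magnitude. How CL-EGRSD actually reweights the two kinds of high-entropy position is not stated in the abstract, so the classifier stops at labelling.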