🤖 AI Summary
This work addresses the distribution shift and error accumulation that arise in offline reasoning distillation because the student is trained on teacher-generated prefixes but conditions on its own autoregressively generated prefixes at inference time. To mitigate this, the paper proposes a principled offline distillation framework that adaptively weights teacher supervision according to how well it aligns with the student's on-policy distribution. The method operates entirely offline, without online sampling, and corrects for teacher-student distribution drift to improve the stability and accuracy of the student's reasoning. Experiments on mathematical reasoning benchmarks, including GSM8K, MATH, MATH500, AMC, AIME, and OlympiadBench, show that the approach improves reasoning accuracy over prior offline distillation methods and yields more stable reasoning traces while preserving instruction-following capabilities.
📝 Abstract
Distilling reasoning traces from strong large language models into smaller ones is a promising route to improving intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference it conditions autoregressively on its own self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces early in training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
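The abstract does not specify how supervision is reweighted, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes, hypothetically, that each teacher token is weighted by a softmax over the student's own log-probabilities for that token, so that tokens the student already finds plausible (closer to its on-policy distribution) contribute more to the loss. The function names `alignment_weights` and `weighted_nll` and the `temperature` parameter are invented for this example.

```python
import math


def alignment_weights(student_logprobs, temperature=1.0):
    """Hypothetical per-token weights for one teacher trace.

    student_logprobs[t] is the student's log-probability of the teacher's
    token at position t. Weights are a softmax over these values, rescaled
    so that they average to 1 across the trace. This weighting rule is an
    assumption made for illustration only.
    """
    scaled = [lp / temperature for lp in student_logprobs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    n = len(student_logprobs)
    return [n * e / total for e in exps]


def weighted_nll(student_logprobs, temperature=1.0):
    """Weighted negative log-likelihood of the teacher trace.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); here they are plain floats.
    """
    weights = alignment_weights(student_logprobs, temperature)
    n = len(student_logprobs)
    return sum(w * -lp for w, lp in zip(weights, student_logprobs)) / n


if __name__ == "__main__":
    # Second token is unlikely under the student, so it is down-weighted.
    trace = [-0.1, -3.0, -0.5]
    print(alignment_weights(trace))
    print(weighted_nll(trace))
```

When every token has the same student log-probability, the weights are all 1 and the loss reduces to the ordinary negative log-likelihood used in standard offline distillation. A larger `temperature` flattens the weights toward that uniform case.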