🤖 AI Summary
This work identifies a novel risk in language model knowledge distillation—"teacher hacking"—where student models over-optimize an imperfect teacher model, deviating from the true data distribution and suffering degraded performance on the true objective. Method: We establish an oracle–teacher–student three-tier controlled experimental setup to formally define and empirically validate this phenomenon; identify distillation on a fixed offline dataset as the primary trigger and demonstrate that online data generation significantly mitigates hacking; and identify deviation from polynomial convergence laws as an effective detection signal. Contribution/Results: We empirically confirm teacher hacking across diverse model families and tasks, show that it is both detectable (via convergence diagnostics) and mitigatable (via data diversity and online generation), and provide insights and practical engineering guidelines for robust, efficient language model distillation.
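To make the detection signal concrete, here is a toy sketch (not the work's actual diagnostic) of flagging teacher hacking as a departure from a polynomial (power-law) convergence trend in the distillation loss. All loss curves, the fitting window `n_fit`, and the tolerance `tol` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def power_law_deviation(steps, losses, n_fit=20, tol=0.05):
    """Fit loss ~ a * t^b on the first n_fit points (linear fit in
    log-log space), then return the maximum relative deviation of the
    later points from the extrapolated trend, plus a pass/fail flag."""
    t, y = np.asarray(steps, float), np.asarray(losses, float)
    b, log_a = np.polyfit(np.log(t[:n_fit]), np.log(y[:n_fit]), 1)
    pred = np.exp(log_a) * t ** b
    dev = float(np.max(np.abs(y[n_fit:] - pred[n_fit:]) / pred[n_fit:]))
    return dev, dev <= tol

steps = np.arange(1, 101)

# Healthy run: the loss follows a clean polynomial law t^(-0.5).
healthy = 2.0 * steps ** -0.5

# "Hacked" run: late in training the loss breaks away from the
# power-law trend, which the diagnostic flags as a deviation.
hacked = healthy.copy()
hacked[60:] *= 0.7

print(power_law_deviation(steps, healthy))  # small deviation, True
print(power_law_deviation(steps, hacked))   # large deviation, False
```

The design choice here is deliberately simple: a log-log linear fit on the early trajectory gives the power-law exponent, and any sustained relative gap between later observations and the extrapolation is treated as the warning sign.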
📝 Abstract
Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second, RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. This phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. It could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When a fixed offline dataset is used for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
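The oracle–teacher–student setup can be sketched with toy categorical distributions: distillation optimizes a proxy metric (distance between student and teacher), while the quantity we actually care about is the golden metric (distance between student and oracle). The distributions and the two hypothetical students below are illustrative assumptions, chosen only to show how the two metrics can disagree.

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

oracle  = np.array([0.5, 0.3, 0.2])    # ground-truth distribution
teacher = np.array([0.6, 0.25, 0.15])  # imperfect approximation of the oracle

# Two hypothetical students: one that matches the teacher exactly
# (over-optimizing the proxy) and one that stays closer to the oracle.
student_hacked = teacher.copy()
student_robust = np.array([0.55, 0.27, 0.18])

for name, s in [("hacked", student_hacked), ("robust", student_robust)]:
    proxy  = kl(teacher, s)  # what distillation optimizes
    golden = kl(oracle, s)   # what we actually care about
    print(f"{name}: proxy={proxy:.4f}, golden={golden:.4f}")
```

In this toy example the "hacked" student drives the proxy metric to zero yet scores worse on the golden metric than the "robust" student, which is the Goodhart-style gap the paper studies.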