Distillation Scaling Laws

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the lack of principled guidance for allocating a compute budget between teacher and student models in knowledge distillation. It establishes a distillation scaling law that quantitatively characterizes how student performance varies with the total budget and its split between teacher and student. The law is fitted from a large-scale study of distillation across model sizes and yields compute-optimal allocation recipes for two practical scenarios: a teacher already exists, or a teacher must also be trained. Key findings include: (1) when a teacher already exists or many students are to be distilled, distillation outperforms supervised pretraining up to a compute level that grows predictably with student size; (2) when a single student is to be distilled and the teacher must also be trained, supervised pretraining is the better use of compute; and (3) the resulting allocation recipes provide a reusable compute-configuration paradigm for industrial-scale model compression, substantially reducing the risks of using distillation at scale.

📝 Abstract
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.
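The abstract's core question, how to split a fixed compute budget between training a teacher and distilling a student (its scenario 2), can be sketched numerically. The loss forms, coefficients, and the distillation penalty below are illustrative Chinchilla-style assumptions for the sketch only, not the paper's fitted law:

```python
# Illustrative sketch: splitting a fixed FLOP budget between teacher
# training and student distillation. Every functional form and constant
# here is an assumption for illustration, NOT the paper's fitted law.

def loss(n_params, n_tokens):
    # Hypothetical supervised pretraining loss L = E + A/N^a + B/D^b.
    return 1.7 + 400.0 / n_params ** 0.34 + 410.0 / n_tokens ** 0.28

def student_loss_distilled(n_student, d_student, teacher_loss):
    # Hypothetical distilled-student loss: distillation halves the
    # student's gap to the teacher but cannot beat the teacher here.
    own = loss(n_student, d_student)
    return teacher_loss + 0.5 * max(0.0, own - teacher_loss)

def best_teacher_fraction(total_flops, n_student, n_teacher):
    # Grid-search the fraction of the budget spent on teacher training,
    # using the standard FLOPs ~= 6 * N * D training-cost estimate.
    best = None
    for f in [i / 20 for i in range(1, 20)]:
        d_teacher = f * total_flops / (6 * n_teacher)
        d_student = (1 - f) * total_flops / (6 * n_student)
        lt = loss(n_teacher, d_teacher)
        ls = student_loss_distilled(n_student, d_student, lt)
        if best is None or ls < best[1]:
            best = (f, ls)
    return best

# Example: 1e21 FLOPs shared by a 400M student and a 2B teacher.
frac, best_ls = best_teacher_fraction(1e21, 4e8, 2e9)
```

Under these assumed forms the optimum lands at an interior fraction: too little teacher compute yields a weak teacher to imitate, too much starves the student of distillation tokens.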
Problem

Research questions and friction points this paper is trying to address.

Estimate distilled model performance
Optimize compute allocation
Compare distillation and supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates performance via compute budget
Maximizes student model performance
Provides optimal distillation recipes
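The recipe for the abstract's scenario 1 (a teacher already exists) can likewise be sketched: sweep the student's compute budget and find the level past which supervised pretraining matches distillation from the fixed teacher. Again, all forms and constants are assumptions for illustration, not the paper's fit:

```python
# Illustrative crossover sweep for scenario 1: the teacher already
# exists, so only student compute varies. All functional forms and
# constants are assumptions for illustration, NOT the paper's fit.

def supervised_loss(n_params, n_tokens):
    # Hypothetical supervised pretraining loss L = E + A/N^a + B/D^b.
    return 1.7 + 400.0 / n_params ** 0.34 + 410.0 / n_tokens ** 0.28

def distilled_loss(own_loss, teacher_loss):
    # Hypothetical: distillation halves the student's gap to the
    # teacher but cannot push the student below the teacher's loss.
    return teacher_loss + 0.5 * max(0.0, own_loss - teacher_loss)

def crossover_flops(n_student, teacher_loss, budgets):
    # Return the first budget at which supervised pretraining matches
    # or beats distillation, using FLOPs ~= 6 * N * D.
    for c in budgets:
        d = c / (6 * n_student)
        own = supervised_loss(n_student, d)
        if own <= distilled_loss(own, teacher_loss):
            return c
    return None

# Example: 400M student, fixed teacher at loss 2.3, budgets 1e18..1e24.
budgets = [10 ** e for e in range(18, 25)]
c_star = crossover_flops(4e8, 2.3, budgets)
```

Under these assumptions, below `c_star` the student learns more per FLOP from the teacher; above it, plain supervised pretraining wins, mirroring the abstract's claim that the crossover grows predictably with student size.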