Towards the Law of Capacity Gap in Distilling Language Models

📅 2023-11-13
🏛️ arXiv.org
📈 Citations: 24
Influential: 2

🤖 AI Summary
Large language model (LLM) distillation suffers from the “capacity gap curse”, in which an excessively large teacher model degrades student performance. Method: This work proposes and empirically validates the “Capacity Gap Law”: the optimal teacher size scales almost consistently linearly with the student size. The law lets the teacher be chosen from a scaling relation rather than by heuristic trial and error. Contribution/Results: The linear relation holds across different model architectures and data scales. Applying it, the authors distill LLaMA2-7B into a 3B student, MiniMA-3B. MiniMA-3B outperforms a wide range of existing 3B models and competes with several 7B models. The Capacity Gap Law thus offers an empirical basis and a practical recipe for choosing the teacher in LLM distillation.
📝 Abstract
Language model (LM) distillation is an active research area that aims to distil the knowledge residing in a large teacher LM into a small student LM. While various methods have been proposed to maximize the effectiveness of distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the curse of capacity gap, means that a larger teacher does not necessarily yield a better student than a smaller teacher does. In other words, there is likely an optimal teacher that yields the best student as the teacher is scaled up. However, as previous studies indicate, the curse of capacity gap cannot be tackled without notable compute overhead. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, because it is an impossible triangle to distill an expected student from an optimal teacher with small compute overhead. Fortunately, the impossible triangle becomes possible given an induced law of capacity gap. In this paper, in the spirit of scaling laws, we reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law then guides us to distil a 3B student LM (termed MiniMA) from LLaMA2-7B. MiniMA is demonstrated to outperform a wide range of 3B competitors and can even compete with several 7B models.
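To make the idea of a linear law concrete, the following is a minimal sketch of how such a relation could be fitted on small-scale runs and then used to pick a teacher for a larger student. It is an illustration only, not the authors' procedure: the function names, the observation pairs, and the candidate teacher sizes are all invented for the example, and the resulting slope is not the coefficient reported in the paper.

```python
def fit_linear_law(pairs):
    """Ordinary least-squares fit of: optimal_teacher = slope * student + intercept.

    `pairs` is a list of (student_size, optimal_teacher_size) tuples,
    both in billions of parameters.
    """
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_t = sum(t for _, t in pairs) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in pairs)
    var = sum((s - mean_s) ** 2 for s, _ in pairs)
    slope = cov / var
    intercept = mean_t - slope * mean_s
    return slope, intercept


def pick_teacher(student_size, slope, intercept, available):
    """Return the available teacher size closest to the size the fit predicts."""
    target = slope * student_size + intercept
    return min(available, key=lambda size: abs(size - target))


# HYPOTHETICAL small-scale observations, invented for illustration:
# (student size, teacher size that gave the best student), in billions.
observations = [(0.1, 0.25), (0.2, 0.5), (0.4, 1.0)]
slope, intercept = fit_linear_law(observations)

# HYPOTHETICAL pool of teacher checkpoints, in billions of parameters.
candidates = [1.0, 3.0, 7.0, 13.0, 70.0]
best = pick_teacher(3.0, slope, intercept, candidates)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, teacher for 3B student: {best}B")
```

The point of the sketch is the workflow the abstract describes: the expensive search for the best teacher is run only at small scale, and the fitted line replaces that search at large scale.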
Problem

Research questions and friction points this paper is trying to address.

Identify optimal teacher size for LM distillation
Reduce computational cost in large LM distillation
Establish scaling law for teacher-student capacity gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear scaling law for optimal teacher selection
Small-scale LM distillation preliminary study
MiniMA, a 3B LM that outperforms 3B competitors and competes with several 7B models