MiniPLM: Knowledge Distillation for Pre-Training Language Models

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
Knowledge distillation (KD) for pre-training language models faces several challenges: high computational overhead from online teacher inference, strict tokenizer alignment requirements between teacher and student, and difficulty preserving the difficulty and diversity of the training data. This paper proposes MiniPLM, a pre-training KD framework that refines the training data distribution with the teacher LM's knowledge. Teacher inference is performed offline, eliminating reliance on real-time inference during student training and relaxing architecture and tokenizer compatibility constraints, which enables distillation across model families. MiniPLM further exploits the capability gap between large and small LMs to increase the difficulty and diversity of the training data. Evaluated on nine downstream tasks, the distilled student models achieve significant performance gains, improved language modeling capability, and reduced pre-training compute, and the benefits extend to larger training scales.

📝 Abstract
Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces efficiency, flexibility, and effectiveness issues. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. In this work, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher LM's knowledge. For efficiency, MiniPLM performs offline teacher inference, allowing KD for multiple student LMs without adding training costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the training data difficulty and diversity, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to larger training scales, evidenced by the scaling curve extrapolation. Further analysis reveals that MiniPLM supports KD across model families and enhances the pre-training data utilization. Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.
Problem

Research questions and friction points this paper is trying to address.

Efficient knowledge distillation for pre-training language models
Flexible KD across different model families
Enhancing training data difficulty and diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline teacher inference amortizes the teacher's cost across multiple student LMs without adding training overhead.
Operating solely on the training corpus enables knowledge distillation across different model families.
Leveraging the differences between large and small LMs enhances training data difficulty and diversity.
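As a rough illustration of the data-refinement idea described above (a sketch, not the authors' implementation), one can score each document by the gap between the teacher LM's log-likelihood and a small reference LM's log-likelihood, both computed offline, and keep the top-scoring fraction; documents the teacher finds plausible but the small model finds hard are the difficult, diverse ones worth training on. All function names and the toy scores here are hypothetical.

```python
def difference_sample(docs, teacher_logprob, ref_logprob, keep_ratio=0.5):
    """Hypothetical sketch of difference-based data refinement.

    Scores each document by teacher_logprob(d) - ref_logprob(d)
    (both assumed precomputed offline) and keeps the highest-scoring
    fraction of the corpus.
    """
    scored = [(teacher_logprob(d) - ref_logprob(d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return [d for _, d in scored[:n_keep]]

# Toy stand-ins for offline per-document log-likelihoods (invented values).
docs = ["easy common text", "hard diverse text", "noisy text"]
teacher = {"easy common text": -1.0, "hard diverse text": -2.0, "noisy text": -8.0}.get
reference = {"easy common text": -1.2, "hard diverse text": -5.0, "noisy text": -8.5}.get

refined = difference_sample(docs, teacher, reference, keep_ratio=0.5)
# Keeps the document with the largest teacher-reference gap.
```

Because the scores are computed once offline, the same refined corpus can be reused to pre-train any number of student LMs, regardless of their tokenizer or architecture.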
Yuxian Gu
Tsinghua University
Natural Language Processing
Hao Zhou
WeChat AI, Tencent Inc., China
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Jie Zhou
WeChat AI, Tencent Inc., China
Minlie Huang
The CoAI Group, Tsinghua University