Curriculum Learning-Guided Progressive Distillation in Large Language Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of existing knowledge distillation methods for large language models: they neglect the order of training data and the capacity mismatch between teacher and student, so stronger teachers often fail to produce better students. To overcome this, the authors propose Curriculum Learning-guided Progressive Distillation (CLPD), a framework that, for the first time, jointly models data difficulty and teacher capability. CLPD integrates an explicit data curriculum with an implicit teacher scheduling mechanism within a unified framework. By modularly combining curriculum learning, progressive distillation, and dynamic teacher scheduling, the framework consistently outperforms standard distillation and ablated variants across multiple reasoning benchmarks, indicating that jointly considering data ordering and teacher scheduling is important for effective knowledge transfer.

📝 Abstract

Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.
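The pairing of an easy-to-hard data curriculum with weak-to-strong teachers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how difficulty is measured or how stage boundaries are chosen, so the `difficulty` scoring function, the equal-sized contiguous stages, the function name `build_clpd_schedule`, and the teacher labels are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_clpd_schedule(
    examples: Sequence[T],
    difficulty: Callable[[T], float],
    teachers: Sequence[str],
) -> List[Tuple[str, List[T]]]:
    """Pair easy-to-hard data stages with weak-to-strong teachers.

    Assumptions (not taken from the paper): `teachers` is already sorted
    from weakest to strongest, and the sorted data is split into
    len(teachers) contiguous stages of near-equal size.
    """
    if not teachers:
        raise ValueError("need at least one teacher")
    # Explicit curriculum: order examples from easy to hard.
    ordered = sorted(examples, key=difficulty)
    n, k = len(ordered), len(teachers)
    schedule = []
    for i, teacher in enumerate(teachers):
        # Implicit curriculum: stage i is supervised by the i-th teacher.
        # Integer arithmetic spreads any remainder across the stages.
        start, end = i * n // k, (i + 1) * n // k
        schedule.append((teacher, ordered[start:end]))
    return schedule


# Toy usage: prompt length stands in for difficulty (a placeholder only).
prompts = [
    "2+2",
    "solve x^2 - 5x + 6 = 0",
    "3*7+1",
    "prove sqrt(2) is irrational",
]
schedule = build_clpd_schedule(
    prompts, difficulty=len, teachers=["small-teacher", "large-teacher"]
)
for teacher, stage in schedule:
    print(teacher, stage)
```

In a full pipeline, each stage would then be passed to whichever distillation objective is already in use, with the named teacher supplying the supervision signal for that stage. The sketch covers only the scheduling step.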

Problem

Research questions and friction points this paper is trying to address.

knowledge distillation

curriculum learning

teacher-student capacity mismatch

data ordering

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum Learning

Progressive Distillation

Knowledge Distillation