Language Model Distillation: A Temporal Difference Imitation Learning Perspective

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low efficiency and incomplete knowledge transfer of conventional behavior cloning for large language model (LLM) distillation, this paper proposes a distillation framework based on temporal-difference (TD) imitation learning. Methodologically, it is the first to formulate LM distillation as a TD imitation learning problem under distributional sparsity: because teacher models concentrate most of their probability mass on a small subset of tokens, reinforcement-learning-style objectives can be constructed over dynamically pruned vocabularies, combining vocabulary subset sampling with TD error minimization. This moves beyond static KL-divergence matching and enables more faithful, more efficient knowledge transfer. Empirically, across multiple benchmarks, student models of equal parameter count outperform those trained with standard knowledge distillation while improving training efficiency by over 30%.
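The pipeline the summary describes (prune the vocabulary to the teacher's high-mass tokens, then minimize a TD error against a bootstrapped target) can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: treating student logits as Q-values, using the teacher's log-probability as a per-step reward, taking a soft (logsumexp) value over the pruned subset, and the top-p pruning rule itself are all assumptions made for the sketch.

```python
# Hedged sketch of TD-style distillation over a pruned vocabulary.
# Assumptions (not from the paper): student logits act as Q-values, the
# teacher's log-prob of the taken token is the reward, and the TD target
# bootstraps from a soft value computed over the teacher's top-p tokens.
import numpy as np

def top_p_indices(probs, p=0.9):
    """Indices of the smallest token subset covering >= p probability mass."""
    order = np.argsort(probs)[::-1]
    keep = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:keep]

def td_distill_loss(student_logits, teacher_probs, tokens, gamma=0.95, p=0.9):
    """Mean squared TD error along one sampled token sequence.

    student_logits: (T, V) student logits per step.
    teacher_probs:  (T, V) teacher next-token probabilities per step.
    tokens:         (T,) token ids actually taken.
    """
    errors = []
    for t in range(len(tokens) - 1):
        # Reduced action space: only the teacher's high-mass tokens next step.
        subset_next = top_p_indices(teacher_probs[t + 1], p)
        q = student_logits[t, tokens[t]]                      # Q(s_t, a_t)
        # Soft value of the next state, restricted to the pruned vocabulary.
        v_next = np.log(np.exp(student_logits[t + 1, subset_next]).sum())
        reward = np.log(teacher_probs[t, tokens[t]] + 1e-12)  # teacher-derived reward
        errors.append((q - (reward + gamma * v_next)) ** 2)
    return float(np.mean(errors))

# Toy usage with random stand-ins for real model outputs.
rng = np.random.default_rng(1)
T, V = 8, 50
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V)) * 3.0
teacher_probs = np.exp(teacher_logits)
teacher_probs /= teacher_probs.sum(axis=1, keepdims=True)
tokens = teacher_probs.argmax(axis=1)
loss = td_distill_loss(student_logits, teacher_probs, tokens)
```

Because the logsumexp runs only over the pruned subset, each TD update costs a fraction of a full-vocabulary pass, which is where the efficiency gain in this formulation comes from.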

📝 Abstract
Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, language models are often observed to assign most of their probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of the vocabulary), and demonstrate how practical algorithms can be derived from it, along with the resulting performance improvements.
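The distributional sparsity the abstract relies on is easy to demonstrate: with a Zipf-like toy distribution standing in for a real teacher model, only a handful of tokens out of a 32,000-token vocabulary cover 90% of the probability mass. The vocabulary size, the exponent, and the top-p threshold below are illustrative choices, not values from the paper.

```python
# Illustrative sketch: how small a token subset covers most probability mass.
import numpy as np

def top_p_subset(probs, p=0.9):
    """Indices of the smallest token subset covering >= p probability mass."""
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    return order[:cutoff]

vocab_size = 32000
# A Zipf-like (power-law) distribution mimics the peaked output of an LM head.
weights = np.arange(1, vocab_size + 1, dtype=np.float64) ** -2.0
probs = weights / weights.sum()

subset = top_p_subset(probs, p=0.9)
print(len(subset), "of", vocab_size, "tokens cover 90% of the mass")
```

Under this toy distribution the subset is orders of magnitude smaller than the vocabulary, which is exactly what makes a reduced action space viable for TD-style objectives.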
Problem

Research questions and friction points this paper is trying to address.

Compress large language models into smaller, efficient ones
Improve distillation using temporal difference learning framework
Leverage teacher model's sparse token distribution for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal difference learning framework
Reduced action space utilization
Distributional sparsity exploitation