Distillation of Large Language Models via Concrete Score Matching

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Softmax smoothing in large language model (LLM) knowledge distillation obscures fine-grained logit details, while direct logit distillation ignores the shift invariance of logits, degrading alignment fidelity. Method: We propose Concrete Score Distillation (CSD), the first framework to introduce discrete score matching into LLM distillation. CSD models relative logit differences between token pairs to enable fine-grained student–teacher alignment at the logit level. It employs a logit-level score-matching objective, augmented with stable training strategies and a linear-complexity approximation for efficient integration into autoregressive distillation pipelines. Contribution/Results: CSD relaxes restrictive constraints on the optimal solution space imposed by conventional objectives, enabling flexible weighting for improved mode coverage and diversity preservation. Experiments on GPT-2-1.5B, OpenLLaMA-7B, and Gemma-7B-IT demonstrate significant gains over state-of-the-art distillation methods, with strong scalability and complementary benefits when combined with online policy optimization.

📝 Abstract
Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation.
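The objective described above can be illustrated with a small sketch. Under the simplifying assumption of uniform weighting over vocabulary pairs (the paper's flexible weighting is omitted), matching relative logit differences across all pairs admits a closed-form linear-time rewrite, since the pairwise sum of squared differences of d = student − teacher collapses to moments of d. The function names below (`csd_loss_naive`, `csd_loss_linear`) are illustrative, not from the paper.

```python
import torch

def csd_loss_naive(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """O(V^2) reference: match relative logit differences (z_i - z_j)
    between student and teacher over all vocabulary pairs (i, j)."""
    ds = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # (..., V, V)
    dt = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)  # (..., V, V)
    V = student_logits.shape[-1]
    return ((ds - dt) ** 2).sum(dim=(-1, -2)) / (V * V)

def csd_loss_linear(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """O(V) equivalent: with d = s - t,
    sum_{i,j} (d_i - d_j)^2 = 2V * sum_i d_i^2 - 2 * (sum_i d_i)^2."""
    d = student_logits - teacher_logits
    V = d.shape[-1]
    return (2 * V * (d ** 2).sum(-1) - 2 * d.sum(-1) ** 2) / (V * V)
```

Note that the loss depends only on d up to a constant shift, which is the shift invariance of logits that direct logit distillation fails to respect.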
Problem

Research questions and friction points this paper is trying to address.

Softmax-based probability matching blurs fine-grained logit information
Direct logit distillation ignores logit shift invariance, restricting the solution space
Discrete score matching in autoregressive LLMs suffers training instability and quadratic complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concrete Score Distillation: a discrete score-matching objective at the logit level
Aligns relative logit differences across all vocabulary pairs with flexible weighting
Stable training strategies and a linear-complexity approximation for autoregressive LLMs
Yeongmin Kim
Korea Advanced Institute of Science and Technology (KAIST)
Donghyeok Shin
Korea Advanced Institute of Science and Technology (KAIST)
Mina Kang
Korea Advanced Institute of Science and Technology (KAIST)
Byeonghu Na
Korea Advanced Institute of Science and Technology (KAIST)
Generative Model · Diffusion Model
Il-Chul Moon
Professor, Department of Industrial and Systems Engineering, KAIST
Modeling and Simulation · Artificial Intelligence