A Study on Hidden Layer Distillation for Large Language Model Pre-Training

📅 2026-05-12

🤖 AI Summary
This study examines hidden-layer distillation (HLD) for pre-training decoder-only large language models. Existing knowledge distillation approaches rely predominantly on output logits and largely neglect the semantic information in the teacher's intermediate representations. The authors run compute-controlled pre-training experiments on the C4 dataset, using up to 168B tokens, with a Gemma3 3.4B teacher and student models of 123M and 735M parameters, and compare HLD against logit-based distillation and self-supervised baselines. HLD can yield a systematic perplexity gain over logit-based distillation across all shared-hyperparameter configurations. However, it does not consistently outperform logit-based distillation on downstream tasks, which suggests that a latent signal can be extracted from intermediate representations but that a breakthrough may be needed before it plays a larger role in LLM pre-training.
📝 Abstract
Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
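To make the contrast between logit-based KD and HLD concrete, the sketch below computes both kinds of loss for a single token position. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify the HLD objective, so the mean-squared error on a linearly projected student hidden state, the projection matrix `weight`, the mixing coefficient `beta`, and all function names are hypothetical choices made here.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary for one token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def project(hidden, weight):
    """Map a student hidden state to the teacher's width.

    `weight` has shape [teacher_dim][student_dim]; in practice it would be
    learned (assumption: the paper may align widths differently).
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def hidden_layer_loss(student_hidden, teacher_hidden, weight):
    """Assumed HLD objective: MSE between projected student and teacher states."""
    projected = project(student_hidden, weight)
    squared = [(a - b) ** 2 for a, b in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


def combined_loss(student_logits, teacher_logits,
                  student_hidden, teacher_hidden, weight, beta=1.0):
    """Logit KD plus a weighted hidden-layer term (beta is a hypothetical knob)."""
    kd = logit_kd_loss(student_logits, teacher_logits)
    hld = hidden_layer_loss(student_hidden, teacher_hidden, weight)
    return kd + beta * hld
```

Setting `beta=0.0` recovers plain logit-based KD, which is one simple way to compare the two objectives under otherwise shared hyperparameters.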
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Hidden Layer Distillation
Large Language Models
Pre-Training
Intermediate Representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hidden Layer Distillation
Large Language Models
Knowledge Distillation
Decoder-only Pre-training
Perplexity Gain