🤖 AI Summary
This study systematically investigates training data memorization in knowledge distillation for language models and its implications for privacy and generalization. The authors show that distilled models exhibit over 50% lower memorization rates than standard fine-tuned counterparts, with more than 95% of memorized content attributable to a small subset of highly memorable samples. Through experiments across the Pythia, OLMo-2, and Qwen-3 model families trained on the FineWeb, Wikitext, and Nemotron-CC-v2 datasets, they find that hard distillation carries a 2.7× higher risk of inheriting teacher-specific memorized samples than soft distillation. Using features based on zlib entropy, KL divergence, and perplexity, the work further introduces a method that predicts a sample's memorization propensity before distillation. These findings indicate that knowledge distillation not only enhances generalization but also significantly mitigates privacy risks associated with data memorization.
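The predictability features named above can be illustrated with a minimal sketch. The paper's exact feature pipeline is not given here; the functions below are hypothetical stand-ins showing how zlib entropy (compressed bytes per raw byte, where repetitive, easily memorized text compresses well), perplexity (from per-token log-probabilities, assumed precomputed by a language model), and KL divergence (here over two explicit probability vectors, e.g. teacher vs. student next-token distributions) could each be computed:

```python
import math
import zlib

def zlib_entropy(text: str) -> float:
    """Compressed size per byte; lower values mean more repetitive, compressible text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities (assumed given by an LM)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def memorization_features(text: str, token_logprobs: list[float]) -> dict[str, float]:
    """Toy feature vector; a predictor could be fit on such features before distillation."""
    return {
        "zlib_entropy": zlib_entropy(text),
        "perplexity": perplexity(token_logprobs),
    }
```

In practice such features would feed a lightweight classifier or regressor that flags high-memorization-propensity samples before any distillation run.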
📝 Abstract
Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over 95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits $2.7\times$ more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
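The soft/hard distinction the abstract draws can be sketched for a single token position: soft distillation trains the student against the teacher's full output distribution (e.g., via KL divergence), while hard distillation treats the teacher's argmax token as a ground-truth label. The toy losses below use pure Python over raw logit lists; they are illustrative assumptions, not the paper's implementation (which would operate on full sequences and batches):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a single logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distill_loss(teacher_logits: list[float], student_logits: list[float]) -> float:
    """Soft distillation: KL(teacher || student) over the full token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_distill_loss(teacher_logits: list[float], student_logits: list[float]) -> float:
    """Hard distillation: cross-entropy on the teacher's argmax token as a pseudo-label."""
    label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    q = softmax(student_logits)
    return -math.log(q[label])
```

The hard variant discards everything except the teacher's top-1 choice, which is one intuition for why it can transmit teacher-specific memorized sequences more directly than the softened distribution.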