Self-Data Distillation for Recovering Quality in Pruned Large Language Models

📅 2024-10-13
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the significant performance degradation that follows structured pruning of large language models (LLMs), and the catastrophic forgetting induced by supervised fine-tuning (SFT), this paper proposes self-data distilled fine-tuning. The method leverages the original, unpruned model to autonomously generate semantically rich, distribution-aligned distillation data, enabling knowledge alignment and quality recovery for the pruned model. This work is the first to integrate self-data distillation into post-pruning fine-tuning, achieving model compression and distribution stability simultaneously without external annotated data, while also supporting model merging and speculative decoding. Evaluated on Llama3.1-8B Instruct with six decoder layers pruned, the approach retains 91.2% of the original model's accuracy (vs. 81.7% for the SFT baseline), improves average accuracy by up to 8% on the Hugging Face OpenLLM Leaderboard, and reduces measured FLOPs by 16.3%.

๐Ÿ“ Abstract
Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model's learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model's quality. In this work, we utilize self-data distilled fine-tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model's knowledge. Empirically, we demonstrate that self-data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning six decoder blocks on Llama3.1-8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model's accuracy compared to 81.7% with SFT, while reducing real-world FLOPs by 16.3%. Furthermore, combining self-data distilled models through model merging yields enhanced quality retention. Additionally, leveraging these pruned models in speculative decoding increases token acceptance rates, thereby improving inference efficiency in applied settings.
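The self-data distillation recipe described in the abstract can be sketched in miniature: the unpruned model regenerates the training targets itself, so the fine-tuning data stays aligned with its learned distribution, and the pruned model is then fine-tuned on those self-generated targets. The snippet below is a toy illustration with hypothetical stand-in functions, not the paper's actual training code (which fine-tunes Llama3.1-8B Instruct with gradient updates).

```python
def original_model(prompt: str) -> str:
    # Stand-in for the unpruned teacher: it rewrites each seed prompt's
    # answer in its own output distribution. Here, a fixed lookup table.
    responses = {
        "2+2=": "2 + 2 equals 4.",
        "Capital of France?": "The capital of France is Paris.",
    }
    return responses[prompt]

def build_self_distilled_dataset(seed_prompts):
    # Step 1: the original model generates the distillation targets itself,
    # keeping the fine-tuning data aligned with the base model's knowledge
    # (this alignment is what mitigates catastrophic forgetting).
    return [(p, original_model(p)) for p in seed_prompts]

def fine_tune(pruned_model_params: dict, dataset):
    # Step 2: fine-tune the pruned model on the self-generated targets.
    # Stand-in: memorize the pairs; a real run would take gradient steps.
    for prompt, target in dataset:
        pruned_model_params[prompt] = target
    return pruned_model_params

seeds = ["2+2=", "Capital of France?"]
dataset = build_self_distilled_dataset(seeds)
pruned = fine_tune({}, dataset)
print(pruned["2+2="])  # the pruned model now mirrors the teacher's answer
```

The key design point is that the targets come from the teacher's own generations rather than an external labeled dataset, so the pruned student is pulled back toward the original model's distribution instead of away from it.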
Problem

Research questions and friction points this paper is trying to address.

Recovering quality in pruned large language models
Mitigating catastrophic forgetting during supervised fine-tuning
Improving inference efficiency while retaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-data distilled fine-tuning for quality recovery
Original model generates distilled dataset
Combines pruning with speculative decoding
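The speculative-decoding benefit mentioned above comes from using the pruned, self-distilled model as a cheap draft model whose proposals the full model verifies; because the draft was distilled from the target's own outputs, more drafted tokens are accepted per verification step. A toy token-matching sketch (hypothetical stand-ins; real implementations compare per-token probabilities, not exact strings):

```python
def draft_tokens(k: int):
    # The pruned draft model cheaply proposes k tokens ahead.
    return ["the", "cat", "sat", "on"][:k]

def target_verifies(proposed):
    # The unpruned target model checks the proposals and keeps the longest
    # accepted prefix. A draft distilled from the target agrees more often,
    # raising the acceptance rate and thus decoding throughput.
    target_next = ["the", "cat", "slept"]
    accepted = []
    for d, t in zip(proposed, target_next):
        if d != t:
            break
        accepted.append(d)
    return accepted

proposed = draft_tokens(4)
accepted = target_verifies(proposed)
print(len(accepted) / len(proposed))  # acceptance rate: 0.5
```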
Vithursan Thangarasa
Principal Research Scientist, Cerebras Systems
ML · LLM · Computer Vision · Sparsity · AI Safety
Ganesh Venkatesh
Cerebras Systems, Sunnyvale, California
Nish Sinnadurai
Cerebras Systems, Sunnyvale, California
Sean Lie
Cerebras Systems, Sunnyvale, California