🤖 AI Summary
Full retraining after pruning large language models (LLMs) is infeasible under typical GPU memory and compute budgets. This paper therefore proposes an extremely sparse fine-tuning paradigm: updating only 0.01%-0.05% of the most expressive parameters suffices to restore, or even surpass, full-retraining performance. Key contributions include: (1) the first empirical demonstration that ultra-low-parameter updates can effectively substitute for full retraining; (2) two mergeable, sparsity-preserving LoRA variants; and (3) a memory-efficient layer-wise weight reconstruction mechanism for enhancing sparse models. On GPT architectures, pruning followed by fine-tuning completes in minutes on a single GPU, even for a 30B-parameter model. Across sparsity levels, the method matches or exceeds full-retraining accuracy and significantly improves on state-of-the-art retraining-free approaches such as Wanda and SparseGPT, while preserving sparsity and computational efficiency.
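The core idea above can be illustrated with a minimal NumPy sketch: magnitude-prune a toy weight matrix, then update only a tiny fraction of the surviving weights. The selection criterion here (gradient magnitude) and all numbers are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one LLM layer.
W = rng.normal(size=(64, 64))

# 1) Magnitude pruning: zero out the 50% smallest-magnitude weights.
sparsity = 0.5
threshold = np.quantile(np.abs(W), sparsity)
mask = np.abs(W) >= threshold                 # True = weight survives
W_pruned = W * mask

# 2) Ultra-sparse "retraining": update only ~0.05% of all parameters,
#    here ranked by a toy gradient magnitude restricted to kept weights.
grad = rng.normal(size=W.shape) * mask        # pruned positions get zero gradient
k = max(1, int(0.0005 * W.size))              # 0.05% of all parameters
top_idx = np.argpartition(np.abs(grad).ravel(), -k)[-k:]

lr = 0.1
update = np.zeros(W.size)
update[top_idx] = -lr * grad.ravel()[top_idx]
W_updated = W_pruned + update.reshape(W.shape)

# The update never touches pruned positions, so sparsity is preserved.
assert (W_updated[~mask] == 0).all()
```

Because only `k` scalar parameters receive gradients and optimizer state, both the backward pass and the optimizer footprint shrink accordingly, which is what makes single-GPU retraining of a 30B model plausible.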
📝 Abstract
Neural networks can be effectively compressed through pruning, significantly reducing storage and compute demands while maintaining predictive performance. Simple yet effective methods like magnitude pruning remove the least important parameters but typically require a costly retraining procedure to restore performance. With the rise of LLMs, however, full retraining has become infeasible due to memory and compute constraints. This study challenges the practice of retraining all parameters by showing that updating a small subset of highly expressive parameters can suffice to recover or even enhance performance after pruning. Surprisingly, retraining just 0.01%-0.05% of the parameters in GPT architectures can match the performance of full retraining across various sparsity levels, significantly reducing compute and memory requirements, and enabling retraining of models with up to 30 billion parameters on a single GPU in minutes. To bridge the gap to full retraining in the high-sparsity regime, we introduce two novel LoRA variants that, unlike standard LoRA, allow merging adapters back without compromising sparsity. Going a step further, we show that these methods can be applied for memory-efficient layer-wise reconstruction, significantly enhancing state-of-the-art retraining-free methods like Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023). Our findings present a promising alternative to full retraining.
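To see why standard LoRA conflicts with sparsity, note that merging the low-rank product `B @ A` into a pruned weight matrix fills in the zeroed positions. One natural sparsity-preserving fix is to project the low-rank update onto the pruning mask before merging; the sketch below shows that contrast under this assumption, and is not the paper's exact formulation of its two variants:

```python
import numpy as np

rng = np.random.default_rng(1)

d, r = 64, 4                                  # layer width, LoRA rank (r << d)
W = rng.normal(size=(d, d))
mask = rng.random((d, d)) > 0.5               # pruning mask (True = kept)
W_sparse = W * mask

# Low-rank adapter as in standard LoRA: delta = B @ A.
A = rng.normal(size=(r, d)) * 0.01
B = rng.normal(size=(d, r)) * 0.01

# Standard merge destroys sparsity: B @ A is dense, so pruned
# positions become nonzero after merging.
dense_merge = W_sparse + B @ A
assert (dense_merge[~mask] != 0).any()

# Sparsity-preserving merge: mask the low-rank update, so pruned
# weights stay exactly zero after the adapter is folded in.
sparse_merge = W_sparse + (B @ A) * mask
assert (sparse_merge[~mask] == 0).all()
```

For the masked merge to be exact rather than an approximation, the adapter's forward pass during fine-tuning must already apply the mask to its contribution, so that training and merged inference compute the same function.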