PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs

📅 2023-12-23
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
To address the infeasibility of full retraining after pruning large language models (LLMs) due to GPU memory and computational constraints, this paper proposes a retraining-free, extremely sparse fine-tuning paradigm: updating only 0.01%–0.05% of the most expressive parameters suffices to restore—or even surpass—full-retraining performance. Key contributions include: (1) the first empirical demonstration that ultra-low-parameter updates can effectively substitute for full retraining; (2) two mergeable, sparsity-preserving LoRA variants; and (3) a memory-efficient layer-wise weight reconstruction mechanism for enhancing sparse models. On GPT architectures, pruning followed by fine-tuning completes in minutes on a single GPU, even for a 30B-parameter model. Across sparsity levels, the method matches or exceeds full-retraining accuracy and significantly outperforms baseline approaches—including Wanda and SparseGPT—while maintaining sparsity and computational efficiency.
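The core idea above—recover a pruned network by updating only a tiny fraction of its weights—can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the choice of "most expressive parameters" here (largest-magnitude surviving weights) is an assumption for demonstration purposes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one LLM layer.
W = rng.normal(size=(64, 64))

# 1) Magnitude pruning: zero out the 90% smallest-magnitude weights.
sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
prune_mask = np.abs(W) > threshold          # True = weight survives
W_pruned = W * prune_mask

# 2) Pick a tiny trainable subset among the surviving weights
#    (here: the largest-magnitude 0.05% of all entries -- an
#    illustrative proxy for "most expressive parameters").
k = max(1, int(0.0005 * W.size))
flat = np.abs(W_pruned).ravel()
train_idx = np.argpartition(flat, -k)[-k:]
train_mask = np.zeros(W.size, dtype=bool)
train_mask[train_idx] = True
train_mask = train_mask.reshape(W.shape) & prune_mask

# 3) A gradient step that touches only that subset, so the
#    pruned positions stay exactly zero.
grad = rng.normal(size=W.shape)             # stand-in for a real loss gradient
lr = 0.01
W_updated = W_pruned - lr * grad * train_mask

assert np.all(W_updated[~prune_mask] == 0)          # sparsity preserved
assert np.count_nonzero(W_updated != W_pruned) <= k  # at most k weights moved
```

Because the update mask is fixed, optimizer state is only needed for the k trainable entries, which is what makes retraining a 30B-parameter model fit on a single GPU.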
📝 Abstract
Neural networks can be effectively compressed through pruning, significantly reducing storage and compute demands while maintaining predictive performance. Simple yet effective methods like magnitude pruning remove less important parameters and typically require a costly retraining procedure to restore performance. However, with the rise of LLMs, full retraining has become infeasible due to memory and compute constraints. This study challenges the practice of retraining all parameters by showing that updating a small subset of highly expressive parameters can suffice to recover or even enhance performance after pruning. Surprisingly, retraining just 0.01%–0.05% of the parameters in GPT architectures can match the performance of full retraining across various sparsity levels, significantly reducing compute and memory requirements, and enabling retraining of models with up to 30 billion parameters on a single GPU in minutes. To bridge the gap to full retraining in the high-sparsity regime, we introduce two novel LoRA variants that, unlike standard LoRA, allow merging adapters back without compromising sparsity. Going a step further, we show that these methods can be applied for memory-efficient layer-wise reconstruction, significantly enhancing state-of-the-art retraining-free methods like Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023). Our findings present a promising alternative to full retraining after pruning.
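Why standard LoRA is problematic here: merging the adapter back computes W + BA, and since BA is generically dense, the merge destroys the sparsity that pruning created. One plausible way to preserve it—shown below as a hedged sketch, not the paper's exact variant—is to project the low-rank update onto the pruning mask before merging.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4

# A magnitude-pruned base weight with ~80% zeros.
W = rng.normal(size=(d, d))
mask = np.abs(W) > np.quantile(np.abs(W), 0.8)
W_sparse = W * mask

# Low-rank LoRA factors (as if learned during retraining).
B = rng.normal(size=(d, r)) * 0.1
A = rng.normal(size=(r, d)) * 0.1

# Standard LoRA merge: W + B @ A is dense, so sparsity is destroyed.
dense_merge = W_sparse + B @ A
assert np.count_nonzero(dense_merge) > np.count_nonzero(W_sparse)

# Sparsity-preserving merge (illustrative variant): restrict the
# low-rank update to the surviving positions before adding it back.
sparse_merge = W_sparse + (B @ A) * mask
assert np.count_nonzero(sparse_merge) <= np.count_nonzero(mask)
```

The masked merge keeps the pruned positions at exactly zero, so the merged model retains the storage and inference benefits of sparsity while still absorbing the adapter's correction.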
Problem

Research questions and friction points this paper is trying to address.

Full retraining after pruning is infeasible for LLMs
Reducing the compute and memory cost of performance recovery
Recovering pruned-model performance with minimal parameter updates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective retraining of a tiny, highly expressive parameter subset
Novel mergeable LoRA variants that preserve sparsity
Memory-efficient layer-wise reconstruction enhancing Wanda and SparseGPT
Max Zimmer
Zuse Institute Berlin
Deep Learning · Optimization · Mathematics
Megi Andoni
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany
Christoph Spiegel
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany
Sebastian Pokutta
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany; Institute of Mathematics, Technische Universität Berlin, Germany