SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

📅 2024-05-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the significant accuracy degradation in sparse pretraining of large language models (LLMs), this paper proposes SLoPe, a pretraining framework that combines double-pruned N:M structured sparsity with lazy low-rank adapters. SLoPe injects LoRA modules only during the final 1% of pretraining iterations, adding negligible overhead, and introduces a double-pruned backward pass that applies N:M structured pruning to the transposed weight matrix, so the sparse backward pass and the low-rank updates can be optimized together efficiently. On OPT-33B and OPT-66B, SLoPe achieves up to 1.25× training speedup and 1.54× inference speedup, while reducing training and inference memory consumption to 0.63× and 0.61× of the dense baselines, respectively, with accuracy close to dense models. The core contribution is a unified framework that jointly balances structural sparsity, computational efficiency, and modeling capacity.
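The "double-pruned" backward pass described above hinges on N:M structured sparsity: in every group of M consecutive weights, at most N are kept, and the pruning is applied to both the weight matrix (forward pass) and its transpose (backward pass). A minimal NumPy sketch of magnitude-based 2:4 pruning, with hypothetical function names (the paper does not release this exact API):

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m
    consecutive entries along each row (N:M structured sparsity)."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be divisible by m"
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))

W_sparse = prune_n_m(W)       # sparse matmul in the forward pass
Wt_sparse = prune_n_m(W.T)    # transposed weights pruned separately
                              # for the sparse backward pass
```

Pruning `W` and `W.T` independently is what makes the method "double-pruned": both the forward and backward matrix multiplications then have the structured sparsity pattern that sparse hardware kernels (e.g. 2:4 sparse Tensor Cores) can accelerate.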

📝 Abstract
We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% iterations of pretraining without adding significant overheads to the model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters up to $1.25\times$ and $1.54\times$ respectively (OPT-33B and OPT-66B) while reducing their memory usage by up to $0.63\times$ and $0.61\times$ for training and inference respectively.
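The "lazy" adapter idea in the abstract can be sketched as a linear layer whose low-rank correction is switched on only for the last 1% of training steps. The class and method names below are hypothetical illustrations of the described mechanism, not the paper's released code; initializing B to zero (a standard LoRA convention) keeps the output unchanged at the moment the adapter is injected:

```python
import numpy as np

class LazyLoRALinear:
    """Sparse linear layer that gains a low-rank adapter late in
    pretraining. Hypothetical sketch of the mechanism SLoPe describes."""

    def __init__(self, w_sparse, rank=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w_sparse.shape
        self.w = w_sparse                          # pruned base weights
        self.a = rng.normal(scale=0.01, size=(rank, d_in))
        self.b = np.zeros((d_out, rank))           # B = 0, so BA = 0 at injection
        self.adapter_on = False

    def maybe_enable(self, step, total_steps):
        # inject the adapter only during the final 1% of iterations
        if step >= int(0.99 * total_steps):
            self.adapter_on = True

    def forward(self, x):
        y = self.w @ x
        if self.adapter_on:
            y = y + self.b @ (self.a @ x)          # low-rank correction
        return y
```

Because the adapter is active for only 1% of the iterations (and `B @ (A @ x)` is cheap at low rank), it adds little pretraining cost, while the extra trainable parameters recover accuracy lost to sparsity.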
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Performance Optimization
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLoPe
Efficient Large Language Models
Performance Optimization