SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

📅 2024-05-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the significant accuracy degradation in sparse pretraining of large language models (LLMs), this paper proposes SLoPe, a pretraining framework that combines double-pruned N:M structured sparsity with lazy low-rank adapters. SLoPe injects LoRA modules only during the final 1% of pretraining iterations, adding negligible overhead, and introduces a double-pruned backward pass that applies N:M structured pruning to the transposed weight matrix, so the sparse backward pass and the low-rank updates can be optimized together efficiently. On OPT-33B and OPT-66B, SLoPe achieves up to 1.25× training speedup and 1.54× inference speedup, while reducing training and inference memory consumption to 0.63× and 0.61× of the dense baselines, respectively, with accuracy close to dense models. The core contribution is a unified framework that jointly balances structural sparsity, computational efficiency, and modeling capacity.
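The "double-pruned" backward pass described above hinges on N:M structured sparsity: in every group of M consecutive weights, at most N are kept, and the pruning is applied to both the weight matrix (forward pass) and its transpose (backward pass). A minimal NumPy sketch of magnitude-based 2:4 pruning, with hypothetical function names (the paper does not release this exact API):

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m
    consecutive entries along each row (N:M structured sparsity)."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be divisible by m"
    groups = w.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))

W_sparse = prune_n_m(W)       # sparse matmul in the forward pass
Wt_sparse = prune_n_m(W.T)    # transposed weights pruned separately
                              # for the sparse backward pass
```

Pruning `W` and `W.T` independently is what makes the method "double-pruned": both the forward and backward matrix multiplications then have the structured sparsity pattern that sparse hardware kernels (e.g. 2:4 sparse Tensor Cores) can accelerate.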

📝 Abstract
We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% iterations of pretraining without adding significant overheads to the model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters up to $1.25\times$ and $1.54\times$ respectively (OPT-33B and OPT-66B) while reducing their memory usage by up to $0.63\times$ and $0.61\times$ for training and inference respectively.
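The "lazy" adapter idea in the abstract can be sketched as a linear layer whose low-rank correction is switched on only for the last 1% of training steps. The class and method names below are hypothetical illustrations of the described mechanism, not the paper's released code; initializing B to zero (a standard LoRA convention) keeps the output unchanged at the moment the adapter is injected:

```python
import numpy as np

class LazyLoRALinear:
    """Sparse linear layer that gains a low-rank adapter late in
    pretraining. Hypothetical sketch of the mechanism SLoPe describes."""

    def __init__(self, w_sparse, rank=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w_sparse.shape
        self.w = w_sparse                          # pruned base weights
        self.a = rng.normal(scale=0.01, size=(rank, d_in))
        self.b = np.zeros((d_out, rank))           # B = 0, so BA = 0 at injection
        self.adapter_on = False

    def maybe_enable(self, step, total_steps):
        # inject the adapter only during the final 1% of iterations
        if step >= int(0.99 * total_steps):
            self.adapter_on = True

    def forward(self, x):
        y = self.w @ x
        if self.adapter_on:
            y = y + self.b @ (self.a @ x)          # low-rank correction
        return y
```

Because the adapter is active for only 1% of the iterations (and `B @ (A @ x)` is cheap at low rank), it adds little pretraining cost, while the extra trainable parameters recover accuracy lost to sparsity.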
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Performance Optimization
Resource Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLoPe
Efficient Large Language Models
Performance Optimization