🤖 AI Summary
To address a key limitation of heuristic-based pruning methods for large language models (LLMs), namely the difficulty of achieving high sparsity while preserving model performance, this paper proposes ALPS, an optimization-based one-shot pruning framework that is effective in highly sparse regimes. The method integrates operator splitting with a preconditioned conjugate gradient-based post-processing step and comes with theoretical convergence guarantees, while leveraging GPU parallelism and vectorized computation for efficient sparse weight updates. On OPT-30B at 70% sparsity, ALPS reduces WikiText test perplexity by 13% and improves zero-shot benchmark performance by 19% relative to state-of-the-art heuristic pruning techniques. This work establishes a principled, scalable, and computationally efficient approach to high-sparsity LLM pruning.
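To make the operator-splitting idea concrete, the sketch below shows an ADMM-style splitting applied to a generic layer-wise pruning objective: minimize ‖XW − XW₀‖²_F subject to a sparsity budget on W, alternating a smooth least-squares update with a hard-thresholding projection. This is a minimal illustration of the technique named in the summary, not the paper's actual algorithm; the function name, hyperparameters, and the specific top-k projection are assumptions for the example.

```python
import numpy as np

def prune_layer_admm(X, W0, sparsity=0.7, rho=1.0, iters=50):
    """Illustrative ADMM-style operator splitting for layer-wise pruning:
    minimize ||X W - X W0||_F^2 subject to a global sparsity budget on W.
    (A sketch of the general technique, not the paper's exact method.)"""
    d = W0.shape[0]
    H = X.T @ X                          # Hessian of the layer-wise loss
    G = H @ W0                           # linear term from the dense weights
    A = H + rho * np.eye(d)              # system matrix for the smooth step
    k = int((1 - sparsity) * W0.size)    # number of weights to keep
    Z = W0.copy()                        # sparse copy of the weights
    U = np.zeros_like(W0)                # scaled dual variable
    for _ in range(iters):
        # Smooth subproblem: closed-form ridge-style update of W
        W = np.linalg.solve(A, G + rho * (Z - U))
        V = W + U
        # Projection onto the sparsity constraint: keep top-k magnitudes
        thresh = np.partition(np.abs(V).ravel(), V.size - k)[V.size - k]
        Z = np.where(np.abs(V) >= thresh, V, 0.0)
        U += W - Z                       # dual ascent on the constraint W = Z
    return Z
```

The splitting decouples the quadratic data-fit term (solved exactly) from the combinatorial sparsity constraint (handled by a cheap projection), which is what makes the approach tractable at LLM scale.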
📝 Abstract
The impressive performance of Large Language Models (LLMs) across various natural language processing tasks comes at the cost of vast computational resources and storage requirements. One-shot pruning techniques offer a way to alleviate these burdens by removing redundant weights without the need for retraining. Yet, the massive scale of LLMs often forces current pruning approaches to rely on heuristics instead of optimization-based techniques, potentially resulting in suboptimal compression. In this paper, we introduce ALPS, an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step. Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency. ALPS substantially outperforms state-of-the-art methods in terms of the pruning objective and perplexity reduction, particularly for highly sparse models. On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
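The abstract's preconditioned conjugate gradient post-processing step can be illustrated as follows: once the pruning mask is fixed, the surviving weights of each output column are refit by solving the layer's normal equations restricted to the support. The sketch below uses a simple Jacobi (diagonal) preconditioner; the function name, the per-column formulation, and the choice of preconditioner are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def refit_support_pcg(X, W0, mask, tol=1e-8, iters=200):
    """Illustrative post-processing: with the sparsity mask fixed, refit
    each column's surviving weights by solving H w = g on the support via
    Jacobi-preconditioned conjugate gradient. (A sketch, not the paper's
    exact solver.)"""
    H = X.T @ X                          # shared Hessian of the layer loss
    G = H @ W0                           # targets from the dense weights
    W = np.zeros_like(W0)
    for j in range(W0.shape[1]):
        s = mask[:, j]                   # support of output column j
        if not s.any():
            continue
        A = H[np.ix_(s, s)]              # Hessian restricted to the support
        b = G[s, j]
        M = 1.0 / np.diag(A)             # Jacobi preconditioner
        w = np.zeros(s.sum())
        r = b - A @ w
        z = M * r
        p = z.copy()
        rz = r @ z
        for _ in range(iters):           # standard PCG iteration
            Ap = A @ p
            alpha = rz / (p @ Ap)
            w += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = M * r
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        W[s, j] = w
    return W
```

Because each column's system is independent, this refitting step is embarrassingly parallel across output dimensions, which matches the abstract's emphasis on vectorization and GPU parallelism.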