🤖 AI Summary
To address the challenges of deploying large language models (LLMs) on edge devices, namely stringent constraints on model size and computational resources, this paper proposes a layer-wise post-training compression method that jointly optimizes activation-aware weight quantization and structured pruning. The core contribution is the first formulation of activation-aware pruning as a sparse-constrained optimization problem, solved within a unified framework that combines projected gradient descent (PGD) and iterative hard thresholding (IHT) and comes with theoretical convergence guarantees. Leveraging activation statistics, the method co-optimizes weight sparsification and low-bit quantization across layers. Extensive experiments on multiple LLMs and standard benchmarks show that the approach significantly outperforms state-of-the-art pruning and quantization methods, retaining high accuracy even at aggressive compression ratios and making it well suited to resource-constrained edge deployment.
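The summary does not write out the sparse-constrained problem. A plausible layer-wise form, assuming the standard calibration-based compression objective (this exact formulation is an assumption, not quoted from the paper), with $W$ the original layer weights, $X$ a matrix of calibration activations, and $k$ the sparsity budget, is:

```latex
\min_{\widehat{W}} \; \bigl\| (\widehat{W} - W)\,X \bigr\|_F^2
\quad \text{subject to} \quad \|\widehat{W}\|_0 \le k .
```

The activation matrix $X$ weights each column of $W$ by how strongly the corresponding input channel fires on calibration data, which is what makes the pruning "activation-aware" rather than plain magnitude-based.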
📝 Abstract
To address the enormous size of Large Language Models (LLMs), model compression methods such as quantization and pruning are widely used, especially for deployment on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees for the pruning component are also provided.
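The abstract describes the method only at a high level. As a minimal sketch of the IHT-style loop it alludes to, the snippet below alternates a gradient step on the layer-wise calibration loss $\|(\widehat{W}-W)X\|_F^2$ with hard thresholding onto the $\ell_0$ ball (keep the top-$k$ magnitudes). The function name, step-size rule, and hyperparameters are illustrative assumptions, not the paper's actual algorithm, which also handles quantization and structured sparsity patterns.

```python
import numpy as np

def iht_prune(W, X, keep_frac=0.5, steps=200):
    """Sketch of activation-aware pruning via iterative hard thresholding.

    Minimizes ||(W_hat - W) X||_F^2 subject to keeping at most
    keep_frac * W.size nonzero weights. W: (out, in) weights;
    X: (in, n_samples) calibration activations; 0 < keep_frac <= 1.
    """
    H = X @ X.T                                   # activation second moment, (in, in)
    lr = 1.0 / (2.0 * np.linalg.norm(H, 2) + 1e-12)  # 1/L with L the gradient Lipschitz const.
    k = int(keep_frac * W.size)                   # number of weights to keep
    W_hat = W.copy()
    for _ in range(steps):
        grad = 2.0 * (W_hat - W) @ H              # gradient of the quadratic loss
        W_hat = W_hat - lr * grad                 # gradient step
        # projection onto the l0 ball: zero out all but the k largest magnitudes
        flat = np.abs(W_hat).ravel()
        thresh = np.partition(flat, flat.size - k)[flat.size - k]
        W_hat[np.abs(W_hat) < thresh] = 0.0
    return W_hat
```

With the step size at the inverse Lipschitz constant of the gradient, each gradient-plus-thresholding step is non-increasing in the calibration loss, so the result is never worse (on calibration data) than one-shot magnitude pruning; the paper's convergence analysis presumably formalizes guarantees of this kind.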