AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of deploying large language models (LLMs) on edge devices, namely stringent constraints on model size and computational resources, this paper proposes a layer-wise post-training compression method that jointly optimizes activation-aware weight quantization and structured pruning. The core contribution is the first formulation of activation-aware pruning as a sparsity-constrained optimization problem, solved within a unified framework that integrates projected gradient descent (PGD) and iterative hard thresholding (IHT) and comes with theoretical convergence guarantees. Leveraging activation statistics, the method co-optimizes weight sparsification and low-bit quantization layer by layer. Extensive experiments on multiple LLMs and standard benchmarks show that the proposed approach significantly outperforms state-of-the-art pruning and quantization methods, retaining high accuracy even under aggressive compression ratios, which makes it well suited for resource-constrained edge deployment.
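
As a rough illustration of the sparsity-constrained formulation described above, one plausible layer-wise reading of the activation-aware pruning objective is the following (the paper's exact constraint set, e.g. structured patterns or a per-row budget, may differ):

```latex
% Hypothetical layer-wise formulation: X are calibration activations,
% W the original weights of a linear layer, \widehat{W} the pruned weights,
% and k the sparsity budget.
\min_{\widehat{W}} \; \bigl\| X W^{\top} - X \widehat{W}^{\top} \bigr\|_F^2
\quad \text{subject to} \quad \bigl\| \widehat{W} \bigr\|_0 \le k
```

Projected gradient descent on this objective alternates a gradient step on the quadratic reconstruction loss with a projection onto the sparsity constraint, which is precisely the iterative hard thresholding pattern the summary refers to.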

📝 Abstract
To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
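
As a minimal sketch of what one layer-wise PGD/IHT pruning pass could look like in practice, assuming a least-squares reconstruction loss on calibration activations and a per-row top-k magnitude projection (function and variable names are illustrative, not the paper's implementation):

```python
# Minimal, hypothetical sketch of layer-wise activation-aware pruning via
# projected gradient descent with iterative hard thresholding (IHT).
# The step size, loss, and per-row top-k projection are illustrative
# assumptions, not the paper's exact algorithm.
import numpy as np

def prune_layer_iht(W, X, sparsity=0.5, steps=100):
    """Prune one linear layer (W: out x in) using calibration activations
    (X: samples x in) by approximately minimizing ||X W^T - X Wp^T||_F^2
    while keeping only a fraction (1 - sparsity) of weights in each row."""
    k = max(1, int(round(W.shape[1] * (1.0 - sparsity))))  # weights kept per row
    H = X.T @ X                                            # input correlation, (in x in)
    L = np.linalg.norm(H, 2)                               # spectral norm of H; gradient is 2L-Lipschitz
    lr = 0.5 / (L + 1e-12)                                 # step size 1/(2L)
    Wp = W.copy()
    for _ in range(steps):
        grad = 2.0 * (Wp - W) @ H          # gradient of the layer-wise reconstruction loss
        Wp = Wp - lr * grad                # gradient descent step
        if k < W.shape[1]:
            # Projection: keep the k largest-magnitude weights in each row
            drop = np.argpartition(np.abs(Wp), -k, axis=1)[:, :-k]
            np.put_along_axis(Wp, drop, 0.0, axis=1)
    return Wp

# Example usage on random data (for illustration only):
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
X = rng.standard_normal((256, 64))
W_pruned = prune_layer_iht(W, X, sparsity=0.5)
```

In a full pipeline this step would be applied layer by layer on a small calibration set, with low-bit quantization of the surviving weights handled by a further projection within the same framework.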
Problem

Research questions and friction points this paper is trying to address.

Compress Large Language Models for edge devices
Unify pruning and quantization via projected gradient descent
Improve performance over existing compression methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation-aware weight pruning via gradient descent
Unified quantization and pruning with projection
Layer-wise post-training compression optimization
Jing Liu
Mitsubishi Electric Research Laboratories (MERL)
T. Koike-Akino
Mitsubishi Electric Research Laboratories (MERL)
Ye Wang
Mitsubishi Electric Research Laboratories (MERL)
Hassan Mansour
Mitsubishi Electric Research Laboratories
Computational Imaging, Optimization, Sparse Recovery, Image/Video Analytics, Machine Learning
Matthew Brand
Mitsubishi Electric Research Laboratories (MERL)