🤖 AI Summary
To address the challenges of deploying large language models (LLMs) on edge devices, namely stringent constraints on model size and computational resources, this paper proposes a layer-wise post-training compression method that jointly optimizes activation-aware weight quantization and structured pruning. The core contribution is the first formulation of activation-aware pruning as a sparse-constrained optimization problem, solved within a unified framework that combines projected gradient descent (PGD) and iterative hard thresholding (IHT) and comes with theoretical convergence guarantees. Leveraging activation statistics, the method co-optimizes weight sparsification and low-bit quantization across layers. Extensive experiments on multiple LLMs and standard benchmarks show that the approach significantly outperforms state-of-the-art pruning and quantization methods, retaining high accuracy even at aggressive compression ratios and making it well suited to resource-constrained edge deployment.
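The summary does not write out the sparse-constrained problem. A plausible layer-wise form, assuming the standard calibration-based compression objective (this exact formulation is an assumption, not quoted from the paper), with $W$ the original layer weights, $X$ a matrix of calibration activations, and $k$ the sparsity budget, is:

```latex
\min_{\widehat{W}} \; \bigl\| (\widehat{W} - W)\,X \bigr\|_F^2
\quad \text{subject to} \quad \|\widehat{W}\|_0 \le k .
```

The activation matrix $X$ weights each column of $W$ by how strongly the corresponding input channel fires on calibration data, which is what makes the pruning "activation-aware" rather than plain magnitude-based.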
📝 Abstract
To address the enormous size of Large Language Models (LLMs), model compression methods such as quantization and pruning are widely used, especially for deployment on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees for the pruning component are also provided.
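The abstract describes the method only at a high level. As a minimal sketch of the IHT-style loop it alludes to, the snippet below alternates a gradient step on the layer-wise calibration loss $\|(\widehat{W}-W)X\|_F^2$ with hard thresholding onto the $\ell_0$ ball (keep the top-$k$ magnitudes). The function name, step-size rule, and hyperparameters are illustrative assumptions, not the paper's actual algorithm, which also handles quantization and structured sparsity patterns.

```python
import numpy as np

def iht_prune(W, X, keep_frac=0.5, steps=200):
    """Sketch of activation-aware pruning via iterative hard thresholding.

    Minimizes ||(W_hat - W) X||_F^2 subject to keeping at most
    keep_frac * W.size nonzero weights. W: (out, in) weights;
    X: (in, n_samples) calibration activations; 0 < keep_frac <= 1.
    """
    H = X @ X.T                                   # activation second moment, (in, in)
    lr = 1.0 / (2.0 * np.linalg.norm(H, 2) + 1e-12)  # 1/L with L the gradient Lipschitz const.
    k = int(keep_frac * W.size)                   # number of weights to keep
    W_hat = W.copy()
    for _ in range(steps):
        grad = 2.0 * (W_hat - W) @ H              # gradient of the quadratic loss
        W_hat = W_hat - lr * grad                 # gradient step
        # projection onto the l0 ball: zero out all but the k largest magnitudes
        flat = np.abs(W_hat).ravel()
        thresh = np.partition(flat, flat.size - k)[flat.size - k]
        W_hat[np.abs(W_hat) < thresh] = 0.0
    return W_hat
```

With the step size at the inverse Lipschitz constant of the gradient, each gradient-plus-thresholding step is non-increasing in the calibration loss, so the result is never worse (on calibration data) than one-shot magnitude pruning; the paper's convergence analysis presumably formalizes guarantees of this kind.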