Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models

📅 2025-08-04
🤖 AI Summary
To address the high computational cost of linear layers during the prefill phase of large language models (LLMs) and the limited generalizability of existing N:M sparsity methods, which predominantly target weights, we propose Amber Pruner, the first training-free N:M activation sparsity method. Designed for linear projection layers, Amber Pruner is further integrated with post-training W8A8 quantization under the proposed Outstanding-sparse unified framework, accelerating inference while preserving model accuracy. Experiments across diverse LLMs show that Amber Pruner sparsifies and accelerates more than 55% of linear-layer computation under multiple N:M patterns (e.g., 2:4, 4:8, 8:16) without degrading generation quality. Its core contributions are the first training-free N:M activation sparsity scheme and empirical evidence of strong cross-model generalizability and practical efficacy.

📝 Abstract
In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.
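For concreteness, here is a minimal sketch of what N:M activation sparsity means for a linear layer's input: in every contiguous group of M values along the hidden dimension, at most N entries are kept and the rest are zeroed. The magnitude-based selection rule and the function name nm_sparsify_activation are illustrative assumptions; the paper's exact scoring criterion for Amber Pruner is not reproduced here.

```python
import torch

def nm_sparsify_activation(x: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Zero all but the n largest-magnitude entries in every group of m
    values along the hidden (last) dimension.

    Magnitude-based selection is an assumption for illustration; the
    paper's exact scoring rule may differ.
    """
    assert x.shape[-1] % m == 0, "hidden dimension must be divisible by m"
    groups = x.reshape(-1, m)                       # one row per group of m
    keep = groups.abs().topk(n, dim=-1).indices     # n largest |values| per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(x.shape)

# 2:4 pattern: at most 2 nonzeros in every group of 4 hidden values
x = torch.randn(8, 16)                              # (tokens, hidden)
x_sparse = nm_sparsify_activation(x, n=2, m=4)
```

On hardware with sparse tensor cores, this structured pattern is what enables skipping the zeroed multiplications; in pure PyTorch the mask only emulates the effect.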
Problem

Research questions and friction points this paper is trying to address.

Accelerating LLM prefill with N:M activation sparsity
Avoiding the accuracy degradation of weight-sparsity methods and the training dependence of prior activation-sparsity methods
Integrating sparsity and quantization for efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free N:M activation sparsity method
Accelerates linear projection layers in LLMs
Integrates with W8A8 quantization for efficiency (see the sketch after this list)
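To illustrate how the sparsity step could compose with post-training W8A8 quantization (the Outstanding-sparse idea), here is a hedged sketch that builds on nm_sparsify_activation from the earlier block. The symmetric per-tensor INT8 scheme and the function names are assumptions; the paper's actual quantizer may differ.

```python
import torch

def w8a8_quantize(t: torch.Tensor):
    """Symmetric per-tensor INT8 quantization (a simplified stand-in for
    the paper's post-training W8A8 scheme)."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def sparse_w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """2:4-sparsify the activations, then evaluate the linear layer in
    INT8 arithmetic and dequantize the INT32 accumulator."""
    x_sp = nm_sparsify_activation(x, n=2, m=4)      # from the sketch above
    xq, sx = w8a8_quantize(x_sp)                    # A8: int8 activations
    wq, sw = w8a8_quantize(weight)                  # W8: int8 weights
    acc = xq.to(torch.int32) @ wq.to(torch.int32).t()
    return acc.to(torch.float32) * (sx * sw)

y = sparse_w8a8_linear(torch.randn(8, 16), torch.randn(32, 16))  # ~ x_sparse @ W.T
```

In a real deployment both the INT8 GEMM and the N:M pattern would be executed by dedicated hardware paths rather than emulated in floating point as here.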
👥 Authors
Tai An (University of Rochester)
Ruwu Cai (Huawei Technologies Co., Ltd)
Yanzhe Zhang (Huawei Technologies Co., Ltd)
Yang Liu (Huawei Technologies Co., Ltd)
Hao Chen (Huawei Technologies Co., Ltd)
Pengcheng Xie (Huawei Technologies Co., Ltd)
Sheng Chang (Huawei Technologies Co., Ltd)
Yiwu Yao (Peking University)
Gongyi Wang (Huawei Technologies Co., Ltd)