ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models

📅 2024-02-21
🏛️ arXiv.org
📈 Citations: 15
Influential: 0
🤖 AI Summary
Low activation sparsity in large language models (LLMs) during inference, caused by non-sparse activation functions such as GELU and Swish, limits computational efficiency. To address this, the paper proposes ProSparse, a sparsification framework that replaces the original activation function with ReLU and then applies progressive sparsity regularization, with a penalty factor that increases smoothly along multi-stage sine curves. This gradual schedule raises activation sparsity during training while avoiding abrupt shifts in activation distributions, preserving model stability and accuracy. ProSparse reaches 89.32% activation sparsity on LLaMA2-7B, 88.80% on LLaMA2-13B, and 87.89% on MiniCPM-1B, the highest reported among open-source LLaMA variants and competitive end-size models, while maintaining accuracy comparable to the original Swish-activated models and delivering up to 4.52× inference speedup. The result is a scalable, practical recipe for efficient LLM deployment via activation sparsification.
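To make the schedule concrete, here is a minimal sketch (not the authors' released code) of a regularization factor that rises smoothly along multi-stage sine curves; the stage boundaries and peak values are illustrative placeholders.

```python
import math

# A minimal sketch of a multi-stage sine schedule for the sparsity
# regularization factor. Stage boundaries and peak factors below are
# illustrative placeholders, not the paper's actual hyperparameters.
def progressive_factor(step, stages):
    """stages: list of (start_step, end_step, start_factor, end_factor)."""
    for start, end, low, high in stages:
        if start <= step < end:
            # Within a stage, ramp from low to high along a quarter
            # sine curve, so the penalty strength never jumps abruptly.
            t = (step - start) / (end - start)
            return low + (high - low) * math.sin(0.5 * math.pi * t)
    return stages[-1][3]  # hold the final factor after the last stage

# Example: two stages that progressively strengthen an L1-style penalty
# applied to the ReLU activations during training.
stages = [(0, 1000, 0.0, 1e-5), (1000, 3000, 1e-5, 5e-5)]
for s in (0, 500, 1500, 3500):
    print(s, progressive_factor(s, stages))
```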

📝 Abstract
Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most large language models (LLMs) adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52× inference speedup.
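As a rough illustration of the sparsity figures reported above (not the paper's evaluation code), the sketch below measures activation sparsity as the fraction of exactly-zero entries in a ReLU feed-forward activation; the PyTorch layer and its dimensions are arbitrary placeholders.

```python
import torch

# Measure activation sparsity: the fraction of zero entries in the
# ReLU output of a feed-forward projection. Dimensions are toy values,
# not the actual LLaMA2 or MiniCPM sizes.
torch.manual_seed(0)
hidden, inner = 512, 2048
up_proj = torch.nn.Linear(hidden, inner)

x = torch.randn(32, hidden)            # a batch of hidden states
act = torch.relu(up_proj(x))           # ReLU-gated intermediate activations
sparsity = act.eq(0).float().mean().item()
print(f"activation sparsity: {sparsity:.2%}")
```

With randomly initialized weights this lands near 50%; the point of ProSparse-style training is to push the ratio far higher without hurting accuracy.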
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Activation Sparsity
Efficient Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProSparse
Activation Sparsity
Language Models
Chenyang Song
PhD student, Tsinghua University
large language model, efficient architecture
Xu Han
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Zhengyan Zhang
Tsinghua University
Natural Language Processing, Large Language Models
Shengding Hu
Tsinghua University
LLM, Artificial Super Intelligence
Xiyu Shi
Institute for Digital Technologies, Loughborough University London
Speech signal processing, mobile and wireless communication, network security, Internet of Things
Kuai Li
Tencent Machine Learning Platform, China
Chen Chen
Tencent Machine Learning Platform, China
Zhiyuan Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Guanglin Li
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, China
Tao Yang
Tencent Machine Learning Platform, China
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing