Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 1 (influential: 0)
🤖 AI Summary
This work addresses the lack of systematic, quantitative analysis of activation sparsity in large language models (LLMs). The authors propose a performance-aware sparsity metric, PPL-$p\%$, and empirically identify three regularities in decoder-only Transformer models: (1) activation sparsity converges as a power law of the training data amount; (2) the activation ratio ($1-\mathrm{sparsity}$) increases linearly with the width-depth ratio below a bottleneck point, so deeper architectures at a fixed parameter scale are sparser; and (3) the limit of activation sparsity varies only weakly with the total parameter count. Controlled experiments across activation functions (ReLU vs. SiLU) and model scales (varying width, depth, and parameter count) show that the two functions reach comparable task performance, but ReLU yields higher activation sparsity and exploits additional training data more efficiently to increase it. These findings provide a quantitative basis for designing efficient and interpretable LLM architectures and favor deeper over wider models at a fixed parameter budget.
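The summary defines PPL-$p\%$ only at a high level: the sparsity a model can reach while its perplexity stays within a $p\%$ budget. Below is a minimal sketch of one plausible reading in PyTorch: zero out activations below a magnitude threshold and binary-search the largest threshold that keeps perplexity within the budget. `ToyFFN`, the synthetic `ppl` loss, and the thresholding scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFFN(nn.Module):
    """Toy stand-in for an LLM feed-forward block; hook a real model in practice."""
    def __init__(self, d=64, d_ff=256):
        super().__init__()
        self.up = nn.Linear(d, d_ff)
        self.down = nn.Linear(d_ff, d)
        self.act = nn.ReLU()
        self.threshold = 0.0        # magnitude threshold for zeroing weak activations
        self.last_sparsity = 0.0

    def forward(self, x):
        h = self.act(self.up(x))
        mask = h.abs() > self.threshold             # keep only "strong" activations
        self.last_sparsity = 1.0 - mask.float().mean().item()
        return self.down(h * mask)

@torch.no_grad()
def ppl(model, x, y):
    """Placeholder 'perplexity': exp of a regression loss on synthetic data."""
    return torch.exp(F.mse_loss(model(x), y)).item()

@torch.no_grad()
def ppl_p_sparsity(model, x, y, p=0.01, lo=0.0, hi=5.0, iters=30):
    """Largest sparsity whose 'PPL' stays within (1 + p) times the dense PPL."""
    model.threshold = 0.0
    base = ppl(model, x, y)
    for _ in range(iters):                          # binary-search the threshold
        mid = 0.5 * (lo + hi)
        model.threshold = mid
        if ppl(model, x, y) <= base * (1.0 + p):
            lo = mid                                # budget respected: push higher
        else:
            hi = mid                                # too lossy: back off
    model.threshold = lo
    ppl(model, x, y)                                # refresh last_sparsity at the final threshold
    return model.last_sparsity

model = ToyFFN()
x = torch.randn(32, 64)
y = model(x)                                        # dense outputs as the "target"
print(f"PPL-1% sparsity (toy): {ppl_p_sparsity(model, x, y):.3f}")
```

With a real LLM, `ppl` would be true perplexity on held-out text and the mask would be applied inside each FFN via forward hooks; the binary search itself carries over unchanged.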

📝 Abstract
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
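The two law families named in the abstract (a convergent increasing power law for SiLU, a decreasing logspace power law for ReLU) can be fitted directly once activation ratios are measured at several training checkpoints. A hedged sketch with SciPy follows; the exact functional forms, the synthetic data, and all coefficients are assumptions for illustration, since the abstract does not state closed forms.

```python
import numpy as np
from scipy.optimize import curve_fit

def convergent_power_law(D, A_lim, c, alpha):
    """Increasing toward A_lim: A(D) = A_lim - c * D^(-alpha)  (assumed SiLU form)."""
    return A_lim - c * D ** (-alpha)

def logspace_power_law(D, A_lim, c, alpha):
    """Decreasing toward A_lim: A(D) = A_lim + c * log10(D)^(-alpha)  (assumed ReLU form)."""
    return A_lim + c * np.log10(D) ** (-alpha)

# Synthetic checkpoints: activation ratio A = 1 - sparsity vs. training tokens D.
rng = np.random.default_rng(0)
D = np.logspace(9, 10.7, 8)                                     # 1e9 .. ~5e10 tokens
A_silu = convergent_power_law(D, 0.80, 4e3, 0.47) + rng.normal(0, 0.003, D.size)
A_relu = logspace_power_law(D, 0.10, 220.0, 3.0) + rng.normal(0, 0.003, D.size)

# Recover the limit activation ratios from the noisy measurements.
p_silu, _ = curve_fit(convergent_power_law, D, A_silu, p0=[0.9, 1e3, 0.4], maxfev=20000)
p_relu, _ = curve_fit(logspace_power_law, D, A_relu, p0=[0.2, 100.0, 2.5], maxfev=20000)
print(f"SiLU limit activation ratio ~ {p_silu[0]:.3f}")         # higher limit: less sparse
print(f"ReLU limit activation ratio ~ {p_relu[0]:.3f}")         # lower limit: sparser
```

The asymmetry in the fitted limits is the practical point of the abstract's first finding: under the SiLU law the activation ratio climbs toward its limit as data grows, while under the ReLU law it keeps falling, so more training data buys more sparsity.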
Problem

Research questions and friction points this paper is trying to address.

Study activation sparsity in large language models.
Propose PPL-$p\%$ sparsity metric for activation functions.
Analyze sparsity trends with training data and architectures.
Innovation

Methods, ideas, or system contributions that make the work stand out.

PPL-$p\%$ sparsity metric
ReLU enhances activation sparsity
Deeper architectures improve sparsity
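The width-depth finding above suggests a quick back-of-envelope check when choosing model shapes at a fixed parameter budget: below the bottleneck point the activation ratio is roughly affine in width/depth, so the deeper configuration wins on sparsity. The sketch below encodes that comparison; the intercept, slope, and bottleneck values are hypothetical placeholders, not the paper's fitted coefficients.

```python
def predicted_activation_ratio(width, depth, intercept=0.10, slope=0.002, r_max=128.0):
    """Assumed form: activation ratio rises linearly with width/depth up to a
    bottleneck r_max, then flattens. All coefficients here are hypothetical."""
    r = min(width / depth, r_max)
    return intercept + slope * r

# Two shapes with roughly equal parameter count (params scale ~ depth * width^2):
wide_shallow = predicted_activation_ratio(width=3072, depth=16)   # r = 192, capped at r_max
narrow_deep = predicted_activation_ratio(width=2048, depth=36)    # r ~ 57, below bottleneck

print(f"wide/shallow: activation ratio ~ {wide_shallow:.2f}, sparsity ~ {1 - wide_shallow:.2f}")
print(f"narrow/deep:  activation ratio ~ {narrow_deep:.2f}, sparsity ~ {1 - narrow_deep:.2f}")
```

Both configurations have about 1.5e8 FFN-dominated parameters, yet the narrow, deep one lands at a lower activation ratio, illustrating the deeper-is-sparser trend at a fixed scale.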
👥 Authors
Yuqi Luo (Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China)
Chenyang Song (PhD student, Tsinghua University; interests: large language model, efficient architecture)
Xu Han (Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China)
Yingfa Chen (PhD at Tsinghua University; interests: machine learning, long-context modeling, language modeling)
Chaojun Xiao (Postdoctoral Researcher, Tsinghua University; interests: Large Language Model)
Zhiyuan Liu (Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China)
Maosong Sun (Professor of Computer Science and Technology, Tsinghua University; interests: Natural Language Processing, Artificial Intelligence, Social Computing)