Sparser, Faster, Lighter Transformer Language Models

📅 2026-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational and memory costs of large autoregressive language models, which stem primarily from the dense, parameter-heavy feedforward layers. To mitigate this, the authors propose an efficient sparse computation framework tailored to modern GPU architectures. By applying L1 regularization to induce unstructured sparsity, coupled with a custom CUDA kernel and a novel sparse packing format, the method achieves over 99% sparsity in feedforward layers with near-lossless performance on downstream tasks. The approach substantially improves throughput, energy efficiency, and memory utilization, with benefits that grow as model scale increases.
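The summary names L1 regularization as the mechanism that induces the sparsity. The paper's training code is not shown here, so the following is only an illustrative sketch: an L1 penalty is typically realized as a soft-thresholding (proximal) step that shrinks small weights to exactly zero, which is what makes measured sparsity well-defined. The weight shape and threshold below are hypothetical, not taken from the paper.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shrink each weight toward zero
    and set weights with magnitude below lam exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

rng = np.random.default_rng(0)
# Hypothetical FFN-shaped weight matrix; dimensions are illustrative.
W = rng.normal(scale=0.02, size=(4096, 11008))

# One proximal step with an illustrative threshold; during training this
# would follow each gradient update on the task loss.
W_sparse = soft_threshold(W, lam=0.05)

sparsity = (W_sparse == 0).mean()
print(f"sparsity: {sparsity:.1%}")
```

Because the zeros are exact (not merely small), downstream sparse kernels can skip them entirely; this is the property the packing format and CUDA kernels exploit.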

📝 Abstract
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
Problem

Research questions and friction points this paper is trying to address.

large language models
computational cost
sparsity
feedforward layers
model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

unstructured sparsity
sparse packing format
CUDA kernels
L1 regularization
efficient LLM inference
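The innovations above center on a sparse packing format paired with GPU kernels. The paper's custom format is not specified in this summary, so as a stand-in the sketch below packs a weight matrix into standard CSR (compressed sparse row) arrays to show why over 99% unstructured sparsity translates into large memory savings; all sizes and thresholds are illustrative assumptions.

```python
import numpy as np

def pack_csr(dense):
    """Pack a dense matrix into CSR arrays: nonzero values, their column
    indices, and row pointers delimiting each row's slice."""
    values, cols, indptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        indptr.append(len(values))
    return (np.asarray(values, dtype=dense.dtype),
            np.asarray(cols, dtype=np.int32),
            np.asarray(indptr, dtype=np.int32))

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)
W[np.abs(W) < 2.6] = 0.0  # zero out ~99% of entries

values, cols, indptr = pack_csr(W)
dense_bytes = W.nbytes
sparse_bytes = values.nbytes + cols.nbytes + indptr.nbytes
print(f"compression: {dense_bytes / sparse_bytes:.1f}x")
```

At roughly 99% sparsity the packed form is tens of times smaller than the dense matrix. A production format (like the paper's) would additionally lay the data out to match GPU memory-coalescing and tensor-core tiling constraints, which plain CSR does not.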