Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the training and inference efficiency of large language models (LLMs) with a hardware-friendly 2:4 structured sparsity method applied to activations. Unlike conventional approaches that impose sparsity only on weights, the method extends 2:4 sparsity to the activation domain, exploiting the intrinsic sparsity of the Squared-ReLU activation function to accelerate both the forward and backward passes without retraining or fine-tuning. Custom CUDA kernels and a sparse computation scheduler for feed-forward network (FFN) layers yield up to a 1.3× speedup in FFN computation, accelerating end-to-end training and inference while preserving the original model's accuracy. The approach removes a key obstacle to applying structured sparsity to activations rather than weights.
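The core operation the summary describes — enforcing a 2:4 pattern on Squared-ReLU activations — can be sketched in NumPy (a minimal illustration of the sparsity pattern, not the paper's CUDA kernels; all function names here are illustrative):

```python
import numpy as np

def squared_relu(x):
    # Squared-ReLU: relu(x)**2. Negative inputs map exactly to zero,
    # so the output is already highly sparse.
    return np.maximum(x, 0.0) ** 2

def prune_2_4(x):
    # Enforce 2:4 structured sparsity: within each contiguous group of
    # 4 values, keep the 2 largest-magnitude entries, zero the other 2.
    groups = x.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(x.shape)

x = np.array([0.5, -1.2, 2.0, -0.1, 1.5, 0.3, -0.7, 0.9])
a = squared_relu(x)   # -> [0.25, 0, 4.0, 0, 2.25, 0.09, 0, 0.81]
s = prune_2_4(a)      # -> [0.25, 0, 4.0, 0, 2.25, 0, 0, 0.81]
```

Because Squared-ReLU already zeroes roughly half of its inputs, many groups of 4 satisfy the 2:4 constraint before any pruning, which is why the pattern can be imposed on activations with no accuracy loss.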

📝 Abstract
In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations in order to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity of Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3× faster Feed-Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
Problem

Research questions and friction points this paper is trying to address.

Leveraging 2:4 sparsity to accelerate transformer training and inference
Using Squared-ReLU activations to enable sparsity without accuracy loss
Achieving faster Feed Forward Network operations in both passes
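The second point above rests on a simple property: Squared-ReLU sends every negative pre-activation exactly to zero, so for a roughly symmetric input distribution at least ~50% of activations are zeros before any pruning (trained models are reported to be sparser still). A hedged NumPy sketch of that floor, using random data as a stand-in for an FFN's hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean random pre-activations as a stand-in for an FFN hidden layer.
pre = rng.standard_normal((1024, 4096))
act = np.maximum(pre, 0.0) ** 2  # Squared-ReLU

# Fraction of exactly-zero activations: ~50% for symmetric inputs,
# since every negative pre-activation maps to zero.
sparsity = float(np.mean(act == 0.0))
```

With this much intrinsic sparsity, a 2:4 pattern (at most 2 nonzeros per group of 4) discards little information, which is what lets the sparse FFN matmuls run faster without hurting accuracy.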
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage 2:4 sparsity for GPU acceleration
Use Squared-ReLU for lossless sparsity
Achieve 1.3x faster FFN operations