🤖 AI Summary
This work addresses the training and inference inefficiency of large language models (LLMs) by proposing a hardware-friendly, activation-aware 2:4 structured sparsity method. Unlike conventional approaches that impose sparsity only on weights, this method extends 2:4 sparsity to the activation domain, leveraging the intrinsic sparsity of the Squared-ReLU activation function to accelerate both the forward and backward passes without retraining or fine-tuning. Custom CUDA kernels and a sparse computation scheduler for feed-forward network (FFN) layers achieve up to 1.3× speedup in FFN computation while preserving the original model's accuracy. The approach overcomes a key technical bottleneck in applying structured sparsity to activations, pointing toward a practical path for efficient LLM training and deployment.
📝 Abstract
In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
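To make the core idea concrete, here is a minimal numpy sketch of the two ingredients the abstract combines: Squared-ReLU, which naturally produces exact zeros in activations, and the 2:4 structured pattern, which keeps at most 2 of every contiguous group of 4 values. This is an illustrative reference implementation only, not the paper's CUDA kernels; the function names `squared_relu` and `prune_2_4` are hypothetical.

```python
import numpy as np

def squared_relu(x):
    # Squared-ReLU: max(x, 0)^2 — negative inputs map to exact zeros,
    # which is the intrinsic activation sparsity the paper exploits.
    return np.maximum(x, 0.0) ** 2

def prune_2_4(x):
    # Enforce the hardware 2:4 pattern along the last axis: in every
    # contiguous group of 4 values, keep the 2 largest magnitudes and
    # zero the rest. (Illustrative only; real kernels do this on-GPU.)
    groups = x.reshape(-1, 4)
    out = np.zeros_like(groups)
    top2 = np.argsort(np.abs(groups), axis=1)[:, -2:]  # 2 largest per group
    rows = np.arange(groups.shape[0])[:, None]
    out[rows, top2] = groups[rows, top2]
    return out.reshape(x.shape)

acts = squared_relu(np.array([-1.0, 0.5, 2.0, -0.3, 1.0, 3.0, 0.2, 0.1]))
sparse = prune_2_4(acts)
```

Because Squared-ReLU already zeroes many entries, the 2:4 pruning step often discards little or nothing, which is why the paper can report no accuracy loss while still presenting activations to the GPU in a sparse-tensor-core-friendly layout.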