🤖 AI Summary
To address the high energy consumption caused by data movement in large language model (LLM) inference, this paper proposes an efficient block-wise sparsification method tailored for Transformer linear layers. We introduce a novel, hardware-agnostic block sparsity pattern that enables up to 95% sparsity with negligible accuracy degradation, and we integrate iterative sparsification with a highly optimized fused sparse matrix-matrix multiplication (SpMM) kernel. Our approach delivers end-to-end acceleration across diverse hardware architectures and datasets. Experiments show up to a 16.7× speedup for MLP computation, a 1.6× end-to-end inference acceleration, a 1.11× pretraining speedup, and a 3.12× reduction in inference memory footprint. Crucially, the method achieves a superior trade-off among accuracy, throughput, and energy efficiency, establishing a new paradigm for efficient LLM deployment.
📝 Abstract
The energy consumption of large-scale ML models is dominated by data movement: shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix multiplication (SpMM). BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup, and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
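To make the block-sparsity idea concrete, here is a minimal one-shot sketch of block-wise magnitude pruning: the weight matrix is tiled into fixed-size blocks, each block is scored by its L2 norm, and the lowest-scoring blocks are zeroed to reach a target sparsity. Note this is only an illustration of the pattern the paper targets; BLaST itself sparsifies iteratively during training, and the `block` size and `sparsity` values below are illustrative choices, not parameters from the paper.

```python
import numpy as np

def block_sparsify(W, block=32, sparsity=0.75):
    """Zero out the lowest-L2-norm (block x block) tiles of W.

    One-shot illustration of a block sparsity pattern; the paper's
    method prunes iteratively. `block` and `sparsity` are assumptions.
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    # View W as a grid of (block x block) tiles: index (I, i, J, j)
    # maps to W[I*block + i, J*block + j].
    tiles = W.reshape(rows // block, block, cols // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))  # per-tile L2 norm
    # Keep only the k highest-norm tiles (at least one).
    k = max(int(norms.size * (1 - sparsity)), 1)
    thresh = np.sort(norms, axis=None)[::-1][k - 1]
    mask = (norms >= thresh).astype(W.dtype)
    return (tiles * mask[:, None, :, None]).reshape(rows, cols)
```

Zeroing whole tiles (rather than individual weights) is what makes the resulting matrix amenable to efficient SpMM kernels: the nonzero blocks stay dense, so each surviving tile can be multiplied with contiguous, hardware-friendly dense micro-kernels.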