BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

📅 2025-07-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high energy consumption caused by data movement in large language model (LLM) inference, this paper proposes an efficient block-wise sparsification method for Transformer linear layers. The authors introduce a hardware-agnostic block sparsity pattern that enables up to 95% sparsity with negligible accuracy degradation, and pair iterative sparsification with a highly optimized fused sparse matrix-matrix multiplication (SpMM) kernel. The approach delivers end-to-end acceleration across diverse hardware architectures and datasets: experiments show up to a 16.7× speedup for MLP computation, a 1.6× end-to-end inference speedup, a 1.11× pretraining speedup, and a 3.12× reduction in inference memory footprint. The method achieves a superior trade-off among accuracy, throughput, and energy efficiency for large-scale LLM deployment.
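
The iterative block-wise sparsification described above can be sketched as block-level magnitude pruning. This is a minimal illustration, not the paper's method: the function name `block_sparsify`, the Frobenius-norm pruning criterion, and the linear sparsity ramp are all assumptions (BLaST interleaves sparsification with training, which is omitted here).

```python
import numpy as np

def block_sparsify(W, block=4, target_sparsity=0.95, steps=5):
    """Iteratively zero out the weight blocks with the smallest
    Frobenius norm until `target_sparsity` of all blocks are pruned.

    Hypothetical sketch; in practice each pruning step would be
    followed by further training to recover accuracy.
    """
    rows, cols = W.shape
    assert rows % block == 0 and cols % block == 0
    W = W.copy()
    nb_r, nb_c = rows // block, cols // block
    for step in range(1, steps + 1):
        # Ramp the sparsity up linearly instead of pruning all at once.
        frac = target_sparsity * step / steps
        k = int(frac * nb_r * nb_c)  # number of blocks pruned so far
        if k == 0:
            continue
        # View W as an (nb_r, block, nb_c, block) grid of tiles.
        tiles = W.reshape(nb_r, block, nb_c, block)
        # Frobenius norm of each (block x block) tile.
        scores = np.linalg.norm(tiles, axis=(1, 3))
        # Zero the k lowest-scoring tiles (already-zero tiles stay zero).
        flat = np.argpartition(scores.ravel(), k - 1)[:k]
        r, c = np.unravel_index(flat, scores.shape)
        tiles[r, :, c, :] = 0.0
        W = tiles.reshape(rows, cols)
    return W
```

The resulting matrix keeps a regular block pattern, which is what makes the subsequent SpMM hardware-friendly regardless of which blocks survive.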

📝 Abstract
The energy consumption of large-scale ML models is dominated by data movement: shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
Problem

Research questions and friction points this paper is trying to address.

Reducing energy use in large ML models via sparsification
Minimizing accuracy loss while pruning redundant model parameters
Optimizing sparse matrix operations for faster inference and pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-agnostic block sparsity pattern for Transformer linear layers
Iterative sparsification of weight matrices into a pattern suited to efficient SpMM
Fused Sparse MLP kernel for significant speedup
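
The SpMM the fused kernel accelerates can be written as a plain reference computation: store only the nonzero tiles and accumulate their partial products. The names `blocks_from_dense` and `sparse_mlp_matmul` are assumptions for illustration; the paper's kernel additionally fuses this with the MLP's activation and runs on GPU.

```python
import numpy as np

def blocks_from_dense(W, block=4):
    """Extract (block_row, block_col, tile) triples for the nonzero
    tiles of a block-sparse weight matrix."""
    nb_r, nb_c = W.shape[0] // block, W.shape[1] // block
    out = []
    for i in range(nb_r):
        for j in range(nb_c):
            tile = W[i*block:(i+1)*block, j*block:(j+1)*block]
            if np.any(tile):
                out.append((i, j, tile))
    return out

def sparse_mlp_matmul(X, blocks, out_cols, block=4):
    """Compute X @ W while touching only the stored nonzero tiles.

    Readable reference implementation of block SpMM; at 95% sparsity
    roughly 20x fewer tile products are performed than in the dense case.
    """
    Y = np.zeros((X.shape[0], out_cols))
    for i, j, tile in blocks:
        # Each tile contributes X's i-th block-column times the tile,
        # accumulated into Y's j-th block-column.
        Y[:, j*block:(j+1)*block] += X[:, i*block:(i+1)*block] @ tile
    return Y
```

Because the tiles are dense inside, each inner product maps onto a regular small GEMM, which is why the block pattern reaches hardware efficiency that unstructured sparsity cannot.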