🤖 AI Summary
Sparse matrix–vector multiplication (SpMV) is inefficient at the low, unstructured sparsity levels (30–90%) typical of pruned large language models (LLMs), which limits the memory compression and speedup that unstructured pruning can deliver at inference time. This paper proposes MACKO-SpMV, a co-designed GPU-native sparse storage format and custom CUDA kernel. MACKO-SpMV requires no specialized hardware units (e.g., tensor cores) and no format-specific precomputation. At 50% sparsity, it achieves 1.5× memory reduction and 1.2–1.5× speedup over the dense representation. Against other SpMV baselines, it is 2.8–13.0× faster than cuSPARSE, 1.9–2.6× faster than Sputnik, and 2.2–2.5× faster than DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers 1.5× memory reduction and 1.5× faster inference at FP16 precision. By removing the dependency on specialized hardware and working with standard pruning methods such as Wanda, MACKO-SpMV makes unstructured pruning more practical for real-world LLM deployment.
📝 Abstract
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to deliver a significant 1.5x memory reduction and 1.2-1.5x speedup over the dense representation. Compared with other SpMV baselines, it achieves speedups of 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
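To make the storage-overhead problem concrete, the sketch below implements SpMV over the standard CSR (compressed sparse row) format and compares its footprint with a dense fp16 matrix. This is not the MACKO format, which the abstract does not describe. The function names, the 32-bit index width, and the 4096 × 4096 matrix size are assumptions chosen for illustration.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix (list of lists) into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # nonzero value
                col_idx.append(j)  # its column index
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x where A is stored in CSR."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_bytes(rows, cols, value_bytes=2):
    """Footprint of a dense matrix; 2 bytes per value corresponds to fp16."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, sparsity, value_bytes=2, index_bytes=4, ptr_bytes=4):
    """Footprint of CSR, assuming 32-bit column indices and row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * ptr_bytes


A = [[1, 0, 2, 0],
     [0, 0, 0, 3],
     [0, 4, 0, 0]]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_spmv(values, col_idx, row_ptr, [1, 2, 3, 4])

# Hypothetical 4096 x 4096 fp16 weight matrix at 50% sparsity.
ratio = csr_bytes(4096, 4096, 0.5) / dense_bytes(4096, 4096)
print(y, f"CSR/dense size ratio at 50% sparsity: {ratio:.2f}")
```

Under these assumptions each nonzero costs 6 bytes (2 for the value, 4 for the index), so at 50% sparsity CSR is larger than the dense matrix rather than smaller. A format has to spend far fewer bits per nonzero on position metadata to reach the 1.5x reduction reported in the abstract.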