MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity

📅 2025-11-17

🤖 AI Summary
Sparse matrix–vector multiplication (SpMV) is inefficient at the low, unstructured sparsity levels (30–90%) typical of pruned large language models (LLMs), so unstructured pruning has so far yielded little memory compression or acceleration at inference time. This paper proposes MACKO-SpMV: a co-designed GPU-native sparse storage format and custom CUDA kernel. It requires neither specialized hardware units (e.g., tensor cores) nor format-specific precomputation. At 50% sparsity, it achieves a 1.5× memory reduction and a 1.2–1.5× speedup over the dense representation. Against other SpMV baselines, it is 2.8–13.0× faster than cuSPARSE, 1.9–2.6× faster than Sputnik, and 2.2–2.5× faster than DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers a 1.5× memory reduction and 1.5× faster inference at FP16 precision. By removing hardware dependencies, MACKO-SpMV makes unstructured pruning practical for real-world LLM deployment.

📝 Abstract
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to deliver a significant 1.5x memory reduction and 1.2-1.5x speedup over the dense representation. Speedups over other SpMV baselines are 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
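
To see why low sparsity gives conventional formats so little room for memory reduction, the back-of-the-envelope sketch below estimates the storage cost of a textbook CSR layout. It is an illustration only and does not describe the MACKO format. The function names, the 4096×4096 matrix shape, and the choice of 16-bit values with 32-bit column indices are all assumptions made for this example.

```python
def csr_bits_per_element(density, value_bits=16, index_bits=32,
                         rows=4096, cols=4096):
    """Average storage cost, in bits per matrix element, of a CSR matrix.

    Assumed layout (textbook CSR, NOT the MACKO format): one value and one
    column index per nonzero, plus rows + 1 row pointers.
    """
    nnz = density * rows * cols
    total_bits = nnz * (value_bits + index_bits) + (rows + 1) * index_bits
    return total_bits / (rows * cols)


def compression_vs_dense(density, value_bits=16, index_bits=32,
                         rows=4096, cols=4096):
    """Dense size divided by CSR size; values below 1 mean CSR is larger."""
    csr_bits = csr_bits_per_element(density, value_bits, index_bits, rows, cols)
    return value_bits / csr_bits


if __name__ == "__main__":
    for sparsity in (0.3, 0.5, 0.7, 0.9):
        ratio = compression_vs_dense(1.0 - sparsity)
        print(f"sparsity {sparsity:.0%}: CSR compression vs dense = {ratio:.2f}x")
```

Under these assumptions, each nonzero costs 48 bits, so at 50% sparsity CSR needs roughly 24 bits per matrix element against 16 for the dense fp16 matrix, which means it is larger than the matrix it replaces. CSR only breaks even at about two-thirds sparsity, so any format targeting the 30-90% range has to spend far fewer bits on index metadata.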

Problem

Research questions and friction points this paper is trying to address.

Optimizing SpMV for low sparsity in pruned LLMs
Reducing storage overhead while maintaining GPU compatibility
Enabling efficient unstructured pruning without specialized hardware

Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-optimized format and kernel co-design
Efficient SpMV for unstructured sparsity
Reduces storage overhead without specialized hardware
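
For reference, the operation being optimized can be written down in a few lines. The sketch below is a plain CSR SpMV in pure Python, included only to define what y = Wx means when W is stored sparsely. It is a baseline illustration, not the MACKO format or kernel, and the helper names `dense_to_csr` and `csr_spmv` are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists dense matrix into CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r ends in values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = W @ x, touching only the stored nonzeros of W."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # col_idx[k] selects which entry of x this nonzero multiplies.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    W = [[0, 2, 0, 1],
         [3, 0, 0, 0],
         [0, 0, 0, 0]]
    x = [1, 2, 3, 4]
    print(csr_spmv(*dense_to_csr(W), x))
```

The inner loop reads `x` at positions dictated by `col_idx`, so with unstructured sparsity the access pattern is irregular and every nonzero carries its own index. Those two costs are what a format and kernel aimed at 30-90% sparsity must keep small.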