🤖 AI Summary
To address the high computational and memory overheads caused by dense weights in deep learning model deployment, this paper proposes an efficient GPU sparse matrix multiplication (SpMM) framework tailored for general N:M sparsity patterns. We introduce a top-down performance modeling methodology designed specifically for N:M sparsity, enabling systematic analysis and optimization. Our approach combines a hierarchical, block-wise execution strategy with sparsity-aware memory access scheduling and instruction-level pipelining, achieving, for the first time on general N:M sparse matrices, performance close to the theoretical peak. Experiments show average speedups of 2.1x over nmSPARSE and 1.4x to 6.3x over cuBLAS dense GEMM across diverse benchmarks; our implementation attains over 92% of the theoretical sparsity-bound performance ceiling. The open-source implementation is publicly available on GitHub.
📝 Abstract
Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response, weight pruning, particularly via N:M sparse matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides a tunable trade-off between performance and model accuracy, but it introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity. Building on this model, we propose NM-SpMM, an efficient implementation for general N:M sparsity. Guided by our performance analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, and introduces memory access optimization and pipeline design as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1x faster than nmSPARSE (the state of the art for general N:M sparsity) and 1.4x to 6.3x faster than cuBLAS's dense GEMM, closely approaching the theoretical maximum speedup afforded by the reduction in computation due to sparsity. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
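To make the N:M sparsity pattern concrete: in every group of M consecutive weights along a row, at most N are nonzero (e.g., 2:4 keeps half the weights). The sketch below shows the commonly used magnitude-based variant of N:M pruning in NumPy; it is an illustration of the pattern itself, not the paper's NM-SpMM kernel, and the function name `prune_nm` is our own.

```python
import numpy as np

def prune_nm(weights, n, m):
    """Prune a dense matrix to N:M sparsity: in every group of M
    consecutive values along a row, keep only the N entries with
    the largest magnitudes and zero out the rest."""
    w = np.asarray(weights, dtype=float)
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of M"
    # View each row as (cols // m) groups of m values.
    groups = w.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries per group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.array([[0.9, -0.1, 0.4, 0.05],
              [0.2, -0.8, 0.3, 0.6]])
# 2:4 sparsity: keep the 2 largest-magnitude values in each group of 4.
print(prune_nm(w, 2, 4))
```

An SpMM kernel such as the one proposed here multiplies the compressed form of this matrix (nonzero values plus per-group indices) against a dense matrix, so the 50% of zeroed entries cost neither memory traffic nor multiply-accumulates.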