NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational and memory overheads caused by dense weights in deep learning model deployment, this paper proposes an efficient GPU sparse matrix multiplication (SpMM) framework tailored for general N:M sparsity patterns. We introduce a novel top-down performance modeling methodology specifically designed for N:M sparsity, enabling systematic analysis and optimization. Our approach features a hierarchical, block-wise execution strategy with sparsity-aware memory access scheduling and instruction-level pipelining optimizations—achieving, for the first time on general N:M sparse matrices, performance close to the theoretical peak. Experimental evaluation demonstrates average speedups of 2.1× over nmSPARSE and 1.4–6.3× over cuBLAS dense GEMM across diverse benchmarks; our implementation attains over 92% of the theoretical sparsity-bound performance ceiling. The open-source implementation is publicly available on GitHub.
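To make the sparsity pattern concrete, the sketch below shows one way a dense weight matrix can be compressed into an N:M layout: in every group of M consecutive elements along a row, at most N non-zeros are kept and stored as values plus their in-group column offsets. This is a minimal host-side illustration under assumed names and layout (compress_nm, NMSparse); it is not the actual NM-SpMM storage format.

```cuda
// Minimal host-side sketch (assumption): compress a dense row-major rows x cols
// matrix into an N:M-sparse layout. NNZ is the "N" of N:M (kept values per
// group), M is the group width along a row. Not the NM-SpMM API.
#include <cstdint>
#include <vector>

struct NMSparse {
    std::vector<float>   vals;  // rows * (cols / M) * NNZ kept values
    std::vector<uint8_t> idx;   // in-group offset (0..M-1) of each kept value
};

NMSparse compress_nm(const std::vector<float>& dense, int rows, int cols,
                     int NNZ, int M) {
    NMSparse s;
    for (int r = 0; r < rows; ++r) {
        for (int g = 0; g < cols; g += M) {           // one M-wide group per step
            int kept = 0;
            for (int c = g; c < g + M && kept < NNZ; ++c) {
                float v = dense[r * cols + c];
                if (v != 0.0f) {                      // keep the non-zeros
                    s.vals.push_back(v);
                    s.idx.push_back(static_cast<uint8_t>(c - g));
                    ++kept;
                }
            }
            while (kept++ < NNZ) {                    // pad groups with fewer non-zeros
                s.vals.push_back(0.0f);
                s.idx.push_back(0);
            }
        }
    }
    return s;
}
```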

📝 Abstract
Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight pruning, particularly through N:M sparse matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides an option for balancing performance and model accuracy, but introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity. Meanwhile, NM-SpMM is proposed as an efficient implementation for general N:M sparsity. Based on our performance analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, while memory access optimization and pipeline design are introduced as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1x faster than nmSPARSE (the state-of-the-art for general N:M sparsity) and 1.4x to 6.3x faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup resulting from the reduction in computation due to sparsity. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
Problem

Research questions and friction points this paper is trying to address.

Accelerates matrix multiplication with general N:M sparsity on GPGPUs.
Reduces the resource consumption of dense, over-parameterized deep learning models during deployment.
Balances performance and model accuracy through hierarchical blocking and sparsity-aware memory access optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical blocking mechanism that improves data locality (see the sketch after this list)
Sparsity-aware memory access optimization
Instruction-level pipeline design that approaches the theoretical peak performance
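A minimal CUDA sketch of the hierarchical-blocking idea follows, assuming the values/offsets layout from the compression sketch above: each thread block computes a TILE_M x TILE_N tile of the output, stages a TILE_K slice of the dense matrix B in shared memory, and each thread gathers from that shared tile using the stored in-group offsets of the N:M-compressed matrix A. It illustrates data locality only; NM-SpMM's sparsity-aware memory access scheduling and instruction-level pipelining are omitted.

```cuda
// Block-tiled N:M SpMM sketch (assumption, not the NM-SpMM kernel).
// A: rows x K, N:M-sparse, stored as vals/idx with NNZ entries per M-wide group.
// B: K x cols, dense, row-major.  C = A * B, rows x cols, row-major.
#include <cstdint>

#define TILE_M 16   // rows of C per thread block
#define TILE_N 16   // cols of C per thread block
#define TILE_K 32   // K elements staged per iteration (must be a multiple of M)

__global__ void nm_spmm_tiled(const float* __restrict__ vals,
                              const uint8_t* __restrict__ idx,
                              const float* __restrict__ B,
                              float* __restrict__ C,
                              int rows, int cols, int K, int NNZ, int M) {
    __shared__ float Bs[TILE_K][TILE_N];

    int row = blockIdx.y * TILE_M + threadIdx.y;
    int col = blockIdx.x * TILE_N + threadIdx.x;
    int groups_per_row = K / M;                      // M-wide groups along K
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // Cooperatively stage a TILE_K x TILE_N slice of B in shared memory.
        for (int kk = threadIdx.y; kk < TILE_K; kk += TILE_M) {
            int k = k0 + kk;
            Bs[kk][threadIdx.x] = (k < K && col < cols) ? B[k * cols + col] : 0.0f;
        }
        __syncthreads();

        if (row < rows && col < cols) {
            // Walk the compressed groups of A that fall inside this K tile.
            for (int g = k0 / M; g < (k0 + TILE_K) / M && g < groups_per_row; ++g) {
                const float*   v = vals + (row * groups_per_row + g) * NNZ;
                const uint8_t* o = idx  + (row * groups_per_row + g) * NNZ;
                for (int j = 0; j < NNZ; ++j)
                    acc += v[j] * Bs[g * M + o[j] - k0][threadIdx.x];
            }
        }
        __syncthreads();
    }
    if (row < rows && col < cols) C[row * cols + col] = acc;
}
```

A launch of dim3 block(TILE_N, TILE_M) and dim3 grid((cols + TILE_N - 1) / TILE_N, (rows + TILE_M - 1) / TILE_M) covers the output; a production kernel would additionally register-block and double-buffer the shared tiles to overlap memory traffic with computation, which is where the pipeline design above comes in.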
🔎 Similar Papers
No similar papers found.
Cong Ma
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China
Du Wu
Tokyo Institute of Technology
High Performance Computing (HPC)
Zhelang Deng
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Guangdong Institute of Intelligence Science and Technology, China
Jiang Chen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Tencent AI Lab, China
Xiaowen Huang
Shenzhen University, China
Jintao Meng
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
Wenxi Zhu
Tencent AI Lab, China
Bingqiang Wang
Peng Cheng Laboratory, China
Amelie Chi Zhou
Assistant Professor, HKBU, Hong Kong
High performance computing, Cloud computing, Big data analytics
Peng Chen
RIKEN Center for Computational Science, Japan
Minwen Deng
Tencent AI Lab, China
Yanjie Wei
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
Shengzhong Feng
Guangdong Institute of Intelligence Science and Technology, China
Yi Pan
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China