🤖 AI Summary
This work addresses the underutilization of modern GPU asynchronous features, such as the Tensor Memory Accelerator (TMA) and warp specialization, in existing sparse matrix-matrix multiplication (SpMM) methods, which limits performance scalability. The paper presents the first systematic exploration of GPU asynchronous architectures for SpMM, introducing two co-designed kernels tailored to structured and unstructured sparsity patterns. For structured sparsity, it builds a warp-specialized producer-consumer pipeline on the Block Compressed Sparse Row (BCSR) format that overlaps TMA data transfers with WGMMA computation. For unstructured sparsity, it proposes a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and balances load by splitting large row windows across multiple thread blocks. Experiments show that the WCSR kernel outperforms AccSpMM by 1.47× and cuSPARSE by 6.24× on SuiteSparse matrices, while the BCSR kernel delivers a 2.66× end-to-end speedup over cuDNN/cuBLAS in the prefill stage of Qwen2.5-7B at 90% block sparsity with 64K tokens.
📝 Abstract
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline that overlaps TMA data transfer with WGMMA computation using the Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup over cuDNN/cuBLAS on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.
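The load-balancing idea of splitting large row-windows across thread blocks can be sketched on the host side as follows. The abstract does not describe the WCSR layout, so this example makes its own assumptions: a row-window is a group of `window_rows` consecutive rows of a standard CSR matrix, work is measured in nonzeros, and each work item holds at most `max_nnz` of them. The function name `split_row_windows` and both parameters are hypothetical.

```python
def split_row_windows(row_ptr, window_rows, max_nnz):
    """Split CSR row-windows into work items of at most max_nnz nonzeros.

    row_ptr is a standard CSR row pointer array. Returns a list of
    (window_id, nnz_start, nnz_end) tuples; in this sketch each tuple
    stands for the work assigned to one thread block.
    """
    n_rows = len(row_ptr) - 1
    work = []
    for window_id, r0 in enumerate(range(0, n_rows, window_rows)):
        r1 = min(r0 + window_rows, n_rows)
        start, end = row_ptr[r0], row_ptr[r1]
        # A window with many nonzeros is cut into several chunks so that
        # no single work item is much heavier than the others.
        # A window with no nonzeros produces no work item at all.
        for s in range(start, end, max_nnz):
            work.append((window_id, s, min(s + max_nnz, end)))
    return work
```

For example, with row lengths 1, 1, 10, 1 (`row_ptr = [0, 1, 2, 12, 13]`), `window_rows=2`, and `max_nnz=4`, the light first window stays as one work item while the heavy second window is split into three. Chunks that share a `window_id` contribute to the same output rows, so a real kernel would need some way to combine their partial results, for instance atomic adds or a second reduction pass.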