🤖 AI Summary
Stencil computations on tensor cores (TCs) suffer an efficiency bottleneck caused by redundant zero-padding when stencil operators are converted into dense matrix multiplications. To address it, this paper proposes SPTCStencil, the first system to use sparse tensor cores (SpTCs) for scientific computing. Our method introduces a stride-swapping–driven sparse transformation that losslessly maps stencil operators onto the native sparse GEMM formats supported by SpTC hardware. We further design high-performance GPU kernels and system-level optimizations co-designed with the architectural features of SpTCs. This work extends sparse tensor cores beyond deep learning into stencil-based scientific simulations. Experimental evaluation shows average speedups of 5.46× over conventional CPU/GPU implementations and 2.00× over dense TC-based approaches.
📝 Abstract
Stencil computation, a pivotal numerical method in science and engineering, iteratively updates grid points using weighted contributions from neighboring points and exposes abundant parallelism for multi-core processors. Current techniques for running stencil computation on Tensor Core accelerators incur substantial overhead from redundant zero-padding during the transformation to matrix multiplication. To address this, we introduce a sparse computation paradigm that eliminates this inefficiency by exploiting specialized hardware units.
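To make the zero-padding overhead concrete, the following minimal sketch writes a 1D 3-point stencil as a dense matrix-vector product and measures how much of the matrix is padding. The weights, grid size, and function names (`stencil_direct`, `stencil_matrix`, `zero_fraction`) are illustrative assumptions, and the banded layout is a textbook formulation rather than the transformation used in the paper.

```python
# Minimal sketch: a 1D 3-point stencil expressed as a dense matrix-vector
# product. Weights and grid size are illustrative, not taken from the paper.

def stencil_direct(u, w):
    """Update interior points using weighted neighbor contributions."""
    r = len(w) // 2  # stencil radius
    return [sum(w[k] * u[i - r + k] for k in range(len(w)))
            for i in range(r, len(u) - r)]

def stencil_matrix(n, w):
    """Build the dense (n - 2r) x n banded matrix equivalent to the stencil."""
    r = len(w) // 2
    rows = []
    for i in range(n - 2 * r):
        row = [0.0] * n          # zero-padding everywhere ...
        row[i:i + len(w)] = w    # ... except the band holding the weights
        rows.append(row)
    return rows

def matvec(a, x):
    """Plain dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def zero_fraction(a):
    """Fraction of matrix entries that are padding zeros."""
    total = sum(len(row) for row in a)
    zeros = sum(1 for row in a for v in row if v == 0.0)
    return zeros / total

w = [0.25, 0.5, 0.25]
u = [float(i * i) for i in range(8)]
A = stencil_matrix(len(u), w)

# Both formulations give the same update for the interior points.
assert matvec(A, u) == stencil_direct(u, w)
print(zero_fraction(A))  # 0.625
```

Even on this 8-point grid, 62.5% of the dense matrix entries are zeros that a dense matrix unit would still multiply, and the share grows with the grid size.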
This paper treats the sparsity in these matrices as a feature and presents SPTCStencil, a high-performance stencil computation system accelerated by Sparse Tensor Cores (SpTCs). SPTCStencil is the first to harness SpTCs for acceleration beyond deep learning domains. First, our approach generalizes an efficient transformation of stencil computation into matrix multiplications and specializes this conversion for SpTC compatibility through a novel sparsification strategy. Furthermore, SPTCStencil incorporates a high-performance GPU kernel with systematic optimizations designed to maximize efficiency on SpTCs. Experimental evaluations demonstrate that SPTCStencil outperforms conventional CPU/GPU implementations by 5.46$\times$ and Tensor Core-based approaches by 2.00$\times$ on average.
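The abstract does not spell out the sparsification strategy, so the sketch below only illustrates the kind of constraint such a strategy has to satisfy. It assumes the 2:4 structured-sparsity pattern used by sparse tensor cores on recent NVIDIA GPUs, where each aligned group of four values holds at most two non-zeros. The helper names and the hand-picked column swap are hypothetical and are not SPTCStencil's algorithm.

```python
# Sketch of the 2:4 structured-sparsity constraint (assumed target format).
# The column swap below is hand-picked for illustration only; it is NOT the
# sparsification strategy from the paper.

def is_two_four(row):
    """True if every aligned group of 4 entries has at most 2 non-zeros."""
    return all(sum(1 for v in row[g:g + 4] if v != 0) <= 2
               for g in range(0, len(row), 4))

def swap_entries(values, i, j):
    """Return a copy of `values` with positions i and j exchanged."""
    out = list(values)
    out[i], out[j] = out[j], out[i]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One zero-padded row of a stencil matrix and a matching input vector.
row = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Three non-zeros fall into the first group of four, so the row is sparse
# but does not fit the 2:4 pattern as laid out.
assert not is_two_four(row)

# Moving one weight into the second group, and permuting the input the same
# way, satisfies the pattern without changing the result.
row_s = swap_entries(row, 2, 6)
x_s = swap_entries(x, 2, 6)
assert is_two_four(row_s)
assert dot(row, x) == dot(row_s, x_s)
```

The point of the example is that a stencil matrix can be mostly zeros and still violate the hardware's sparsity pattern, so some rearrangement of columns and inputs is needed before the sparse units can be used losslessly.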