A Nonlinear Hash-based Optimization Method for SpMV on GPUs

๐Ÿ“… 2025-04-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address performance bottlenecks of sparse matrixโ€“vector multiplication (SpMV) on GPUs under large-scale, highly sparse workloads, this paper proposes a co-optimization framework combining nonlinear hashing-based matrix reordering with 2D blocking. We pioneer the use of nonlinear hash mapping for structural reordering and introduce a lightweight Hash-based Partition (HBP) storage format that jointly exploits hash-induced clustering and 2D block locality. Furthermore, we design a contention-aware parallel load-balancing mechanism to significantly reduce preprocessing overhead. Experimental results show that our preprocessing phase achieves 3.53ร— and 3.67ร— speedup over conventional sorting and Regu2D dynamic programming, respectively. In SpMV computation, HBP delivers up to 3.32ร— and 3.01ร— acceleration over CSR on Jetson AGX Orin and RTX 4090, respectively. The method thus bridges the gap between efficient preprocessing and high-throughput GPU SpMV execution for extreme sparsity regimes.
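The summary above mentions 2D blocking as one half of the co-optimization. The paper's exact tiling scheme is not given here, so the following is only a minimal illustrative sketch of the general idea: bucketing COO-format nonzeros into fixed-size 2D tiles so that all nonzeros touching the same slice of the input vector are stored together (the tile size and data layout are assumptions, not the paper's HBP format).

```python
from collections import defaultdict

def blocks_2d(rows, cols, vals, block=2):
    """Bucket COO nonzeros into (block x block) tiles. Nonzeros in one tile
    share a slice of x, which is what gives 2D blocking its locality benefit."""
    tiles = defaultdict(list)
    for r, c, v in zip(rows, cols, vals):
        tiles[(r // block, c // block)].append((r, c, v))
    return dict(tiles)

# 4x4 matrix with four scattered nonzeros
rows = [0, 0, 1, 3]
cols = [0, 3, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
tiles = blocks_2d(rows, cols, vals, block=2)
# (0,0) and (1,1) fall in tile (0,0); (0,3) in tile (0,1); (3,2) in tile (1,1)
```

On a GPU, each tile would typically map to a thread block that stages its slice of x in shared memory.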

๐Ÿ“ Abstract
Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and high sparsity of real-world matrices often make SpMV a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times over the sorting approach and 3.67 times over the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 against the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
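The CSR format mentioned as the baseline stores a matrix as three arrays: `indptr` (row start offsets), `indices` (column indices of nonzeros), and `data` (nonzero values). A minimal reference SpMV over this layout, using a toy 3x3 matrix for illustration:

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a CSR-format sparse matrix A."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for row in range(n_rows):
        # Accumulate products over this row's nonzeros.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# 3x3 example:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(indptr, indices, data, x))  # [3. 3. 9.]
```

On GPUs, the per-row loop becomes the source of the load imbalance this paper targets: rows with very different nonzero counts map poorly onto uniform thread groups.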
Problem

Research questions and friction points this paper is trying to address.

SpMV on GPUs is bottlenecked by the large scale and irregular sparsity of real-world matrices
Existing reordering pre-processing (sorting, or Regu2D's dynamic programming) is itself expensive
Uneven nonzero distribution causes load imbalance across matrix blocks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hash-based Partition (HBP) format for SpMV
Parallel load balancing across matrix blocks
Hash transformation groups similar elements
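The paper's core claim, per the abstract, is that a hash transformation can group rows with similar sparsity patterns in a single linear pass, replacing a full sort during reordering. The actual nonlinear hash is not specified in this summary, so the band-signature below is a made-up stand-in purely to illustrate the grouping idea (function names and the banding scheme are assumptions, not the paper's HBP design):

```python
import numpy as np

def row_signature(cols, n_cols, n_bands=4):
    """Toy hash: map each column index to one of n_bands bands and OR the
    band bits together; rows with similar column patterns share a signature."""
    sig = 0
    for c in cols:
        sig |= 1 << (c * n_bands // n_cols)
    return sig

def hash_group_rows(indptr, indices, n_cols):
    """Group rows by hash signature in one O(nnz) pass, instead of fully
    sorting rows by pattern (the costly pre-processing step HBP avoids)."""
    buckets = {}
    for row in range(len(indptr) - 1):
        cols = indices[indptr[row]:indptr[row + 1]]
        buckets.setdefault(row_signature(cols, n_cols), []).append(row)
    # Concatenating the buckets yields the new row order.
    return [r for sig in sorted(buckets) for r in buckets[sig]]

# 3x8 example: rows 0 and 2 touch only left-hand columns, row 1 only right-hand
indptr  = np.array([0, 2, 4, 5])
indices = np.array([0, 1, 6, 7, 0])
perm = hash_group_rows(indptr, indices, n_cols=8)  # rows 0 and 2 end up adjacent
```

The point of the sketch is the complexity trade-off: bucketing is linear in the number of nonzeros, whereas pattern-sorting pays an extra logarithmic factor in the number of rows.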
๐Ÿ”Ž Similar Papers
No similar papers found.
Chen Yan
Associate Professor, Zhejiang University, College of EE
CPS Security · Embedded System Security · Sensor Security
Boyu Diao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Hangda Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zhulin An
Institute of Computing Technology, Chinese Academy of Sciences
Automatic Deep Learning · Lifelong Learning
Yongjun Xu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China