AI Summary
To address performance bottlenecks of sparse matrix-vector multiplication (SpMV) on GPUs under large-scale, highly sparse workloads, this paper proposes a co-optimization framework combining nonlinear hashing-based matrix reordering with 2D blocking. We pioneer the use of nonlinear hash mapping for structural reordering and introduce a lightweight Hash-based Partition (HBP) storage format that jointly exploits hash-induced clustering and 2D block locality. Furthermore, we design a contention-aware parallel load-balancing mechanism to significantly reduce preprocessing overhead. Experimental results show that our preprocessing phase achieves 3.53× and 3.67× speedups over conventional sorting and Regu2D dynamic programming, respectively. In SpMV computation, HBP delivers up to 3.32× and 3.01× acceleration over CSR on Jetson AGX Orin and RTX 4090, respectively. The method thus bridges the gap between efficient preprocessing and high-throughput GPU SpMV execution for extreme sparsity regimes.
Abstract
Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of sparse matrices often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both the Nvidia Jetson AGX Orin and the Nvidia RTX 4090, show that in the pre-processing step our method offers an average speedup of 3.53 times over the sorting approach and 3.67 times over the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on the Orin and 3.01 times on the RTX 4090 against the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
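To make the central idea concrete, here is a minimal illustrative sketch (not the paper's actual HBP algorithm; the hash function and bucket count are assumptions for demonstration) of hash-based row reordering: each row's sparsity pattern is hashed so that rows touching similar column blocks land in the same bucket, which clusters structurally similar rows without a full sort.

```python
# Illustrative sketch only: hash-based reordering of CSR rows so that rows
# with similar column-block footprints become adjacent, improving locality
# for subsequent 2D-blocked SpMV. The specific hash is a stand-in.

def pattern_hash(cols, num_cols, num_buckets=16):
    """Map a row's column indices to a bucket via a simple nonlinear hash
    of the coarse column blocks the row touches."""
    block = max(1, num_cols // num_buckets)
    sig = 0
    for c in cols:
        sig |= 1 << min(c // block, 63)      # bitmask of touched column blocks
    return (sig * 2654435761) % num_buckets  # Knuth-style multiplicative hash

def hash_reorder(row_ptr, col_idx, num_cols, num_buckets=16):
    """Return a row permutation grouping rows with similar sparsity patterns."""
    buckets = [[] for _ in range(num_buckets)]
    for r in range(len(row_ptr) - 1):
        cols = col_idx[row_ptr[r]:row_ptr[r + 1]]
        buckets[pattern_hash(cols, num_cols, num_buckets)].append(r)
    return [r for b in buckets for r in b]

# Tiny example: a 4x8 matrix in CSR form where rows 0,2 and rows 1,3
# share the same column pattern.
row_ptr = [0, 2, 4, 6, 8]
col_idx = [0, 1, 6, 7, 0, 1, 6, 7]
perm = hash_reorder(row_ptr, col_idx, num_cols=8, num_buckets=4)
print(perm)  # -> [1, 3, 0, 2]: rows with matching patterns end up adjacent
```

Because bucketing is a single linear pass over the rows, it avoids the O(n log n) cost of pattern sorting, which is the kind of saving the reported pre-processing speedups reflect.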