Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Computing the permanent of sparse matrices on GPUs is notoriously inefficient—particularly when the sparsity pattern is only known at runtime, preventing effective compiler optimizations. To address this, we propose an automated CUDA kernel generation framework. Our key contributions are: (1) a novel code-generation strategy that fully maps per-thread private arrays to GPU registers; (2) a low-divergence iterative scheduling scheme based on Gray codes; and (3) sparsity-pattern-aware matrix reordering and tiling, jointly optimizing register utilization and global memory access. Experiments on synthetic and real-world sparse matrices show that our approach achieves 31× and 24.9× speedup over state-of-the-art 112-core CPU algorithms, and 8× and 4.9× speedup over conventional GPU implementations, respectively—significantly advancing the performance frontier for sparse permanent computation.
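The Gray-code scheduling mentioned above comes from Ryser-style permanent computation, where successive column subsets differ in exactly one column. As a rough, plain-Python illustration of that baseline (not the paper's generated CUDA kernels), the sequential algorithm each GPU thread parallelizes looks like:

```python
def ryser_permanent(A):
    """Permanent of an n x n matrix via Ryser's formula with Gray-code
    subset enumeration: each step flips one column in or out and updates
    the accumulator array x (row sums over the included columns)."""
    n = len(A)
    x = [0.0] * n                 # the per-thread private array "x"
    total = 0.0
    g_prev = 0
    for k in range(1, 1 << n):    # all non-empty column subsets
        g = k ^ (k >> 1)          # Gray code: one column flips per step
        j = (g ^ g_prev).bit_length() - 1     # index of the flipped column
        s = 1.0 if (g >> j) & 1 else -1.0     # include (+) or exclude (-) it
        for i in range(n):
            x[i] += s * A[i][j]
        prod = 1.0
        for xi in x:
            prod *= xi
        total += -prod if bin(g).count("1") % 2 else prod
        g_prev = g
    return total if n % 2 == 0 else -total    # the (-1)^n factor
```

In the GPU setting described here, each thread walks a contiguous chunk of this Gray-code sequence; keeping the array `x` in registers rather than thread-local memory is exactly what the generated kernels aim for.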

📝 Abstract
Registers are the fastest memory components within the GPU's complex memory hierarchy, accessed by name rather than by address. They are managed entirely by the compiler through a process called register allocation, during which the compiler attempts to cache predictable data from thread-local memory into thread-private registers. Computing the permanent of a sparse matrix poses a challenge for compilers, as optimization is hindered by the unpredictable distribution of nonzero elements, which becomes known only at runtime. In this work, we employ fully-automated code generation to address this, producing highly optimized kernels tailored to the matrix's sparsity pattern. State-of-the-art permanent computation algorithms require each thread to store a private array, denoted x, of size n. We first propose a technique that stores these arrays entirely in registers, with inclusion and exclusion kernels generated for each column. To minimize control divergence and reduce the number of unique kernels within a warp, we exploit the internal structure of Gray codes, which are also used in the state-of-the-art algorithm. Our second technique reduces register pressure by utilizing both registers and global memory, and introduces a matrix ordering and partitioning strategy for greater efficiency. On synthetic matrices, this approach achieves a 31x speedup over state-of-the-art CPU implementations on 112 cores, and an 8x speedup over our traditional GPU implementation. For real-world matrices, these speedups are 24.9x and 4.9x, respectively.
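As a toy sketch of the code-generation idea in the abstract (the names `flip_col`, `x{i}`, and `A{i}_{j}` are hypothetical, not the paper's actual framework), one can emit an unrolled update routine per column that touches only that column's nonzeros and refers to each x entry as a distinct named scalar, since indexable arrays are what push thread-private data out of registers into slow local memory:

```python
def emit_flip_kernel(j, nnz_rows):
    """Emit a schematic CUDA device function for column j.

    Only the rows with a nonzero in column j are updated, and each x_i
    is a named scalar rather than an indexed array element, so the
    compiler can map the whole array to registers.
    NOTE: illustrative sketch only; the real generated inclusion and
    exclusion kernels also handle Gray-code signs and value loading.
    """
    lines = [f"__device__ __forceinline__ void flip_col{j}(double s) {{"]
    for i in nnz_rows:
        lines.append(f"    x{i} += s * A{i}_{j};")  # one statement per nonzero
    lines.append("}")
    return "\n".join(lines)
```

Because the sparsity pattern is baked into the generated source, a column with two nonzeros costs two fused multiply-adds instead of a full-length loop over n entries.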
Problem

Research questions and friction points this paper is trying to address.

Sparse Matrix
Permanents Calculation
GPU Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Matrix
Auto-Code Generation
GPU Optimization
Deniz Elbek
Department of Computer Science and Engineering, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
Kamer Kaya
Assoc. Prof., Sabancı University
High Performance Computing · Parallel Algorithms · Graph Algorithms · Cryptography