🤖 AI Summary
This work addresses the lack of efficient solutions for computing a large batch of small-scale singular value decompositions (SVDs) on GPUs. The authors propose a GPU-accelerated batched SVD solver based on the one-sided Jacobi algorithm, co-designed with the hardware architecture to exploit fine-grained parallelism, optimize memory access patterns, and support multiple floating-point precisions. Implemented on both NVIDIA and AMD GPU platforms, the solver demonstrates strong robustness and scalability across diverse matrix shapes, condition numbers, and precision configurations. Experimental results show that the proposed method significantly outperforms existing vendor-provided libraries and open-source solvers in computational performance while maintaining numerical reliability.
📝 Abstract
The singular value decomposition (SVD) is a powerful tool in modern numerical linear algebra, underpinning computational methods such as principal component analysis (PCA), low-rank approximation, and randomized algorithms. Many practical scenarios require solving numerous small SVD problems, a regime generally referred to as "batch SVD". Existing programming models can handle this efficiently on parallel CPU architectures, but high-performance solutions for GPUs remain immature. A GPU-oriented batch SVD solver is introduced. The solver uses the one-sided Jacobi algorithm to exploit fine-grained parallelism, and a number of algorithmic and design optimizations deliver further performance gains. Starting from a baseline solver, a sequence of optimizations is applied, each yielding an incremental improvement. Numerical experiments show that the new solver is robust across problems with different numerical properties, matrix shapes, and arithmetic precisions. Benchmarks on both NVIDIA and AMD systems show significant speedups over vendor solutions as well as existing open-source solvers.