Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision

📅 2025-08-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing GPU-accelerated singular value decomposition (SVD) algorithms struggle to balance portability across heterogeneous hardware platforms with efficiency across mixed-precision regimes. Method: This paper proposes a unified, QR-iteration-based GPU-accelerated SVD implementation. It is the first to support both Apple Metal GPUs and FP16 half-precision arithmetic within a single kernel, leveraging Julia's multiple dispatch and metaprogramming capabilities alongside GPUArrays/KernelAbstractions to construct a hardware-agnostic parallel abstraction layer. Results: For matrices larger than 1024×1024, the implementation outperforms MAGMA and SLATE; on large-scale matrices, it achieves 80–90% of cuSOLVER's performance. Crucially, it significantly reduces dependence on the CUDA/NVIDIA ecosystem, delivering high-efficiency, cross-platform SVD infrastructure for scientific computing and low-rank adaptation in large language models.
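The summary's central mechanism is Julia's multiple dispatch: one generic entry point can cover every supported precision, with specialized methods routing particular element types (e.g. FP16) to appropriate handling. The sketch below illustrates that idea only; `unified_svdvals` is a hypothetical name, not the paper's actual API, and the FP16 method simply widens to FP32 as a stand-in for a real half-precision kernel path.

```julia
using LinearAlgebra

# One generic method covers every floating-point element type...
unified_svdvals(A::AbstractMatrix{T}) where {T<:AbstractFloat} = svdvals(A)

# ...while a more specific method intercepts Float16 inputs, mirroring how
# a unified implementation could dispatch half precision to dedicated
# kernels. Here we just widen to Float32 for illustration.
unified_svdvals(A::AbstractMatrix{Float16}) = svdvals(Float32.(A))

# Dispatch picks the Float16 method automatically for half-precision input.
S = unified_svdvals(Float16[3 0; 0 4])   # singular values of diag(3, 4)
```

The same dispatch pattern extends to hardware: with GPUArrays/KernelAbstractions, the array type (CuArray, ROCArray, MtlArray, ...) selects the backend, so a single function body serves all platforms.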

๐Ÿ“ Abstract
This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, consisting of successive matrix reduction to band form and bidiagonal form. Our implementation leverages Julia's multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified, type- and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024×1024, and achieves 80–90% of the performance of cuSOLVER for large matrices.
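The abstract's claim that the SVD provides optimal low-rank approximations (the property LoRA-style adapters rely on) is the Eckart–Young theorem: truncating the SVD to rank k minimizes the approximation error, and the spectral-norm error equals the (k+1)-th singular value. A minimal stdlib-only refresher, not code from the paper:

```julia
using LinearAlgebra

# A small fixed matrix with singular values √17, 3, 1.
A = [4.0 0.0 0.0;
     0.0 3.0 0.0;
     0.0 0.0 1.0;
     1.0 0.0 0.0]

F = svd(A)
k = 2

# Best rank-k approximation: keep the k largest singular triplets.
Ak = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]

# Eckart–Young: the spectral-norm error is exactly the (k+1)-th
# singular value, here σ₃ = 1.
err = opnorm(A - Ak)
</imports>
```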
Problem

Research questions and friction points this paper is trying to address.

Portable GPU-accelerated SVD computation across hardware
Support for diverse GPU architectures and data types
High-performance QR-based SVD for large matrices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Portable GPU-accelerated QR-based SVD in Julia
Unified, type- and hardware-agnostic GPU function
Supports Apple Metal GPUs and half precision