Anatomy of High-Performance Column-Pivoted QR Decomposition

📅 2025-07-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Column-pivoted QR decomposition (QRCP) performs poorly on modern CPU and GPU architectures because of suboptimal memory access patterns, limited exploitation of parallelism, and a lack of hardware-specific optimization. Method: This paper introduces a configurable, hierarchical algorithmic framework for QRCP. It enables modular composition of core kernels, combining column pivoting strategies, hardware-aware parallel task scheduling, and optimizations tailored to AMD EPYC CPUs and NVIDIA H100 GPUs. Contribution/Results: On a dual-socket AMD EPYC 9734 system, the framework achieves speedups of up to two orders of magnitude (roughly 100×) over LAPACK's standard QRCP routine and greatly surpasses the current state-of-the-art randomized QRCP algorithm. On an NVIDIA H100 GPU, it attains approximately 65% of the performance of cuSOLVER's unpivoted QR factorization. The implementation is open source and part of the RandLAPACK library, providing an efficient and flexible QRCP for large-scale numerical linear algebra on both CPUs and GPUs.

📝 Abstract
We introduce an algorithmic framework for performing QR factorization with column pivoting (QRCP) on general matrices. The framework enables the design of practical QRCP algorithms through user-controlled choices for the core subroutines. We provide a comprehensive overview of how to navigate these choices on modern hardware platforms, offering detailed descriptions of alternative methods for both CPUs and GPUs. The practical QRCP algorithms developed within this framework are implemented as part of the open-source RandLAPACK library. Our empirical evaluation demonstrates that, on a dual AMD EPYC 9734 system, the proposed method achieves performance improvements of up to two orders of magnitude over LAPACK's standard QRCP routine and greatly surpasses the performance of the current state-of-the-art randomized QRCP algorithm. Additionally, on an NVIDIA H100 GPU, our method attains approximately 65 percent of the performance of cuSOLVER's unpivoted QR factorization.
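For readers unfamiliar with the factorization itself, the following is a minimal sketch of what QRCP computes: a column order `piv`, an orthonormal `Q`, and an upper-triangular `R` with `A[:, piv] ≈ Q @ R` and diagonal magnitudes of `R` that do not increase. It uses a textbook modified Gram-Schmidt loop with greedy pivoting in NumPy. The function name `qrcp_mgs` is hypothetical, and this is neither the paper's algorithm nor RandLAPACK code; it only illustrates the output that any QRCP routine is expected to produce.

```python
import numpy as np

def qrcp_mgs(A):
    """Textbook QRCP via modified Gram-Schmidt with greedy column pivoting.

    Returns Q (m x k), R (k x n), piv (length n) with A[:, piv] ~= Q @ R,
    where k = min(m, n). Illustrative only; not a high-performance kernel.
    """
    W = np.array(A, dtype=float)   # working copy holding column residuals
    m, n = W.shape
    k = min(m, n)
    Q = np.zeros((m, k))
    R = np.zeros((k, n))
    piv = np.arange(n)
    for j in range(k):
        # Pick the remaining column with the largest residual norm.
        p = j + int(np.argmax(np.linalg.norm(W[:, j:], axis=0)))
        # Swap columns j and p everywhere they are tracked.
        W[:, [j, p]] = W[:, [p, j]]
        R[:, [j, p]] = R[:, [p, j]]
        piv[[j, p]] = piv[[p, j]]
        R[j, j] = np.linalg.norm(W[:, j])
        if R[j, j] == 0.0:
            break                  # all remaining residuals are zero
        Q[:, j] = W[:, j] / R[j, j]
        # Project the new direction out of the trailing columns.
        R[j, j + 1:] = Q[:, j] @ W[:, j + 1:]
        W[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, piv

# Small usage example on a random tall matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
Q, R, piv = qrcp_mgs(A)
print("reconstruction error:", np.linalg.norm(A[:, piv] - Q @ R))
print("|diag(R)|:", np.abs(np.diag(R)))
```

In practice one would call an optimized routine instead of a loop like this, for example LAPACK's GEQP3 (reachable from Python through `scipy.linalg.qr(A, pivoting=True)`), which is the kind of baseline the abstract compares against.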
Problem

Research questions and friction points this paper is trying to address.

Designing efficient QR factorization with column pivoting
Optimizing QRCP algorithms for modern CPUs and GPUs
Achieving significant performance improvements over existing QRCP methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithmic framework for QRCP on general matrices
User-controlled core subroutines for hardware optimization
High-performance implementation in RandLAPACK library
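To make the idea of user-controlled core subroutines more concrete, here is a hedged sketch of one common pattern in randomized QRCP: compress the matrix with a random sketch, choose the column order on the small sketch, then run an unpivoted QR on the reordered matrix. Each stage is passed in as a callable so it can be swapped. All names here (`gaussian_sketch`, `greedy_pivots`, `sketched_qrcp`, `oversample`) are hypothetical and chosen for illustration; this is not RandLAPACK's API and is not claimed to be the blocked algorithm the paper develops.

```python
import numpy as np

def gaussian_sketch(A, d, rng):
    """Compress the m rows of A down to d rows with a Gaussian matrix."""
    return rng.standard_normal((d, A.shape[0])) @ A

def greedy_pivots(S):
    """Choose a column order for S by greedy residual-norm pivoting."""
    W = np.array(S, dtype=float)
    piv = np.arange(W.shape[1])
    for j in range(min(W.shape)):
        p = j + int(np.argmax(np.linalg.norm(W[:, j:], axis=0)))
        W[:, [j, p]] = W[:, [p, j]]
        piv[[j, p]] = piv[[p, j]]
        nrm = np.linalg.norm(W[:, j])
        if nrm == 0.0:
            break
        q = W[:, j] / nrm
        # Remove the chosen direction from the trailing columns.
        W[:, j + 1:] -= np.outer(q, q @ W[:, j + 1:])
    return piv

def sketched_qrcp(A, sketch=gaussian_sketch, choose_pivots=greedy_pivots,
                  unpivoted_qr=np.linalg.qr, oversample=4, seed=0):
    """Sketch, pick pivots on the sketch, then do an unpivoted QR.

    The three callables are the swappable "core subroutines" of this toy.
    Returns Q, R, piv with A[:, piv] ~= Q @ R.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    d = min(m, n + oversample)      # sketch height (assumed heuristic)
    S = sketch(A, d, rng)           # small d x n proxy for A
    piv = choose_pivots(S)          # pivot decisions made on the proxy
    Q, R = unpivoted_qr(A[:, piv])  # bulk of the work, no pivoting needed
    return Q, R, piv

# Usage: a 200 x 20 matrix whose last 10 columns are scaled down.
rng = np.random.default_rng(42)
A = rng.standard_normal((200, 20))
A[:, 10:] *= 1e-6
Q, R, piv = sketched_qrcp(A)
print("first ten pivots:", piv[:10])
print("reconstruction error:", np.linalg.norm(A[:, piv] - Q @ R))
```

The design point is that the expensive step becomes an unpivoted QR, which maps well onto optimized library kernels, while pivoting decisions are made on a much smaller matrix. Swapping `unpivoted_qr` or `choose_pivots` for a different routine changes the performance profile without changing the calling code.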
Maksim Melnichenko
Innovative Computing Laboratory, University of Tennessee, Knoxville
Riley Murray
Sandia National Laboratories
William Killian
NVIDIA
James Demmel
University of California, Berkeley
Michael W. Mahoney
International Computer Science Institute (ICSI)
Piotr Luszczek
University of Tennessee
High Performance Computing, Performance Evaluation and Benchmarking, Numerical Linear Algebra
Mark Gates
University of Tennessee
scientific computing, linear algebra, digital volume correlation