🤖 AI Summary
Hardware heterogeneity in exascale supercomputers, together with the underexploited performance of low-precision computation, hinders efficient acceleration of structured linear algebra. Method: This paper proposes an FFT-accelerated mixed-precision GPU algorithm tailored to block-triangular Toeplitz matrices. It introduces a dynamic mixed-precision framework that uses Pareto-front analysis to achieve optimal accuracy–performance trade-offs. Via Hipify-based source-level porting, deep ROCm-ecosystem optimization, and rocBLAS integration, the work achieves the first seamless migration and extension of a CUDA-only FFT algorithm to AMD GPUs (MI250X/MI300X/MI355X). Contributions/Results: (1) high-performance, cross-architecture (especially AMD) mixed-precision portability; (2) strong scaling to 2,048 GPUs on the OLCF Frontier supercomputer; (3) zero-code-modification deployment, significantly improving computational efficiency and applicability across diverse HPC platforms.
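The Pareto-front step mentioned above can be illustrated with a short sketch (ours, not the paper's code): given measured (error, runtime) pairs for candidate precision configurations, keep the non-dominated set, then pick the fastest configuration that meets a desired error tolerance. The configuration names and numbers below are purely illustrative.

```python
def pareto_front(configs):
    """Return the non-dominated (name, error, time) configurations.

    One config dominates another if it is no worse in both error and
    runtime and strictly better in at least one.
    """
    return [
        (name, err, t)
        for name, err, t in configs
        if not any(
            e2 <= err and t2 <= t and (e2 < err or t2 < t)
            for _, e2, t2 in configs
        )
    ]


def pick_config(configs, tol):
    """Fastest Pareto-optimal configuration meeting the error tolerance."""
    feasible = [c for c in pareto_front(configs) if c[1] <= tol]
    return min(feasible, key=lambda c: c[2]) if feasible else None


# Illustrative measurements: (name, relative error, runtime in seconds)
configs = [
    ("fp64",       1e-15, 4.0),
    ("fp32",       1e-7,  2.1),
    ("fp16-mixed", 1e-4,  1.2),
    ("fp64-slow",  1e-15, 5.0),  # dominated by fp64: same error, slower
]
```

With these numbers, a tolerance of 1e-6 selects `fp32`: `fp64` also satisfies the tolerance but is slower, while `fp16-mixed` is faster but too inaccurate.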
📝 Abstract
The hardware diversity of leadership-class computing facilities, alongside the immense performance boosts exhibited by today's GPUs when computing in lower precision, provides a strong incentive for scientific HPC workflows to adopt mixed-precision algorithms and performance portability models. We present an on-the-fly framework using Hipify for performance portability and apply it to FFTMatvec, an HPC application that computes matrix-vector products with block-triangular Toeplitz matrices. Our approach enables FFTMatvec, initially a CUDA-only application, to run seamlessly on AMD GPUs with excellent observed performance. Performance optimizations for AMD GPUs are integrated directly into the open-source rocBLAS library, keeping the application code unchanged. We then present a dynamic mixed-precision framework for FFTMatvec; a Pareto front analysis determines the optimal mixed-precision configuration for a desired error tolerance. Results are shown for AMD Instinct MI250X, MI300X, and the newly launched MI355X GPUs. The performance-portable, mixed-precision FFTMatvec scales to 2,048 GPUs on the OLCF Frontier supercomputer.
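The operation FFTMatvec accelerates can be illustrated with a minimal NumPy sketch (our illustration, not the paper's implementation): a Toeplitz matrix embeds in a circulant matrix of twice the size, so its matvec reduces to a circular convolution computed in O(n log n) with the FFT; the paper's block-triangular case applies this idea blockwise.

```python
import numpy as np


def toeplitz_matvec(c, r, x):
    """Multiply an n-by-n Toeplitz matrix T by x via the FFT.

    T is defined by its first column c and first row r (with r[0] == c[0]).
    T is embedded in a circulant matrix of size 2n whose matvec is a
    circular convolution; the first n entries of the result equal T @ x.
    """
    n = len(x)
    # First column of the circulant embedding: [c, 0, r reversed (tail only)]
    col = np.concatenate([c, [0.0], r[:0:-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x, len(col)))
    return y[:n].real
```

For the (lower-)triangular Toeplitz matrices the paper targets, the first row is zero except for its leading entry, and the matvec is exactly a causal discrete convolution.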