🤖 AI Summary
To address the memory-bandwidth bottleneck that limits performance in scientific computing, particularly in sparse matrix computations, this paper proposes a mixed-precision acceleration framework for exascale GPU supercomputing platforms. The method combines double-precision (FP64) with single- or half-precision (FP32/FP16) arithmetic, using low-precision formats in the core iterative steps while preserving numerical robustness through high-precision residual correction. It integrates an optimized GMRES solver with customized sparse matrix storage and memory-access strategies. The work presents the first end-to-end mixed-precision sparse linear solver deployed on a modern GPU-based exascale system and introduces HPG-MxP, a lightweight benchmark for mixed-precision sparse solvers. Experiments demonstrate a 1.6x speedup over an FP64-only baseline while maintaining solution accuracy, substantially improving practical throughput for memory-bound scientific simulations and offering a deployable, production-ready response to the "memory wall" challenge.
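The core idea of "low-precision inner work, high-precision residual correction" can be illustrated with a minimal iterative-refinement sketch. This is not the paper's GMRES-based solver or its sparse storage scheme; it is a simplified dense example showing how an FP32 inner solve combined with FP64 residual accumulation can recover double-precision accuracy:

```python
import numpy as np

def mixed_precision_refine(A, b, iters=20):
    """Illustrative mixed-precision iterative refinement (a simplified
    stand-in for the paper's GMRES-IR approach): the inner solve runs in
    FP32, while residuals and the solution are accumulated in FP64."""
    A32 = A.astype(np.float32)          # low-precision copy for inner solves
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(iters):
        r = b - A @ x                   # residual computed in FP64
        # Inner solve for the correction, done entirely in FP32
        d = np.linalg.solve(A32, r.astype(np.float32))
        x += d.astype(np.float64)       # correction accumulated in FP64
    return x
```

The memory-bandwidth argument is that the inner solve, which dominates the work, reads matrix data at half the byte count of FP64, while the cheap outer residual step restores full accuracy for well-conditioned problems.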
📝 Abstract
Mixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for artificial intelligence (AI) on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low precision formats such as FP16. However, a majority of scientific simulation applications are memory bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear.
The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of an HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show for the first time a speedup of 1.6x using a combination of double- and single-precision on modern GPU-based supercomputers.