Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme

📅 2025-11-15

🤖 AI Summary

This work addresses the challenge of emulating double-precision (FP64) matrix multiplication (GEMM) efficiently and reliably on low-precision Tensor Cores. It proposes Automatic Dynamic Precision (ADP), a framework that extends Ozaki-style decomposition with unsigned integer slicing, runtime heuristics, and exception handling with fallback to native FP64. ADP introduces the Exponent Span Capacity (ESC) estimator, which conservatively chooses the number of slices needed for FP64-level accuracy, and it runs entirely on the GPU without host-device synchronization. On challenging inputs from recently proposed BLAS grading tests, ADP preserves FP64 fidelity with less than 10% runtime overhead. In a 55-bit mantissa setting, it achieves up to 2.3× and 13.2× speedups over native FP64 GEMM on the NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively.

📝 Abstract

The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput than traditional FP64 pipelines. This hardware shift has sparked a new line of algorithm research: using low-precision units to emulate double-precision accuracy through schemes such as Ozaki decompositions. We advance this direction with Automatic Dynamic Precision (ADP), a fully GPU-resident framework that makes emulated FP64 matrix multiplication both efficient and reliable. At its core is the Exponent Span Capacity (ESC), a hardware-agnostic estimator that conservatively determines the decomposition parameter (the number of slices) required to achieve FP64-level accuracy. Built on ESC, ADP integrates exception handling, runtime heuristics, and seamless fallback to native FP64, ensuring correctness without host-device synchronization or user intervention. We further improve Ozaki-style decompositions with an unsigned integer slicing scheme, which increases representational efficiency and reduces computational waste. Validated against recently proposed BLAS grading tests, ADP consistently preserves FP64 fidelity on challenging inputs while incurring less than 10% runtime overhead. In a 55-bit mantissa setting, our approach achieves up to 2.3x and 13.2x speedups over native FP64 GEMM on NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively. Our results demonstrate that low-precision accelerators can serve as a practical, production-ready foundation for high-fidelity and high-performance scientific computing workloads.
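
To make the idea of an Ozaki-style decomposition concrete, the following is a minimal sketch in plain Python, not the ADP implementation. It scales each row of A and each column of B by a shared power of two, converts the entries to fixed-point integers, cuts those integers into narrow slices, and rebuilds the product from exact integer slice products. The function names (`split_rows`, `ozaki_matmul`), the choice of 8 slices with 7 magnitude bits each, and the decision to multiply every pair of slices are all assumptions made for illustration; on a GPU each slice product would be a low-precision Tensor Core GEMM.

```python
import math

def split_rows(M, num_slices, bits):
    """Slice each row of M into `num_slices` small signed integers per entry.

    Returns (exps, slices) where every |x| in row i is below 2**exps[i] and
    slices[p][i][t] is the p-th slice (most significant first) of M[i][t].
    """
    total = num_slices * bits
    mask = (1 << bits) - 1
    exps = []
    slices = [[] for _ in range(num_slices)]
    for row in M:
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax else 0  # amax < 2**e
        exps.append(e)
        # Fixed-point value with `total` fractional bits; int() truncates, so
        # entries far smaller than the row maximum lose low-order bits.
        fixed = [int(math.ldexp(x, total - e)) for x in row]
        for p in range(num_slices):
            shift = total - (p + 1) * bits
            slices[p].append(
                [(1 if q >= 0 else -1) * ((abs(q) >> shift) & mask) for q in fixed]
            )
    return exps, slices

def ozaki_matmul(A, B, num_slices=8, bits=7):
    """Emulated matrix product built from exact integer slice products."""
    n, k, m = len(A), len(B), len(B[0])
    total = num_slices * bits
    ea, As = split_rows(A, num_slices, bits)
    Bt = [list(col) for col in zip(*B)]        # slice B column by column
    eb, Bs = split_rows(Bt, num_slices, bits)
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0  # exact integer accumulator
            for p in range(num_slices):
                for q in range(num_slices):
                    dot = sum(As[p][i][t] * Bs[q][j][t] for t in range(k))
                    acc += dot << (2 * total - (p + q + 2) * bits)
            # Undo the fixed-point scaling and the row/column exponents.
            C[i][j] = math.ldexp(float(acc), ea[i] + eb[j] - 2 * total)
    return C
```

For example, `ozaki_matmul([[1.0, 2.5], [-0.75, 3.0]], [[0.5, -1.25], [4.0, 0.125]])` reproduces the exact product because every input fits comfortably within the 56 fixed-point bits. Inputs whose exponents vary widely within a row or column would need more slices, which is the quantity an estimator such as ESC is meant to determine.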

Problem

Research questions and friction points this paper is trying to address.

Achieving FP64 accuracy using low-precision Tensor Cores

Developing hardware-agnostic decomposition parameter estimation

Ensuring computational correctness without user intervention
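
The source does not define how ESC is computed, so the following is only a hypothetical stand-in that shows the kind of decision such an estimator and fallback path could make. It assumes a fixed-point slicing in which an entry whose exponent sits `d` bits below the largest entry in its row or column loses `d` low-order bits, so the slice budget must grow with the exponent span. The names `exponent_span` and `choose_path`, the `max_slices` limit of 12, and the 8-bit slice width are illustrative assumptions, and the logic runs on the host here purely for readability.

```python
import math

def exponent_span(values):
    """Largest minus smallest binary exponent among the nonzero entries."""
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    return max(exps) - min(exps) if exps else 0

def choose_path(A, B, slice_bits=8, max_slices=12, mantissa_bits=53):
    """Return ("emulated", slices) or ("native_fp64", None).

    Hypothetical heuristic, not the paper's ESC: keep `mantissa_bits` bits
    for the smallest entry of every row of A and every column of B.
    """
    flat = [x for M in (A, B) for row in M for x in row]
    # Exceptional values cannot be sliced into integers, so fall back.
    if any(math.isnan(x) or math.isinf(x) for x in flat):
        return ("native_fp64", None)
    span = max(
        max(exponent_span(row) for row in A),
        max(exponent_span(col) for col in zip(*B)),
    )
    # One extra bit accounts for the sign.
    needed = math.ceil((mantissa_bits + span + 1) / slice_bits)
    if needed > max_slices:
        return ("native_fp64", None)  # emulation would cost too many slices
    return ("emulated", needed)
```

With well-scaled inputs such as `choose_path([[1.0, 2.0]], [[1.0], [3.0]])` the sketch selects emulation with 7 slices, while a row like `[1.0, 1e-30]` or any NaN pushes it to the native FP64 path.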

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic Dynamic Precision framework for GPU-resident emulation

Exponent Span Capacity estimator determines decomposition parameters

Unsigned integer slicing scheme improves representational efficiency
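
One plausible reading of why unsigned slicing is more efficient can be sketched as follows; the exact scheme used in the paper may differ. If every 8-bit slice carries its own sign, only 7 bits per slice hold magnitude. If instead only the leading slice is signed and the remaining slices are unsigned digits, each of those carries a full 8 bits. The helper names `slices_needed`, `split_unsigned`, and `combine` are hypothetical.

```python
import math

def slices_needed(mantissa_bits, slice_bits=8, unsigned=True):
    """Number of slices required to hold a signed `mantissa_bits`-bit value."""
    if unsigned:
        # A single sign bit in total, carried by the leading slice.
        return math.ceil((mantissa_bits + 1) / slice_bits)
    # Every slice spends one of its bits on the sign.
    return math.ceil(mantissa_bits / (slice_bits - 1))

def split_unsigned(q, count, slice_bits=8):
    """Split integer q into `count` slices, most significant first.

    The leading slice is signed; all others lie in [0, 2**slice_bits).
    """
    digits = []
    for _ in range(count - 1):
        q, r = divmod(q, 1 << slice_bits)  # floor division keeps r non-negative
        digits.append(r)
    digits.append(q)  # whatever remains is the signed leading slice
    return digits[::-1]

def combine(digits, slice_bits=8):
    """Inverse of split_unsigned."""
    q = 0
    for d in digits:
        q = (q << slice_bits) + d
    return q
```

Under these assumptions a 55-bit mantissa needs 8 slices when each slice is signed but only 7 with unsigned trailing slices. If every pair of slices were multiplied, that would be 49 slice products instead of 64.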
