🤖 AI Summary
To address the performance and energy-efficiency bottlenecks of General Matrix Multiplication (GEMM) on heterogeneous hardware, this work proposes a fine-grained adaptive mixed-precision framework. Unlike conventional coarse-grained approaches that fix precision per layer or per tensor, it dynamically selects the optimal numerical precision (e.g., FP16/FP32/FP64) at the block level, tightly coupling precision selection with hardware-aware block scheduling. The framework integrates the PaRSEC runtime to enable cross-architecture task load balancing and low-overhead precision transitions across ARM CPUs, NVIDIA GPUs, and AMD GPUs, bridging the gap between the numerical robustness an algorithm requires and the computational capability and energy-efficiency characteristics each device offers. Evaluations on supercomputing platforms—including Fugaku, Frontier, and NVIDIA A100 DGX—demonstrate up to 2.1× speedup and 1.8× energy-efficiency improvement over single-precision baselines, while preserving the numerical stability critical for scientific applications.
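The block-level precision selection described above can be illustrated with a minimal heuristic. This sketch is not from the paper; the function name, thresholds, and tolerance parameter are hypothetical, chosen only to show the idea of matching a block's dynamic range and a target relative tolerance against the limits of each IEEE format.

```python
import numpy as np

def select_block_precision(block, rel_tol=1e-6):
    """Pick the narrowest IEEE format (hypothetical heuristic, not the
    paper's policy) whose range and unit roundoff suit this block."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.float16          # all-zero block: cheapest format suffices
    # FP16 overflows above ~65504 and has unit roundoff ~4.9e-4.
    if amax < 6.5e4 and rel_tol >= 5e-4:
        return np.float16
    # FP32 unit roundoff is ~6e-8, adequate for tolerances down to ~1e-7.
    if rel_tol >= 1e-7:
        return np.float32
    return np.float64
```

A runtime such as PaRSEC could then tag each tile's task with the selected format and schedule it on whichever device executes that format most efficiently.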
📝 Abstract
General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms to leverage mixed-precision computation for improved performance and energy efficiency. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at a fine-grained tile/block level. We use the PaRSEC runtime system to balance workloads across heterogeneous architectures. Performance scales well on the ARM CPU-based Fugaku supercomputer, the NVIDIA GPU-based A100 DGX, and the AMD GPU-based Frontier supercomputer. This research aims to enhance computational efficiency and accuracy by bridging algorithmic advancements and hardware innovations, driving transformative progress across a variety of applications.
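The core pattern of a tile-level mixed-precision GEMM can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function name and tiling scheme are assumptions, and it shows only the numerical idea of multiplying tiles in a reduced precision while accumulating in FP64 to preserve stability.

```python
import numpy as np

def mixed_precision_gemm(A, B, tile=64, tile_dtype=np.float32):
    """Blocked GEMM (illustrative sketch): each tile product is computed
    in `tile_dtype`, while the accumulator C stays in FP64."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.float64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Cast tiles down, multiply in reduced precision,
                # then accumulate the partial product in FP64.
                a = A[i:i+tile, p:p+tile].astype(tile_dtype)
                b = B[p:p+tile, j:j+tile].astype(tile_dtype)
                C[i:i+tile, j:j+tile] += (a @ b).astype(np.float64)
    return C
```

Comparing the result against a full FP64 `A @ B` shows a small relative error (on the order of the reduced format's roundoff times the square root of the inner dimension), which is the trade-off the adaptive framework navigates per block.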