🤖 AI Summary
This work addresses the inefficiency of double-precision matrix multiplication on GPUs, a common bottleneck in traditional HPC applications such as MuST. We propose a source-code-transparent, tunable-precision emulation method that avoids algorithmic rewriting. Our approach integrates automatic BLAS offloading, low-bit INT8 integer computation, cache-coherent unified memory, and AI-driven adaptive precision scheduling. Crucially, it preserves the original double-precision algorithmic logic while dynamically adapting arithmetic precision and operator characteristics to hardware constraints. This establishes the first "fidelity–efficiency co-design" emulation paradigm, overcoming the limitations of conventional mixed-precision methods, which require manual algorithm refactoring. Experiments demonstrate substantial improvements in GPU utilization and execution throughput, alongside a controllable trade-off between numerical accuracy and performance. The framework provides a novel pathway for leveraging AI-accelerated hardware in scientific computing.
📝 Abstract
This study explores automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Using low-bitwidth integer units and a cache-coherent unified memory architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both the arithmetic precision and the properties of the operator, which tunable-precision emulation can address. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization, and we show that accuracy and performance can improve at the same time. This work highlights the potential of AI-driven hardware to transform HPC and advocates adaptive precision strategies in future scientific computing.
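The abstract does not detail how FP64 matrix products can be emulated with low-bit integer units. As a rough illustration of the general idea (an Ozaki-style splitting into exact integer slice products; this is a minimal sketch with illustrative names and parameters, not the authors' implementation), each FP64 matrix can be scaled to integers, decomposed into signed 8-bit digits, multiplied exactly on integer units, and recombined:

```python
# Illustrative sketch (NOT the paper's code): emulate an FP64 matrix
# multiply using exact products of signed 8-bit integer slices.
# `num_slices` is the tunable-precision knob: more slices, more accuracy.

def split_slices(A, num_slices=4, bits=8):
    """Scale A to integers, then decompose each entry into balanced
    base-2**bits digits; every digit fits a signed 8-bit register."""
    base, half = 2 ** bits, 2 ** (bits - 1)
    amax = max(abs(x) for row in A for x in row) or 1.0
    # Two headroom bits keep all digits representable in num_slices slices.
    scale = 2 ** (bits * num_slices - 2) / amax
    ints = [[round(x * scale) for x in row] for row in A]
    slices = []
    for _ in range(num_slices):
        digits = [[(v + half) % base - half for v in row] for row in ints]
        ints = [[(v - d) // base for v, d in zip(row, drow)]
                for row, drow in zip(ints, digits)]
        slices.append(digits)
    return scale, slices

def imatmul(X, Y):
    """Exact integer matmul; on real hardware this maps to INT8 units
    accumulating into wide integer registers."""
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]

def emulated_gemm(A, B, num_slices=4, bits=8):
    """Approximate A @ B as a weighted sum of integer slice products."""
    sa, SA = split_slices(A, num_slices, bits)
    sb, SB = split_slices(B, num_slices, bits)
    base = 2 ** bits
    C = [[0.0] * len(B[0]) for _ in A]
    for i, Si in enumerate(SA):
        for j, Sj in enumerate(SB):
            P = imatmul(Si, Sj)          # exact INT8 x INT8 product
            w = float(base ** (i + j)) / (sa * sb)
            for r in range(len(C)):
                for c in range(len(C[0])):
                    C[r][c] += P[r][c] * w
    return C
```

Because each slice product is exact, accuracy is governed only by the initial rounding to `bits * num_slices` bits; dropping high-order `i + j` slice pairs is one natural way to trade precision for speed, which is the kind of knob the tunable-precision emulation described above exposes.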