🤖 AI Summary
Offloading BLAS/LAPACK library calls to GPUs incurs substantial overhead from explicit data movement across the CPU-GPU boundary, limiting its practicality.
Method: This paper proposes a zero-intrusion automatic offloading framework for NVIDIA Grace-Hopper’s unified memory architecture—requiring no source-code modification or recompilation. It leverages runtime BLAS interception and redirection, Unified Virtual Memory (UVM) management, NVLink Chip-to-Chip cache-coherent interconnect, and CUDA Graph-based scheduling optimization to enable transparent GPU acceleration of BLAS calls under CPU-GPU collaboration.
Contribution/Results: We present the first hardware-coherence-driven zero-copy offloading on Grace-Hopper, eliminating data migration bottlenecks inherent in conventional heterogeneous architectures. Evaluated on two quantum chemistry/physics codes, our approach achieves multi-fold speedups while preserving the original CPU-centric programming model and drastically reducing porting effort.
📝 Abstract
Porting codes to GPUs often requires major effort. While several tools exist for automatically offloading numerical libraries such as BLAS and LAPACK, they often prove impractical due to the high cost of mandatory data transfers. The new unified memory architecture in NVIDIA Grace-Hopper allows high-bandwidth, cache-coherent access to all memory from both the CPU and the GPU, potentially eliminating the bottlenecks faced on conventional architectures. This breakthrough opens up new avenues for application development and porting strategies. In this study, we introduce a new tool for automatic BLAS offload. The tool leverages the high-speed cache-coherent NVLink C2C interconnect in Grace-Hopper and enables performant GPU offload for BLAS-heavy applications with no code changes or recompilation. The tool was tested on two quantum chemistry/physics codes, and significant performance benefits were observed.