🤖 AI Summary
Offloading BLAS/LAPACK library calls to GPUs incurs substantial overhead from explicit data movement across the CPU-GPU boundary, limiting its practicality.
Method: This paper proposes a zero-intrusion automatic offloading framework for NVIDIA Grace-Hopper’s unified memory architecture—requiring no source-code modification or recompilation. It leverages runtime BLAS interception and redirection, Unified Virtual Memory (UVM) management, NVLink Chip-to-Chip cache-coherent interconnect, and CUDA Graph-based scheduling optimization to enable transparent GPU acceleration of BLAS calls under CPU-GPU collaboration.
Contribution/Results: We present the first hardware-coherence-driven zero-copy offloading on Grace-Hopper, eliminating data migration bottlenecks inherent in conventional heterogeneous architectures. Evaluated on two quantum chemistry/physics codes, our approach achieves multi-fold speedups while preserving the original CPU-centric programming model and drastically reducing porting effort.
📝 Abstract
Porting codes to GPUs often requires major effort. While several tools exist for automatically offloading numerical libraries such as BLAS and LAPACK, they often prove impractical due to the high cost of mandatory data transfers. The new unified memory architecture in NVIDIA Grace-Hopper allows high-bandwidth, cache-coherent access to all memory from both the CPU and the GPU, potentially eliminating the bottlenecks faced on conventional architectures. This breakthrough opens up new avenues for application development and porting strategies. In this study, we introduce a new tool for automatic BLAS offload. The tool leverages the high-speed cache-coherent NVLink C2C interconnect in Grace-Hopper and enables performant GPU offload for BLAS-heavy applications with no code changes or recompilation. The tool was tested on two quantum chemistry/physics codes, and significant performance benefits were observed.