GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in current high-performance computing (HPC) systems: on OFI-based interconnects such as HPE Slingshot, GPU kernels cannot autonomously drive cross-node coordination, because existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work. On InfiniBand, GPU-initiated communication is possible but incurs unnecessary synchronization and locking overheads. The authors propose GICC, a framework that lets GPU kernels trigger NIC-level operations directly, with no host involvement on the fast path, enabling fine-grained overlap of computation and communication. GICC also introduces asynchronous resource reclamation, in which a lightweight host thread recycles NIC resources concurrently with GPU execution. On Slingshot, GICC reduces per-coordination latency by up to 229× and improves weak scaling efficiency by up to 25%; on InfiniBand, it achieves up to 1.95× lower put latency than NVSHMEM. On an industrial stencil proxy running on 64 AMD MI250X GCDs, GICC attains 42% parallel efficiency versus 35.4% for GPU-aware MPI.

📝 Abstract

Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three, GPU kernels cannot autonomously drive distributed coordination: existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations. On InfiniBand, GPU-initiated communication is possible, but current implementations incur unnecessary synchronization and locking overheads. This paper presents GICC, a framework that enables GPU kernels to directly trigger NIC-level operations without host involvement on the fast path. In stencils, GPU threads initiate halo exchanges as soon as boundary regions are computed, enabling fine-grained overlap between interior computation and boundary transfer. GICC decouples coordination semantics from data movement and introduces asynchronous resource reclamation: the NIC signals completion to both GPU and host memory, letting a lightweight host thread recycle NIC resources concurrently with GPU execution without injecting latency into the coordination path. This sustains GPU-driven coordination under finite NIC state, absent from existing OFI-based runtimes. We implement GICC on NVIDIA and AMD GPUs over InfiniBand and Slingshot. On Slingshot, GICC reduces per-coordination latency by up to 229x and improves weak scaling efficiency by up to 25%. On InfiniBand, it achieves up to 1.95x lower put latency than NVSHMEM by eliminating unnecessary locking and synchronization. On an industrial stencil proxy on 64 AMD MI250X GCDs, GPU-aware MPI incurs over 52% higher communication time than GICC, which achieves 42% parallel efficiency versus MPI's 35.4%.
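The reclamation idea in the abstract can be illustrated with a toy model. The sketch below is not GICC's API or implementation: `SlotPool`, `gpu_trigger`, `nic_complete`, `host_reclaim`, and `run` are hypothetical names, and the GPU, NIC, and host thread are simulated as plain sequential method calls rather than concurrent hardware. It only shows why a finite pool of pre-staged NIC work runs dry without recycling, and how signaling completion to both a GPU-visible counter and a host-visible queue lets the host re-stage slots.

```python
from collections import deque


class SlotPool:
    """Toy model of a finite pool of pre-staged NIC work slots (hypothetical)."""

    def __init__(self, num_slots):
        self.free = deque(range(num_slots))  # staged slots, ready to be triggered
        self.in_flight = deque()             # triggered, not yet completed
        self.gpu_completions = 0             # completion counter the GPU would poll
        self.host_completions = deque()      # completion records the host would read

    def gpu_trigger(self):
        """GPU side: fire one pre-staged operation; returns None if none is staged."""
        if not self.free:
            return None
        slot = self.free.popleft()
        self.in_flight.append(slot)
        return slot

    def nic_complete(self):
        """NIC side: finish the oldest operation and signal both GPU and host."""
        slot = self.in_flight.popleft()
        self.gpu_completions += 1
        self.host_completions.append(slot)

    def host_reclaim(self):
        """Host side: re-stage every completed slot; the GPU never waits on this call."""
        reclaimed = 0
        while self.host_completions:
            self.free.append(self.host_completions.popleft())
            reclaimed += 1
        return reclaimed


def run(num_slots, num_ops, reclaim):
    """Count how many of num_ops triggers succeed with or without host reclamation."""
    pool = SlotPool(num_slots)
    done = 0
    for _ in range(num_ops):
        if pool.gpu_trigger() is None:
            break  # the pool of pre-staged work is exhausted
        pool.nic_complete()
        if reclaim:
            pool.host_reclaim()
        done += 1
    return done
```

In this model, `run(4, 100, reclaim=False)` stops after the four staged slots are consumed, while `run(4, 100, reclaim=True)` completes all 100 operations. In a real system the host thread would run concurrently with the kernel; the sequential loop here is a simplification.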

Problem

Research questions and friction points this paper is trying to address.

GPU-initiated communication

cross-node coordination

NIC resource reclamation

host-driven progress

communication-computation overlap

Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-initiated communication

NIC offload

asynchronous resource reclamation