GPU-Initiated Networking for NCCL

📅 2025-11-19
🤖 AI Summary
Modern AI workloads—particularly Mixture-of-Experts (MoE)—demand ultra-low-latency, fine-grained, GPU-native inter-GPU communication, which existing CPU-coordinated approaches fail to meet. Method: This paper introduces GPU-Initiated Networking (GIN), a novel communication architecture wherein GPU kernels directly initiate communication without CPU intervention. GIN features a three-layer design: an NCCL Core host interface, device-side callable CUDA APIs, and a dual-mode network plugin supporting both direct NIC access via DOCA GPUNetIO and RDMA-compatible proxy mode. Contribution/Results: GIN is the first production-grade communication library to enable fully device-initiated, fine-grained communication while maintaining seamless integration with the NCCL ecosystem. Evaluations on MoE workloads such as DeepEP demonstrate substantial latency reduction in inter-GPU communication, with full backward compatibility with NCCL collective primitives and existing infrastructure.

📝 Abstract
Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations, a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, and semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) device-side APIs for remote memory operations callable from CUDA kernels; and iii) a network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN's practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL's unified runtime, combining low-latency operations with NCCL's collective algorithms and production infrastructure.
Problem

Research questions and friction points this paper is trying to address.

Enabling GPU-initiated low-latency communication for modern AI workloads
Eliminating CPU coordination overhead in GPU-to-GPU communication
Providing device-side control for fine-grained collective operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-initiated RDMA networking eliminates CPU coordination overhead
Three-layer architecture with device-side APIs for remote memory operations
Dual semantics support with direct GPU-NIC communication and proxy backends
Authors
Khaled Hamidouche
NVIDIA Corporation
John Bachan
NVIDIA Corporation
Pak Markthub
NVIDIA Corporation
Peter-Jan Gootzen
NVIDIA Corporation
Elena Agostini
NVIDIA Corporation
Sylvain Jeaugey
NVIDIA Corporation
Aamir Shafi
Senior Software Architect, NVIDIA
High Performance Computing · Parallel Computing · High Performance Deep Learning · Big Data
Georgios Theodorakis
NVIDIA Corporation
Manjunath Gorentla Venkata
NVIDIA Corporation