NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

📅 2026-03-13
🤖 AI Summary
This work addresses the lack of unified and efficient native support for Mixture-of-Experts (MoE) communication in existing systems, which typically rely on specialized communication libraries. We present the first native implementation of MoE communication within NCCL by leveraging the NCCL Device API, introducing two unified primitives—ncclEpDispatch and ncclEpCombine—that simultaneously optimize for low-latency small-batch scenarios (e.g., inference/decoding) and high-throughput large-batch workloads (e.g., training/prefill). Our design incorporates GPU-initiated RDMA, NVLink topology-aware communication, double buffering, and hierarchical scheduling to significantly enhance communication efficiency. Evaluated on multi-node H100 clusters, our approach achieves state-of-the-art low latency and demonstrates end-to-end inference acceleration when integrated into vLLM.
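The two primitives named above, ncclEpDispatch and ncclEpCombine, implement the standard MoE token-routing pattern: dispatch scatters each token to its top-k experts, and combine gathers the expert outputs and weight-sums them back per token. A minimal single-process Python sketch of that pattern follows; it is not the NCCL EP API itself (which is a device-side collective), and the routing tables and toy per-expert function are illustrative assumptions:

```python
def dispatch(num_tokens, topk_experts, num_experts):
    # 'Dispatch' phase: route each token id to every expert in its top-k list.
    buffers = [[] for _ in range(num_experts)]
    for tok, experts in enumerate(topk_experts):
        for e in experts:
            buffers[e].append(tok)
    return buffers

def combine(expert_outputs, topk_experts, topk_weights, num_tokens):
    # 'Combine' phase: gather each token's expert results and weight-sum them.
    # expert_outputs[e] maps token id -> expert e's output for that token.
    out = [0.0] * num_tokens
    for tok in range(num_tokens):
        for e, w in zip(topk_experts[tok], topk_weights[tok]):
            out[tok] += w * expert_outputs[e][tok]
    return out

# Demo: 3 tokens, 4 experts, top-2 routing (all values are made up).
tokens = [1.0, 2.0, 3.0]
topk_experts = [[0, 1], [1, 2], [0, 3]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

buffers = dispatch(len(tokens), topk_experts, 4)
# Toy expert function f_e(x) = x * (e + 1), applied only to routed tokens.
expert_outputs = [{t: tokens[t] * (e + 1) for t in buf}
                  for e, buf in enumerate(buffers)]
result = combine(expert_outputs, topk_experts, topk_weights, len(tokens))
# result == [1.5, 5.5, 3.0]
```

In the real library these two phases run as GPU-initiated communication over RDMA and NVLink; the sketch only models the routing semantics they implement.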

📝 Abstract
Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+ tokens) using hierarchical communication that aggregates tokens within NVLink domains before inter-node RDMA transmission. Both modes leverage the Device API for both intra- and inter-node communication, taking advantage of its topology awareness and optimized GPU-initiated implementation. We evaluate NCCL EP on an H100-based cluster across multi-node configurations, demonstrating competitive LL kernel performance and presenting end-to-end results with vLLM integration. By building MoE communication natively within NCCL, NCCL EP provides a supported path for expert parallelism on current and emerging NVIDIA platforms.
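The abstract's HT mode aggregates tokens within an NVLink domain before inter-node RDMA; the message-count arithmetic below illustrates why that helps. This assumes a rail-style layout where each GPU sends one merged message per remote node, which is a common scheme but not necessarily NCCL EP's exact one:

```python
def flat_internode_messages(nodes, gpus_per_node):
    # Direct all-to-all: every GPU sends one message to every GPU
    # on every other node.
    return nodes * gpus_per_node * (nodes - 1) * gpus_per_node

def hierarchical_internode_messages(nodes, gpus_per_node):
    # Hierarchical: tokens bound for the same remote node are first
    # gathered over NVLink, then sent as one aggregated RDMA message
    # per (GPU, remote node) pair -- an assumed rail layout.
    return nodes * gpus_per_node * (nodes - 1)

# On 4 nodes x 8 H100s, NVLink-domain aggregation cuts inter-node
# message count by a factor of gpus_per_node (8x):
flat = flat_internode_messages(4, 8)          # 768
hier = hierarchical_internode_messages(4, 8)  # 96
```

Fewer, larger RDMA messages amortize per-message overhead, which is why hierarchical aggregation pays off at large (4096+ token) batch sizes even though it adds an intra-node gather step.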
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Expert Parallelism
NCCL
GPU-initiated communication
MoE communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

NCCL EP
Mixture-of-Experts
Device-initiated communication
Expert Parallelism
Hierarchical all-to-all
👥 Authors
Amos Goldman (NVIDIA Corporation)
Nimrod Boker (NVIDIA Corporation)
Maayan Sheraizin (NVIDIA Corporation)
Nimrod Admoni (NVIDIA Corporation)
Artem Polyakov (NVIDIA)
Subhadeep Bhattacharya (NVIDIA Corporation)
Fan Yu (NVIDIA Corporation)
Kai Sun (NVIDIA Corporation)
Georgios Theodorakis (NVIDIA Corporation)
Hsin-Chun Yin (NVIDIA Corporation)
Peter-Jan Gootzen (NVIDIA Corporation)
Aamir Shafi (Senior Software Architect, NVIDIA)
Assaf Ravid (NVIDIA Corporation)
Salvatore Di Girolamo (NVIDIA)
Manjunath Gorentla Venkata (NVIDIA Corporation)
Gil Bloch (NVIDIA Corporation)