UCCL-EP: Portable Expert-Parallel Communication

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing expert-parallelism (EP) communication systems (e.g., DeepEP) rely on GPU-initiated RDMA, which makes them poorly portable across heterogeneous GPU/NIC platforms (e.g., AMD GPUs with Broadcom NICs). This work proposes a portable EP communication framework that eliminates GPU-to-NIC direct writes, instead employing a high-throughput GPU-CPU control channel and multithreaded CPU proxies that issue GPUDirect RDMA on the GPUs' behalf. It introduces the first RDMA immediate-data-based mechanism for emulating the communication ordering semantics EP requires, enabling high-performance, portable EP communication across vendor-diverse hardware, including NVIDIA and AMD GPUs, and Amazon EFA and Broadcom NICs. Experiments demonstrate: (i) up to 2.1× higher dispatch and combine throughput on EFA versus the state of the art; (ii) performance matching DeepEP on NVIDIA-only platforms; (iii) up to 40% higher token throughput in SGLang; and (iv) up to 45% higher DeepSeek-V3 training throughput on an AMD+Broadcom platform.

📝 Abstract
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to 2.1× for dispatch and combine throughput. On an NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
Problem

Research questions and friction points this paper is trying to address.

Improves portability of expert parallelism across heterogeneous GPU and NIC platforms
Replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel
Enables correct operation on NICs lacking required ordering semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GPU-CPU control channel for portability
Emulates ordering semantics with RDMA immediate data
Achieves high performance across heterogeneous GPU and NIC platforms