An Extensible Software Transport Layer for GPU Networking

📅 2025-04-24
🤖 AI Summary
To address the flow collisions and degraded collective communication performance that hard-to-evolve RDMA network transport causes in ML workloads, this paper proposes UCCL, an extensible software transport layer for GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and runs the control-path transport on host CPUs, so the transport protocol can evolve in software. On top of this, it provides a multipath transport that avoids the flow collisions of single-path RDMA traffic. ML collectives running atop UCCL achieve up to 3.3× higher performance than an industry solution.

📝 Abstract
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility brings in transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport to resolve flow collisions. ML collectives atop UCCL achieve up to 3.3x higher performance compared to an industry solution.
Problem

Research questions and friction points this paper is trying to address.

Addresses networking requirements for fast-evolving ML workloads
Resolves flow collisions in single-path RDMA traffic
Enhances ML collective communication performance via software transport
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extensible software transport layer UCCL
Decouples RDMA NIC data and control paths
Multipath transport resolves flow collisions
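
The flow-collision problem and the multipath remedy can be illustrated with a toy model. The sketch below is not UCCL's algorithm or API. It assumes equal-cost network paths, equal-size flows split into chunks, and uniformly random path selection as a stand-in for hash-based path choice. The function name `max_path_load` and all of its parameters are hypothetical.

```python
import random
from collections import Counter


def max_path_load(num_flows, num_paths, chunks_per_flow, paths_per_flow, seed=0):
    """Return the load on the busiest path, in chunks (toy model, not UCCL).

    Each flow sends `chunks_per_flow` equal-size chunks. It picks
    `paths_per_flow` distinct paths at random (an assumed stand-in for
    hash-based path selection) and spreads its chunks over them round-robin.
    """
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_flows):
        chosen = rng.sample(range(num_paths), paths_per_flow)
        for chunk in range(chunks_per_flow):
            load[chosen[chunk % paths_per_flow]] += 1
    return max(load.values())


if __name__ == "__main__":
    # Single-path: every flow is pinned to one path, so two flows that pick
    # the same path double that path's load.
    single = max_path_load(num_flows=8, num_paths=8,
                           chunks_per_flow=64, paths_per_flow=1)
    # Multipath: every flow sprays its chunks over all eight paths.
    multi = max_path_load(num_flows=8, num_paths=8,
                          chunks_per_flow=64, paths_per_flow=8)
    print("busiest path, single-path:", single)
    print("busiest path, multipath:  ", multi)
```

In this model the multipath case always loads every path equally, while the single-path case is only as good as its luck in avoiding collisions. Because a collective finishes only when its slowest flow finishes, the busiest path bounds completion time.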
Authors

Yang Zhou (UC Berkeley, UC Davis)
Zhongjie Chen (Tsinghua University)
Ziming Mao (UC Berkeley): Distributed Systems, Big Data, AI Systems
ChonLam Lao (Harvard University)
Shuo Yang (UC Berkeley)
Pravein G. Kannan (IBM Research)
Jiaqi Gao (Alibaba Group)
Yilong Zhao (Ph.D. student, UC Berkeley): Computer System, Microarchitecture, Machine Learning System
Yongji Wu (UC Berkeley): Machine Learning Systems, Datacenter Networks
Kaichao You (PhD student at Tsinghua University): domain adaptation, transfer learning, deep learning
Fengyuan Ren (Tsinghua University)
Zhiying Xu (Amazon Web Services)
C. Raiciu (University Politehnica of Bucharest & Broadcom)
Ion Stoica (Professor of Computer Science, UC Berkeley): Cloud Computing, Networking, Distributed Systems, Big Data