🤖 AI Summary
Inflexible RDMA network transport causes flow collisions and degrades collective communication performance in AI training. This paper proposes UCCL, a scalable software transport layer for GPU networking. UCCL takes a hardware-software co-designed approach: it decouples the data and control planes of RDMA NICs and migrates control logic to host CPUs, enabling flexible protocol evolution; and it introduces a multipath scheduling mechanism with traffic-balancing algorithms that avoids single-path congestion and works around inherent hardware limitations. Evaluation shows that UCCL achieves up to 3.3× higher ML collective communication throughput than state-of-the-art industry solutions, accelerating large-model training.
📝 Abstract
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility enables transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport that resolves flow collisions. ML collectives atop UCCL achieve up to 3.3× higher performance than an industry solution.
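To make the multipath idea concrete, here is a minimal, hypothetical sketch of how a software transport might spread chunks of one flow across several network paths to avoid single-path collisions. The `MultipathSender` class, the power-of-two-choices steering rule, and all names are illustrative assumptions for exposition, not UCCL's actual algorithm or API.

```python
import random


class MultipathSender:
    """Illustrative sketch (not UCCL's real implementation): steer each
    chunk of a flow to the less-loaded of two randomly sampled paths
    (power-of-two-choices), so no single path becomes a hotspot."""

    def __init__(self, num_paths: int):
        # Bytes currently in flight on each path; requires >= 2 paths.
        self.inflight = [0] * num_paths

    def pick_path(self) -> int:
        # Sample two distinct paths, keep the one with less load.
        a, b = random.sample(range(len(self.inflight)), 2)
        return a if self.inflight[a] <= self.inflight[b] else b

    def send(self, chunk_bytes: int) -> int:
        path = self.pick_path()
        self.inflight[path] += chunk_bytes
        return path

    def on_ack(self, path: int, chunk_bytes: int) -> None:
        # Control-path bookkeeping on the CPU when a chunk is acked.
        self.inflight[path] -= chunk_bytes
```

A hardware NIC pinned to one path per queue pair cannot make per-chunk decisions like this; running the control path in software is what makes such policies easy to change.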