🤖 AI Summary
Inflexible RDMA network transport causes flow collisions and degrades collective communication performance in AI training. This paper proposes UCCL, a scalable software transport layer for GPU networking. UCCL takes a hardware-software co-designed approach: it decouples the data and control planes of RDMA NICs and migrates control logic to host CPUs, enabling flexible protocol evolution; and it introduces a multipath scheduling mechanism with traffic-balancing algorithms that avoids single-path congestion and works around inherent hardware limitations. Evaluation shows that UCCL achieves up to 3.3× higher ML collective communication throughput than state-of-the-art industry solutions, accelerating large-model training.
📝 Abstract
Fast-evolving machine learning (ML) workloads have increasing requirements for networking. However, host network transport on RDMA NICs is hard to evolve, causing problems for ML workloads. For example, single-path RDMA traffic is prone to flow collisions that severely degrade collective communication performance. We present UCCL, an extensible software transport layer to evolve GPU networking. UCCL decouples the data path and control path of existing RDMA NICs and efficiently runs the control-path transport on host CPUs. This software extensibility enables transport innovations that cannot be achieved in hardware for ML workloads, e.g., a multipath transport that resolves flow collisions. ML collectives atop UCCL achieve up to 3.3× higher performance than an industry solution.
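To make the multipath idea concrete, here is a minimal, hypothetical sketch of how a software transport might spread chunks of one flow across several network paths to avoid single-path collisions. The `MultipathSender` class, the power-of-two-choices steering rule, and all names are illustrative assumptions for exposition, not UCCL's actual algorithm or API.

```python
import random


class MultipathSender:
    """Illustrative sketch (not UCCL's real implementation): steer each
    chunk of a flow to the less-loaded of two randomly sampled paths
    (power-of-two-choices), so no single path becomes a hotspot."""

    def __init__(self, num_paths: int):
        # Bytes currently in flight on each path; requires >= 2 paths.
        self.inflight = [0] * num_paths

    def pick_path(self) -> int:
        # Sample two distinct paths, keep the one with less load.
        a, b = random.sample(range(len(self.inflight)), 2)
        return a if self.inflight[a] <= self.inflight[b] else b

    def send(self, chunk_bytes: int) -> int:
        path = self.pick_path()
        self.inflight[path] += chunk_bytes
        return path

    def on_ack(self, path: int, chunk_bytes: int) -> None:
        # Control-path bookkeeping on the CPU when a chunk is acked.
        self.inflight[path] -= chunk_bytes
```

A hardware NIC pinned to one path per queue pair cannot make per-chunk decisions like this; running the control path in software is what makes such policies easy to change.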