🤖 AI Summary
Large-scale AI training on specialized infrastructure suffers from load imbalance and low resource utilization due to heterogeneous traffic patterns. To address this, we propose a joint optimization framework for load balancing tailored to distributed training traffic characteristics: it tightly integrates network-layer congestion control with transport-layer packet-loss recovery into the load scheduling policy, enabling dynamic adaptation of request dispatching to real-time link states. Through systematic evaluation of multiple load-balancing algorithms paired with transport protocols under high-concurrency AI training workloads, we identify the optimal LB-protocol co-design. Experimental results demonstrate that our approach improves system throughput by 32% and reduces end-to-end latency by 27% over baseline methods, significantly enhancing GPU cluster resource utilization and training stability.
📝 Abstract
We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.