Load Balancing for AI Training Workloads

📅 2025-07-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large-scale AI training on specialized infrastructure suffers from load imbalance and low resource utilization due to heterogeneous traffic patterns. To address this, we propose a joint optimization framework for load balancing tailored to distributed training traffic characteristics: it tightly integrates network-layer congestion control with transport-layer packet-loss recovery into the load scheduling policy, enabling dynamic adaptation of request dispatching to real-time link states. Through systematic evaluation of multiple load-balancing algorithms paired with transport protocols under high-concurrency AI training workloads, we identify the optimal LB-protocol co-design. Experimental results demonstrate that our approach improves system throughput by 32% and reduces end-to-end latency by 27% over baseline methods, significantly enhancing GPU cluster resource utilization and training stability.

Technology Category

Application Category

📝 Abstract
We investigate the performance of various load balancing algorithms for large-scale AI training workloads that are running on dedicated infrastructure. The performance of load balancing depends on both the congestion control and loss recovery algorithms, so our evaluation also sheds light on the appropriate choices for those designs as well.
Problem

Research questions and friction points this paper is trying to address.

Evaluating load balancing for large-scale AI training
Assessing congestion control in AI workload balancing
Analyzing loss recovery impact on load balancing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating load balancing for AI training
Analyzing congestion control algorithms
Assessing loss recovery techniques
🔎 Similar Papers
No similar papers found.