🤖 AI Summary
This work addresses straggler nodes and high idle time in barrier-synchronized, stateful systems such as large language model (LLM) inference, where the problem is driven by workload heterogeneity and clock drift. The authors propose a general dynamic load-balancing approach that models workload drift as a non-decreasing stochastic process and uses finite-horizon integer optimization to distribute load efficiently under two constraints: tasks cannot migrate between workers, and progress at each step is gated by the slowest node. The study establishes the first general theoretical framework for load balancing in barrier-synchronized systems with worst-case performance guarantees, and the benefits grow with batch size and system scale. Experiments demonstrate significant improvements in throughput, latency, and energy efficiency, reducing per-step decoding idle time by over 40% in real-world LLM serving.
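The setting can be illustrated with a toy sketch. The function below is an illustrative assumption, not the paper's algorithm: it uses a simple longest-processing-time greedy heuristic to assign newly arrived (non-migratable) tasks to the least-loaded workers, with the barrier step time set by the most loaded worker.

```python
import heapq

def assign_non_migratable(task_loads, current_loads):
    """Greedily assign new tasks to the least-loaded workers.

    Illustrative sketch only: previously placed work cannot migrate,
    so only the new arrivals (task_loads) are assignable, and the
    barrier step time equals the maximum per-worker load.
    """
    # Min-heap of (load, worker_id) over current worker loads.
    heap = [(load, w) for w, load in enumerate(current_loads)]
    heapq.heapify(heap)
    assignment = {}
    # Longest-processing-time first: place the big tasks early.
    for task, load in sorted(enumerate(task_loads), key=lambda t: -t[1]):
        worker_load, w = heapq.heappop(heap)
        assignment[task] = w
        heapq.heappush(heap, (worker_load + load, w))
    # Barrier synchronization: everyone waits for the slowest worker.
    step_time = max(load for load, _ in heap)
    return assignment, step_time

# Two workers with pre-existing loads 4 and 1; three new tasks.
assignment, step_time = assign_non_migratable([5, 3, 2], [4, 1])
# → assignment {0: 1, 1: 0, 2: 1}, step_time 8
```

The paper's actual method replaces this one-shot greedy step with a finite-horizon integer optimization that anticipates workload drift over future steps.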
📝 Abstract
Over 40% of the computational power in Large Language Model (LLM) serving systems can be systematically wasted, not by hardware limits but by load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In LLM inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating a 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.