A Universal Load Balancing Principle and Its Application to Large Language Model Serving

📅 2026-01-25
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenges of straggler nodes and high idle time in barrier-synchronized, stateful systems—such as large language model (LLM) inference—caused by workload heterogeneity and clock drift. The authors propose a general dynamic load balancing approach that models workload drift as a non-decreasing stochastic process and employs finite-horizon integer optimization to achieve efficient load distribution under the constraints that tasks are non-migratable and progress is bottlenecked by the slowest node. This study establishes the first general theoretical framework for load balancing in barrier-synchronized systems with worst-case performance guarantees. The benefits of the method amplify with increasing batch size and system scale. Experimental results demonstrate significant improvements in throughput, latency, and energy efficiency, reducing per-step decoding idle computation time by over 40% in real-world LLM serving scenarios.
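The core constraint described above, that tasks cannot migrate once placed and each barrier step costs as much as the slowest node, can be illustrated with a simple greedy baseline: route each incoming task to the currently least-loaded worker. This is a minimal sketch for intuition only; the paper's actual method solves a finite-horizon integer program over a stochastic drift model, and `assign_tasks` is a hypothetical name, not the authors' API.

```python
import heapq

def assign_tasks(task_costs, num_workers):
    """Greedy online placement: each task goes to the least-loaded worker.
    Tasks are non-migratable once placed, and a barrier-synchronized step
    costs max(worker loads), so minimizing the max load minimizes idle time."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = []
    for cost in task_costs:
        load, w = heapq.heappop(heap)   # least-loaded worker
        assignment.append(w)
        heapq.heappush(heap, (load + cost, w))
    loads = [0.0] * num_workers
    for w, c in zip(assignment, task_costs):
        loads[w] += c
    return assignment, loads

# Five tasks, two workers: greedy keeps the max load at the ideal 6.0.
assignment, loads = assign_tasks([4.0, 3.0, 2.0, 2.0, 1.0], 2)
print(loads)  # [6.0, 6.0]
```

The greedy rule captures the objective (minimize the per-step maximum) but not the paper's lookahead: with a drift model, an optimizer can accept mild imbalance now to avoid a worse straggler later.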

📝 Abstract
Over 40% of computational power in Large Language Model (LLM) serving systems can be systematically wasted - not from hardware limits, but from load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In large language model inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.
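The abstract's headline waste figure follows from simple arithmetic: at a barrier, every worker waits for the slowest, so of the `n * max(t_i)` worker-time available per step, only `sum(t_i)` is useful. A short illustration, where `idle_fraction` is a hypothetical helper rather than anything from the paper:

```python
def idle_fraction(step_times):
    """Fraction of total worker-time spent idle at a barrier step:
    useful work is sum(t_i) out of n * max(t_i) available worker-time."""
    n = len(step_times)
    return 1.0 - sum(step_times) / (n * max(step_times))

# Four workers whose per-step decode times have drifted apart:
print(idle_fraction([10.0, 6.0, 5.0, 4.0]))  # 0.375, i.e. 37.5% idle

# Perfectly balanced workers waste nothing at the barrier:
print(idle_fraction([5.0, 5.0, 5.0]))  # 0.0
```

This also shows why the gains amplify with scale: as `n` grows, a single straggler idles proportionally more workers.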
Problem

Research questions and friction points this paper is trying to address.

load balancing
large language model serving
barrier synchronization
stragglers
workload heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

load balancing
barrier synchronization
large language model serving
integer optimization
workload drift
Zixi Chen
School of Mathematical Sciences, Peking University, Yiheyuan Road, 100871, Beijing, China.
Tianci Bu
Department of System Engineering, National University of Defense Technology, 109 Deya Road, Changsha, 410073, Hunan, China.
Chendong Song
Department of Industrial Engineering and Decision Analytics, HKUST, Clear Water Bay, Hong Kong, China.
Xin Lu
College of Systems Engineering, National University of Defense Technology
Big Data - Natural Disasters - Mobile Phones - Human Behavior - Complex Networks
Yinyu Ye
Professor Emeritus, Stanford University, and Visiting Professor at SJTU, CUHKSZ, and HKUST
Optimization - Operations Research - Mathematical Programming - Computational Science
Zijie Zhou
Department of Industrial Engineering and Decision Analytics, HKUST, Clear Water Bay, Hong Kong, China.