Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

📅 2026-05-07
🤖 AI Summary
This work addresses the synchronization-barrier delays and wasted compute that load imbalance across data-parallel (DP) workers causes in large-model inference. Because migrating KV caches is expensive, request assignments are effectively sticky, and dynamic request patterns produce persistent skew. The authors propose BalanceRoute, a family of online routing algorithms that assign requests to DP workers under millisecond-level scheduling constraints. Key components include BR-0, a prediction-free baseline built on a piecewise-linear F-score that models load safety margins; BR-H, which adds a short planning horizon and a horizon-discounted F-score; and an integrated framework featuring two-stage scheduling, a lightweight termination classifier, and KV cache-aware modeling. Evaluated on a 144-NPU cluster, BalanceRoute substantially reduces DP load imbalance compared to vLLM and achieves higher end-to-end throughput on both the Azure-2024 trace and a production workload.
📝 Abstract
Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100 ms decode budget over hundreds of waiting requests and tens of workers. We present BalanceRoute, a family of practical online routing algorithms that target this bottleneck. The first, BR-0, requires no prediction infrastructure and uses a piecewise-linear F-score that captures the sharp asymmetry between admissions that fill safe margin and those that overflow into the envelope; a two-stage decomposition keeps per-step cost compatible with millisecond-scale scheduling. The second, BR-H, generalizes BR-0 with a short, constant lookahead H and a lightweight termination-classifier interface, extending the F-score to a horizon-discounted form. We deploy BalanceRoute on a 144-NPU cluster and evaluate against vLLM baselines on both a proprietary production trace and the public Azure-2024 trace. Across both workloads, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput.
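The barrier argument in the abstract can be made concrete with a small Python sketch. It is illustrative only and is not the paper's algorithm: `barrier_waste` assumes step time is proportional to the heaviest worker's load, and `margin_score`, `route`, `envelope`, and `penalty` are hypothetical names standing in for a piecewise-linear score that rewards load admitted under a safe margin and penalizes overflow more steeply. The actual F-score and its parameters are not given in this summary.

```python
from typing import List


def barrier_waste(loads: List[float]) -> float:
    """Fraction of worker-time idle at the barrier for one decode step.

    Assumption: every worker waits for the most heavily loaded one, so the
    step costs len(loads) * max(loads) worker-time for sum(loads) useful work.
    """
    peak = max(loads)
    if peak == 0:
        return 0.0
    return 1.0 - sum(loads) / (len(loads) * peak)


def margin_score(load: float, cost: float, envelope: float,
                 penalty: float = 4.0) -> float:
    """Hypothetical piecewise-linear admission score (not the paper's F-score).

    Each unit of `cost` that fits under `envelope` earns +1; each unit that
    overflows costs `penalty`, which models the asymmetry between filling
    safe margin and exceeding it.
    """
    fits = max(0.0, min(cost, envelope - load))
    overflow = cost - fits
    return fits - penalty * overflow


def route(loads: List[float], cost: float, envelope: float) -> int:
    """Greedy choice: highest score wins; ties go to the lighter worker."""
    return max(
        range(len(loads)),
        key=lambda w: (margin_score(loads[w], cost, envelope), -loads[w]),
    )


if __name__ == "__main__":
    loads = [100.0, 60.0, 60.0, 60.0]
    print(f"idle fraction at barrier: {barrier_waste(loads):.2f}")
    print("chosen worker:", route([90.0, 50.0, 70.0], cost=20.0, envelope=100.0))
```

Under this model, four workers with loads 100, 60, 60, and 60 leave 30% of the step's worker-time idle, which is the kind of persistent loss a router tries to avoid at admission time, since sticky assignments cannot be cheaply corrected later.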
Problem

Research questions and friction points this paper is trying to address.

load balancing
large language models
data parallelism
LLM serving
synchronization bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

load balancing
online routing
LLM serving
data parallelism
F-score
Tianci Bu
Department of Industrial Engineering and Decision Analytics, HKUST
Yuan Lyu
Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei
Zixi Chen
School of Mathematical Sciences, Peking University
Chendong Song
Department of Industrial Engineering and Decision Analytics, HKUST
Hong Liang
Aramco Americas
Tsepten Gurung
Department of Industrial Engineering and Decision Analytics, HKUST
Yuwei Fan
Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei
Yinyu Ye
Professor Emeritus, Stanford University; Visiting Professor at SJTU, CUHKSZ, and HKUST
Optimization - Operations Research - Mathematical Programming - Computational Science
Zijie Zhou
Department of Industrial Engineering and Decision Analytics, HKUST