Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the synchronization-barrier delays and wasted compute that load imbalance under data parallelism (DP) causes in LLM inference. Because migrating KV caches is costly, request assignments are effectively sticky, and dynamic request patterns produce persistent skew. The authors propose BalanceRoute, a family of online routing algorithms that assign requests to DP workers under millisecond-level scheduling constraints. Key contributions include BR-0, a prediction-free variant built on a piecewise-linear F-score that models load safety margins; BR-H, which adds a short planning horizon with a horizon-discounted F-score; and an integrated framework featuring two-stage scheduling, a lightweight termination classifier, and KV cache-aware modeling. Evaluated on a 144-NPU cluster, BalanceRoute substantially reduces DP load imbalance relative to vLLM baselines and achieves higher end-to-end throughput on both the public Azure-2024 trace and a proprietary production workload.
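To make the cost of the synchronization barrier concrete, the following is a minimal sketch of how per-step imbalance turns into idle time. It assumes, purely for illustration, that a worker's step time is proportional to its load; the function name `barrier_waste` and the load values are hypothetical and do not come from the paper.

```python
from typing import Sequence


def barrier_waste(loads: Sequence[float]) -> float:
    """Fraction of aggregate worker time spent idle at the barrier in one step.

    Assumption: step time is proportional to load, so every worker waits for
    the most heavily loaded one. Then waste = 1 - mean(loads) / max(loads).
    """
    peak = max(loads)
    if peak <= 0:
        return 0.0
    return 1.0 - (sum(loads) / len(loads)) / peak


# Hypothetical per-worker loads for a single decode step.
balanced = [100, 100, 100, 100]
skewed = [100, 100, 100, 160]

print(barrier_waste(balanced))  # no idle time when loads are equal
print(barrier_waste(skewed))    # one heavy worker stalls the other three
```

Under this simplified model, a single worker carrying 60% more load than its peers leaves roughly 28% of the step's aggregate worker time idle, and the loss recurs on every decode step for as long as the skew persists.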

📝 Abstract

Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100 ms decode budget over hundreds of waiting requests and tens of workers. We present BalanceRoute, a family of practical online routing algorithms that target this bottleneck. The first, BR-0, requires no prediction infrastructure and uses a piecewise-linear F-score that captures the sharp asymmetry between admissions that fill safe margin and those that overflow into the envelope; a two-stage decomposition keeps per-step cost compatible with millisecond-scale scheduling. The second, BR-H, generalizes BR-0 with a short, constant lookahead H and a lightweight termination-classifier interface, extending the F-score to a horizon-discounted form. We deploy BalanceRoute on a 144-NPU cluster and evaluate against vLLM baselines on both a proprietary production trace and the public Azure-2024 trace. Across both workloads, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput.
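The abstract does not give the F-score's definition, so the following is only an illustrative sketch of what a piecewise-linear, asymmetric admission score could look like, paired with a simple greedy router. Everything here is an assumption made for illustration: the names `f_score` and `route`, the `safe_limit` threshold, the `penalty` slope, and the tie-breaking rule. It is not the BR-0 algorithm itself.

```python
from typing import List, Optional, Sequence


def f_score(load: float, request: float, safe_limit: float, penalty: float = 4.0) -> float:
    """Hypothetical piecewise-linear score for admitting `request` onto a worker.

    Load that lands below `safe_limit` earns +1 per unit; load that lands
    above it costs `penalty` per unit. The slope change at `safe_limit` is
    the asymmetry between filling margin and overflowing it.
    """
    margin = max(safe_limit - load, 0.0)
    fill = min(request, margin)
    overflow = request - fill
    return fill - penalty * overflow


def route(requests: Sequence[float], loads: List[float], safe_limit: float) -> List[Optional[int]]:
    """Greedily assign each request to the best-scoring worker.

    Returns one worker index per request, or None when every worker scores
    negative and the request is left waiting. `loads` is updated in place.
    """
    assignment: List[Optional[int]] = []
    for req in requests:
        # Highest score wins; ties go to the least-loaded worker.
        best = max(
            range(len(loads)),
            key=lambda w: (f_score(loads[w], req, safe_limit), -loads[w]),
        )
        if f_score(loads[best], req, safe_limit) < 0:
            assignment.append(None)  # assumed policy: defer rather than overflow
            continue
        loads[best] += req
        assignment.append(best)
    return assignment


# Hypothetical loads for three DP workers and three waiting requests.
loads = [70.0, 40.0, 90.0]
plan = route([30.0, 30.0, 30.0], loads, safe_limit=100.0)
print(plan, loads)
```

In this toy run the router steers requests away from the worker that is already close to the threshold, because even a small overflow outweighs the margin it would fill. Since assignments are sticky, a score of this shape penalizes a bad placement at admission time, which is the only point where the router can still act on it.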

Problem

Research questions and friction points this paper is trying to address.

load balancing

large language models

data parallelism

LLM serving

synchronization bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

load balancing

online routing

LLM serving