Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck in large language model (LLM) inference caused by heterogeneous workloads when the prefill and decode stages share GPU resources. The authors propose a stochastic control-based scheduling framework that models LLM inference as a multi-class, multi-server queueing network with state-dependent service rates. By leveraging fluid approximations and steady-state linear programming, they derive an asymptotically optimal gated routing policy. This policy supports both bundled and disaggregated pricing models in large-scale GPU clusters and is extensible to service-level agreement (SLA) constraints such as latency and fairness. Experiments calibrated to real-world iteration times show that the proposed approach significantly outperforms existing heuristic scheduling strategies.
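The gate-and-route idea described above can be sketched in a few lines. In this toy version (the threshold rule, the least-loaded routing, and all class/field names are illustrative assumptions, not the paper's exact policy), a prefill is admitted to a GPU only while that GPU's decode load is below a gate threshold, so compute-heavy prefills do not throttle many in-flight decodes; new decodes are routed to the least-loaded GPU:

```python
from dataclasses import dataclass


@dataclass
class GPU:
    active_decodes: int = 0   # decodes currently running on this GPU
    prefill_busy: bool = False


class GateAndRoute:
    """Toy gate-and-route scheduler (illustrative only)."""

    def __init__(self, gpus, gate_threshold):
        self.gpus = gpus
        self.gate = gate_threshold
        self.prefill_queue = []

    def submit_prefill(self, req):
        # Gate: admit the prefill only on a GPU whose decode load is
        # below the threshold; otherwise hold it at the gate.
        for g in self.gpus:
            if not g.prefill_busy and g.active_decodes < self.gate:
                g.prefill_busy = True
                return g
        self.prefill_queue.append(req)
        return None

    def route_decode(self):
        # Route: send the new decode to the least-loaded GPU.
        g = min(self.gpus, key=lambda g: g.active_decodes)
        g.active_decodes += 1
        return g
```

A real implementation would, as the paper indicates, pick the gate threshold from the steady-state LP solution rather than fixing it by hand.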

๐Ÿ“ Abstract
Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive *prefill* phase that processes user input, followed by a memory-bound *decode* phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates, grounded in empirical iteration-time measurements. We analyze the fluid approximation of this system and solve steady-state linear programs that characterize optimal resource allocation. We design gate-and-route policies that regulate prefill admission and decode routing, and prove that they are asymptotically optimal in the many-GPU limit under both bundled and separate token-pricing schemes. We further extend the framework to incorporate Service Level Indicators (SLIs) such as latency and fairness, providing a general approach to constrained scheduling. Numerical experiments calibrated to empirical iteration-time data demonstrate that our policies outperform standard serving heuristics.
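The steady-state linear program mentioned in the abstract can be illustrated with a minimal sketch. Under the fluid approximation, the allocation problem reduces to choosing how many GPUs to devote to prefill and decode for each workload class so as to maximize served throughput, subject to capacity and demand constraints. All rates, class definitions, and the exact LP form below are hypothetical assumptions for illustration, not the paper's calibrated model:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-class workload: class k arrives at rate lam[k] and a GPU
# completes its prefills at rate mu_p[k] and its decodes at rate mu_d[k].
lam = np.array([4.0, 2.0])    # requests/sec per class
mu_p = np.array([10.0, 5.0])  # prefill completions/sec per GPU
mu_d = np.array([8.0, 3.0])   # decode completions/sec per GPU
n_gpus = 6.0

# Decision variables: [p1, p2, d1, d2, r1, r2] where p_k/d_k are GPUs
# devoted to prefill/decode of class k and r_k is the served rate.
# Maximize r1 + r2 (linprog minimizes, hence the negated objective).
c = np.array([0, 0, 0, 0, -1.0, -1.0])
A_ub = np.array([
    [-mu_p[0], 0, 0, 0, 1, 0],  # r1 <= mu_p1 * p1  (prefill capacity)
    [0, -mu_p[1], 0, 0, 0, 1],  # r2 <= mu_p2 * p2
    [0, 0, -mu_d[0], 0, 1, 0],  # r1 <= mu_d1 * d1  (decode capacity)
    [0, 0, 0, -mu_d[1], 0, 1],  # r2 <= mu_d2 * d2
    [1, 1, 1, 1, 0, 0],         # total GPU budget
])
b_ub = np.array([0, 0, 0, 0, n_gpus])
bounds = [(0, None)] * 4 + [(0, lam[0]), (0, lam[1])]  # r_k <= demand

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("GPU allocation [p1, p2, d1, d2]:", res.x[:4])
print("served rates [r1, r2]:", res.x[4:])
```

In this toy instance capacity is ample, so the LP serves all demand; the paper's version additionally encodes state-dependent service rates from measured iteration times and pricing-scheme-specific objectives.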
Problem

Research questions and friction points this paper is trying to address.

LLM inference
prefill-decode contention
heterogeneous workloads
resource scheduling
GPU clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM inference
prefill-decode contention
stochastic control
asymptotically optimal scheduling
heterogeneous workloads
Ruihan Lin
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Zean Han
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Zezhen Ding
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Jiheng Zhang
The Hong Kong University of Science and Technology
Applied Probability · Stochastic Modeling and Optimization · Numerical Methods and Algorithms