Large-Scale LLM Inference with Heterogeneous Workloads: Prefill-Decode Contention and Asymptotically Optimal Control

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck in large language model (LLM) inference caused by heterogeneous workloads when the prefill and decode stages share GPU resources. The authors propose a stochastic control-based scheduling framework that models LLM inference as a multi-class, multi-server queueing network with state-dependent service rates. By leveraging fluid approximations and steady-state linear programming, they derive an asymptotically optimal gated routing policy. This policy supports both bundled and disaggregated pricing models in large-scale GPU clusters and is extensible to service-level agreement (SLA) constraints such as latency and fairness. Experiments calibrated to real-world iteration times show that the proposed approach significantly outperforms existing heuristic scheduling strategies.
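The gate-and-route idea described above can be sketched in a few lines. In this toy version (the threshold rule, the least-loaded routing, and all class/field names are illustrative assumptions, not the paper's exact policy), a prefill is admitted to a GPU only while that GPU's decode load is below a gate threshold, so compute-heavy prefills do not throttle many in-flight decodes; new decodes are routed to the least-loaded GPU:

```python
from dataclasses import dataclass


@dataclass
class GPU:
    active_decodes: int = 0   # decodes currently running on this GPU
    prefill_busy: bool = False


class GateAndRoute:
    """Toy gate-and-route scheduler (illustrative only)."""

    def __init__(self, gpus, gate_threshold):
        self.gpus = gpus
        self.gate = gate_threshold
        self.prefill_queue = []

    def submit_prefill(self, req):
        # Gate: admit the prefill only on a GPU whose decode load is
        # below the threshold; otherwise hold it at the gate.
        for g in self.gpus:
            if not g.prefill_busy and g.active_decodes < self.gate:
                g.prefill_busy = True
                return g
        self.prefill_queue.append(req)
        return None

    def route_decode(self):
        # Route: send the new decode to the least-loaded GPU.
        g = min(self.gpus, key=lambda g: g.active_decodes)
        g.active_decodes += 1
        return g
```

A real implementation would, as the paper indicates, pick the gate threshold from the steady-state LP solution rather than fixing it by hand.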

๐Ÿ“ Abstract
Large Language Models (LLMs) are rapidly becoming critical infrastructure for enterprise applications, driving unprecedented demand for GPU-based inference services. A key operational challenge arises from the two-phase nature of LLM inference: a compute-intensive *prefill* phase that processes user input, followed by a memory-bound *decode* phase that generates output tokens. When these phases share GPU resources, prefill tasks throttle the processing speed of concurrent decodes, creating state-dependent contention. This contention is further complicated by workload heterogeneity, as different applications exhibit vastly different input and output lengths. We develop a stochastic control framework for scheduling heterogeneous LLM workloads across large GPU clusters. We formulate LLM inference as a multiclass many-server queueing network with state-dependent service rates, grounded in empirical iteration-time measurements. We analyze the fluid approximation of this system and solve steady-state linear programs that characterize optimal resource allocation. We design gate-and-route policies that regulate prefill admission and decode routing, and prove that they are asymptotically optimal in the many-GPU limit under both bundled and separate token-pricing schemes. We further extend the framework to incorporate Service Level Indicators (SLIs) such as latency and fairness, providing a general approach to constrained scheduling. Numerical experiments calibrated to empirical iteration-time data demonstrate that our policies outperform standard serving heuristics.
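The steady-state linear program mentioned in the abstract can be illustrated with a minimal sketch. Under the fluid approximation, the allocation problem reduces to choosing how many GPUs to devote to prefill and decode for each workload class so as to maximize served throughput, subject to capacity and demand constraints. All rates, class definitions, and the exact LP form below are hypothetical assumptions for illustration, not the paper's calibrated model:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-class workload: class k arrives at rate lam[k] and a GPU
# completes its prefills at rate mu_p[k] and its decodes at rate mu_d[k].
lam = np.array([4.0, 2.0])    # requests/sec per class
mu_p = np.array([10.0, 5.0])  # prefill completions/sec per GPU
mu_d = np.array([8.0, 3.0])   # decode completions/sec per GPU
n_gpus = 6.0

# Decision variables: [p1, p2, d1, d2, r1, r2] where p_k/d_k are GPUs
# devoted to prefill/decode of class k and r_k is the served rate.
# Maximize r1 + r2 (linprog minimizes, hence the negated objective).
c = np.array([0, 0, 0, 0, -1.0, -1.0])
A_ub = np.array([
    [-mu_p[0], 0, 0, 0, 1, 0],  # r1 <= mu_p1 * p1  (prefill capacity)
    [0, -mu_p[1], 0, 0, 0, 1],  # r2 <= mu_p2 * p2
    [0, 0, -mu_d[0], 0, 1, 0],  # r1 <= mu_d1 * d1  (decode capacity)
    [0, 0, 0, -mu_d[1], 0, 1],  # r2 <= mu_d2 * d2
    [1, 1, 1, 1, 0, 0],         # total GPU budget
])
b_ub = np.array([0, 0, 0, 0, n_gpus])
bounds = [(0, None)] * 4 + [(0, lam[0]), (0, lam[1])]  # r_k <= demand

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("GPU allocation [p1, p2, d1, d2]:", res.x[:4])
print("served rates [r1, r2]:", res.x[4:])
```

In this toy instance capacity is ample, so the LP serves all demand; the paper's version additionally encodes state-dependent service rates from measured iteration times and pricing-scheme-specific objectives.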
Problem

Research questions and friction points this paper is trying to address.

LLM inference
prefill-decode contention
heterogeneous workloads
resource scheduling
GPU clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM inference
prefill-decode contention
stochastic control
asymptotically optimal scheduling
heterogeneous workloads
Ruihan Lin
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Zean Han
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Zezhen Ding
Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Jiheng Zhang
The Hong Kong University of Science and Technology
Applied Probability · Stochastic Modeling and Optimization · Numerical Methods and Algorithms