Optimal Scheduling Algorithms for LLM Inference: Theory and Practice

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM) inference exhibits intrinsic computational heterogeneity between the prefill and decode stages, posing significant challenges for request routing and scheduling. Method: This paper develops a theoretical scheduling framework explicitly designed for this two-stage structure. It proposes RAD, a throughput-optimal scheduler built on optimal tiling and dynamic resource allocation, and SLAI, a scheduler that supports fine-grained, differentiated service-level objectives for time-to-first-token (TTFT) and time-between-token (TBT) latency. SLAI combines prompt-length-aware reordering of prefill requests with real-time, deadline-aware prioritization of decode requests. Results: Evaluated with Mistral-7B on the Openchat ShareGPT4 dataset, SLAI reduces median TTFT by 53% and increases maximum serving capacity by 26% over Sarathi-Serve while meeting tail TBT latency constraints.

📝 Abstract
With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill phase that processes the full input prompt and a decode phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs), such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines, and reorders prefill requests based on known prompt lengths to further reduce Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
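The abstract's description of SLAI's policy (prioritize decode requests nearest their TBT deadlines; serve prefill requests shortest-prompt-first) can be sketched roughly as below. This is a hypothetical illustration, not the authors' implementation; the function name `schedule_step`, the slack-based priority, and the per-batch budgets are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical sketch of an SLAI-style scheduling step (not the paper's code).
# Decode requests are ordered by slack to their TBT deadline; pending prefill
# requests are served shortest-prompt-first to reduce median TTFT.

@dataclass(order=True)
class DecodeReq:
    slack: float                       # seconds until the next TBT deadline is missed
    req_id: int = field(compare=False)

def schedule_step(decode_reqs, prefill_reqs, now, decode_budget=4, prefill_budget=2):
    """Pick requests for the next batch.

    decode_reqs:  list of (req_id, next_token_deadline) tuples
    prefill_reqs: list of (req_id, prompt_len) tuples
    """
    # Decode: smallest slack first, i.e. closest to missing its TBT deadline.
    heap = [DecodeReq(deadline - now, rid) for rid, deadline in decode_reqs]
    heapq.heapify(heap)
    decodes = [heapq.heappop(heap).req_id
               for _ in range(min(decode_budget, len(heap)))]

    # Prefill: shortest known prompt first to cut TTFT for short requests.
    prefills = [rid for rid, _ in sorted(prefill_reqs, key=lambda r: r[1])]
    return decodes, prefills[:prefill_budget]
```

In a real serving loop the budgets would be derived from the GPU's per-iteration token capacity rather than fixed request counts.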
Problem

Research questions and friction points this paper is trying to address.

Develop efficient scheduling for LLM two-phase inference.
Optimize throughput via tiling and dynamic resource allocation.
Meet SLOs like TBT and TTFT in LLM serving.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal tiling for high throughput
Dynamic resource allocation strategy
SLO-aware scheduler for latency
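The tiling principle above can be illustrated with a small sketch: split a long prompt's prefill work into fixed-size chunks so it can be interleaved with decode steps under a per-iteration token budget, as in chunked-prefill systems such as Sarathi-Serve, the paper's baseline. The function `tile_prefill` and its parameters are assumptions made for this example, not the paper's formulation.

```python
# Hypothetical illustration of prefill tiling (chunked prefill), assuming a
# fixed per-iteration token budget shared between decode and prefill work.

def tile_prefill(prompt_len, decode_tokens, token_budget):
    """Yield per-iteration (prefill_chunk, decode_tokens) pairs.

    Each iteration carries all pending decode tokens plus as large a
    prefill chunk as the remaining budget allows, so long prompts do not
    stall ongoing decodes.
    """
    done = 0
    while done < prompt_len:
        chunk = min(token_budget - decode_tokens, prompt_len - done)
        if chunk <= 0:
            raise ValueError("token_budget must exceed the decode load")
        yield chunk, decode_tokens
        done += chunk
```

With a 512-token budget and 56 active decodes, a 1000-token prompt would be prefilled in chunks of 456, 456, and 88 tokens, keeping every iteration's decode latency bounded.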
Agrim Bari
The University of Texas at Austin, USA
Parikshit Hegde
The University of Texas at Austin, USA
Gustavo de Veciana
Professor of Electrical and Computer Engineering, U.T. Austin
Communication Systems · Networks · Performance