Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses instability in large language model (LLM) inference caused by unpredictable decode lengths: because each request's key-value (KV) cache footprint grows with every generated token, memory consumption can overflow and cause system failure. To tackle this challenge, the paper establishes the first theoretical stability bound for LLM inference scheduling and introduces a flow-control-based scheduling framework. The framework regulates the rate at which prompts enter the active set and combines queueing theory, dynamic resource control, KV cache management, and request admission policies to achieve provably stable inference. Experimental results show that the proposed approach improves both request and token throughput, reduces average and tail latency, and keeps KV cache utilization more consistent.

📝 Abstract

Large language models (LLMs) have been widely adopted due to their strong performance across a wide range of applications. ChatGPT and Gemini now serve hundreds of millions of active users and handle billions of user requests per day, which puts optimizing LLM inference into the spotlight. A key challenge in LLM inference is that decode lengths are unknown in advance. The memory usage of each request grows with the number of generated tokens, which may lead to memory overflow and cause system instability. To address this concern, we propose a simple flow-control framework that controls the rate at which prompts join the active set. We derive a necessary condition that any stable system must satisfy and establish sufficient conditions under which our algorithm provably achieves stability. Experiments show that, compared to strategies commonly used in practice, our approach achieves higher token and request throughput, lower average and tail latency, and more stable KV cache utilization.
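The idea of controlling the rate at which prompts join the active set can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: the class `FlowControlledScheduler`, its parameters `admit_rate` and `burst`, and the helper `run` are hypothetical names chosen for illustration, and the credit-based (token-bucket style) admission rule is an assumed stand-in for whatever policy the paper actually analyzes. Each active request holds one KV cache slot per prompt token and gains one slot per decode step, so admitting prompts more slowly lowers the peak cache footprint.

```python
from collections import deque


class FlowControlledScheduler:
    """Toy scheduler (illustrative only): prompts join the active set at a bounded rate."""

    def __init__(self, admit_rate, burst=1.0):
        self.admit_rate = admit_rate  # admission credit earned per step (assumed knob)
        self.burst = burst            # cap on accumulated credit (assumed knob)
        self.credit = 0.0
        self.waiting = deque()        # queued (prompt_len, decode_len) pairs
        self.active = []              # entries: [kv_tokens, remaining_decode]

    def submit(self, prompt_len, decode_len):
        # decode_len drives the simulation only; the admission rule never reads it,
        # mirroring the fact that decode lengths are unknown to the scheduler.
        self.waiting.append((prompt_len, decode_len))

    def step(self):
        """Run one decode step; return (finished_count, kv_tokens_in_use)."""
        # Flow control: accrue credit, then admit one prompt per whole credit.
        self.credit = min(self.burst, self.credit + self.admit_rate)
        while self.credit >= 1.0 and self.waiting:
            prompt_len, decode_len = self.waiting.popleft()
            self.active.append([prompt_len, decode_len])
            self.credit -= 1.0
        # Every active request generates one token, growing its KV cache by one slot.
        for req in self.active:
            req[0] += 1
            req[1] -= 1
        kv_used = sum(req[0] for req in self.active)  # measured before freeing
        finished = sum(1 for req in self.active if req[1] <= 0)
        self.active = [req for req in self.active if req[1] > 0]
        return finished, kv_used


def run(scheduler, steps):
    """Advance the scheduler and return (total_finished, peak_kv_tokens)."""
    total_finished, peak_kv = 0, 0
    for _ in range(steps):
        finished, kv_used = scheduler.step()
        total_finished += finished
        peak_kv = max(peak_kv, kv_used)
    return total_finished, peak_kv


def demo(admit_rate, burst, steps=20):
    sched = FlowControlledScheduler(admit_rate, burst)
    for _ in range(10):
        sched.submit(prompt_len=4, decode_len=3)
    return run(sched, steps)


throttled = demo(admit_rate=0.5, burst=1.0)     # one prompt every two steps
unthrottled = demo(admit_rate=10.0, burst=10.0)  # all ten prompts admitted at once
print(throttled, unthrottled)  # (9, 12) (10, 70)
```

In this toy workload the throttled run never holds more than 12 KV tokens, while admitting everything at once peaks at 70. The price is visible too: after 20 steps the throttled run has finished 9 of 10 requests, which is the kind of throughput-versus-memory trade-off a stability analysis has to balance.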

Problem

Research questions and friction points this paper is trying to address.

LLM inference

decode length uncertainty

memory overflow

system instability

KV cache

Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-controlled scheduling

LLM inference

stability guarantees