🤖 AI Summary
Large language model (LLM) inference systems face significant challenges, including highly variable service times, dynamic KV cache memory consumption, and sensitivity to preemption policies, all of which hinder analytical performance modeling and efficient scheduling.
Method: We propose the first queueing-theoretic framework tailored to LLM inference, featuring a prediction-augmented scheduler that explicitly models service time prediction errors and their nonlinear impact on latency and throughput. Our method integrates KV-cache-aware resource constraints with multi-policy preemption mechanisms.
Contribution/Results: We characterize critical trade-offs between prediction accuracy and system performance (throughput/latency), identify key bottleneck scenarios under realistic inference workloads, and formulate several open problems in prediction-sensitive scheduling. This work bridges classical queueing theory and modern LLM systems, establishing a first analytically tractable and empirically verifiable foundation for LLM cluster scheduling algorithms.
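The trade-off between prediction accuracy and latency can be illustrated with a toy experiment (not from the paper; all parameters are illustrative): a single server serves jobs shortest-predicted-first, where predictions are true sizes plus Gaussian noise. With zero noise this is exact shortest-job-first; as noise grows, the ordering degrades toward random and mean time in system rises.

```python
import heapq
import random

def simulate(jobs, noise_sd, seed=0):
    """Serve all jobs (present at time 0) on one server, non-preemptively,
    in order of *predicted* size = true size + Gaussian noise.
    noise_sd = 0 recovers exact shortest-job-first; a large noise_sd
    degrades toward a random order. Returns mean time in system."""
    rng = random.Random(seed)
    heap = [(s + rng.gauss(0.0, noise_sd), i, s) for i, s in enumerate(jobs)]
    heapq.heapify(heap)
    clock, total = 0.0, 0.0
    while heap:
        _, _, size = heapq.heappop(heap)
        clock += size      # server is busy for the job's true size
        total += clock     # job arrived at 0, departs at `clock`
    return total / len(jobs)

rng = random.Random(42)
jobs = [rng.expovariate(1.0) for _ in range(2000)]  # highly variable sizes
print(simulate(jobs, noise_sd=0.0))   # perfect predictions
print(simulate(jobs, noise_sd=5.0))   # very noisy predictions: much worse
```

This is only a sketch of the effect the summary describes; the paper's framework models the error-latency relationship analytically rather than by simulation.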
📝 Abstract
Queueing systems present many opportunities for applying machine-learning predictions, such as estimated service times, to improve system performance. This integration raises numerous open questions about how predictions can be effectively leveraged to improve scheduling decisions. Recent studies explore queues with predicted service times, typically aiming to minimize job time in the system. We review these works, highlight the effectiveness of predictions, and present open questions on queue performance. We then consider an important practical example of using predictions in scheduling, namely Large Language Model (LLM) systems, which present novel scheduling challenges and highlight the potential for predictions to improve performance. In particular, we consider LLMs performing inference. Inference requests (jobs) in LLM systems are inherently complex: they have variable inference times, dynamic memory footprints that are constrained by key-value (KV) store memory limitations, and multiple possible preemption approaches that affect performance differently. We provide background on the important aspects of scheduling in LLM systems, and introduce new models and open problems that arise from them. We argue that there are significant opportunities for applying insights and analysis from queueing theory to scheduling in LLM systems.
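The KV cache dynamics described above can be sketched as a toy batch scheduler (a simplification, not the paper's model): each running request grows its cache by one token per decode step, and when the batch exceeds a memory budget, the most recently admitted request is preempted recompute-style, meaning its cache is dropped and it rejoins the wait queue. The budget, the LIFO victim choice, and the one-token-per-step growth are all illustrative assumptions.

```python
from collections import deque

def decode_step(running, waiting, budget):
    """One decode iteration of a KV-cache-aware batch scheduler (toy).
    Each request dict holds 'kv' (tokens currently cached) and
    'remaining' (output tokens left to generate). Running requests grow
    their KV footprint by one token; finished requests free their cache;
    if the batch exceeds the budget, the newest request is preempted
    (cache dropped, back to the wait queue, recomputed on readmission);
    waiting requests are then admitted while their footprint fits."""
    for r in running:
        r["kv"] += 1
        r["remaining"] -= 1
    done = [r for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    while sum(r["kv"] for r in running) > budget:
        waiting.appendleft(running.pop())   # preempt newest admitted
    used = sum(r["kv"] for r in running)
    while waiting and used + waiting[0]["kv"] <= budget:
        req = waiting.popleft()
        running.append(req)
        used += req["kv"]
    return running, waiting, done

# Two requests, prompt length 4 each, under a 10-token KV budget.
waiting = deque({"rid": i, "kv": 4, "remaining": n} for i, n in [(1, 3), (2, 5)])
running, finished, peak = [], [], 0
for _ in range(30):
    running, waiting, done = decode_step(running, waiting, budget=10)
    finished += done
    peak = max(peak, sum(r["kv"] for r in running))
```

In this run the growing caches eventually overflow the budget, request 2 is preempted and later readmitted, and both requests complete without the batch ever exceeding the 10-token budget; the preemption policy (recompute vs. swap) is exactly the kind of design choice whose performance impact the section discusses.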