SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity

📅 2026-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the dual challenges in large language model (LLM) inference scheduling: the inherent uncertainty in output lengths and the joint pressure on computational and memory resources. Existing approaches suffer from inefficiency due to reliance on simplistic heuristics or neglect of memory bottlenecks. To overcome these limitations, we propose SageSched, the first scheduler that jointly models output length distribution and hybrid resource consumption. SageSched employs a lightweight mechanism that integrates prompt content with historical inference outcomes to predict output length distributions, constructs a service cost model that accounts for both computation and memory demands, and devises an uncertainty-aware scheduling policy. Evaluations on a real-system implementation demonstrate that SageSched improves inference efficiency by over 28.7% on average across diverse configurations compared to state-of-the-art methods.

Technology Category

Application Category

📝 Abstract
Efficient LLM inference scheduling is crucial for user experience.However, LLM inferences exhibit remarkable demand uncertainty (with unknown output length beforehand) and hybridity (being both compute and memory intensive). Existing LLM schedulers rely on simple heuristics or focus purely on compute resource, suffering suboptimal performance. In this work, we propose SageSched, an efficient LLM scheduler that properly handles demand uncertainty and hybridity of inference workloads.SageSched combines prompt contents with the past inference results to predict output-length distribution in a light-weight and also accurate manner.Meanwhile, it models the true service cost of an inference request with both compute and memory aspects considered.Finally, SageSched employs an uncertainty-aware scheduling policy that can yield the best overall efficiency given the request cost distributions.Testbed experiments over diverse setups confirm that SageSched can attain an efficiency improvement of over 28.7%.
Problem

Research questions and friction points this paper is trying to address.

demand uncertainty
hybridity
LLM inference scheduling
resource efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM scheduling
demand uncertainty
hybrid resource modeling
output-length prediction
uncertainty-aware scheduling
Z
Zhenghao Gan
Shanghai Jiao Tong University
Y
Yichen Bao
Shanghai Jiao Tong University
Y
Yifei Liu
Shanghai Jiao Tong University
C
Chen Chen
Shanghai Jiao Tong University
Quan Chen
Quan Chen
Professor, Shanghai Jiao Tong University
Parallel Computing
Minyi Guo
Minyi Guo
IEEE Fellow, Chair Professor, Shanghai Jiao Tong University
Parallel ComputingCompiler OptimizationCloud ComputingNetworkingBig Data