AI Summary
Existing schedulers for LLM inference services suffer from inaccurate memory estimation due to highly variable output lengths, leading to either overly aggressive batching (causing unnecessary request evictions) or excessive conservatism (increasing queueing latency). As a result, they fail to maximize goodput under strict SLA constraints.
Method: We propose a dynamic batch scheduling mechanism that jointly leverages historical output-length statistics and fine-grained temporal memory trajectory prediction. It models the empirical output-length distribution and precisely estimates per-batch peak memory consumption during execution.
Contribution/Results: Integrated into our high-performance inference framework LightLLM, the scheduler adaptively balances queueing delay and eviction across diverse input-output workloads. Experiments under high-load, SLA-sensitive settings show a 2-3× improvement in goodput over state-of-the-art schedulers, demonstrating both effectiveness and practicality.
Abstract
The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput) across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely estimates the peak memory resources required by the running batch by considering the historical distribution of request output lengths and calculating memory occupancy at each future time point. It adapts to applications with all types of input-output length distributions, balancing the trade-off between request queuing and harmful evictions, thereby consistently achieving better goodput. Furthermore, to validate the effectiveness of the proposed scheduler, we developed a high-performance LLM serving framework, LightLLM, that implements the Past-Future scheduler. Compared to existing aggressive or conservative schedulers, LightLLM demonstrates superior goodput, achieving up to 2-3× higher goodput under heavy loads. LightLLM is open source to foster research in this direction (https://github.com/ModelTC/lightllm).
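To make the core idea concrete, the following is a minimal sketch (not the paper's actual implementation) of admission control in the Past-Future style: each request's remaining output is bounded by a high quantile of the historical output-length distribution, and the scheduler walks forward through future decode steps to find the batch's peak KV-cache token demand, releasing each request's memory once it is predicted to finish. All names (`peak_token_demand`, `can_admit`), the quantile choice, and the token-count memory model are illustrative assumptions.

```python
def predicted_total_len(history, q=0.99):
    # Hypothetical bound: linearly interpolated q-quantile of the
    # historical output lengths (mirrors numpy's default behaviour).
    xs = sorted(history)
    h = (len(xs) - 1) * q
    lo = int(h)
    hi = min(lo + 1, len(xs) - 1)
    return int(xs[lo] + (h - lo) * (xs[hi] - xs[lo]))

def peak_token_demand(batch, history, q=0.99):
    """Peak KV-cache tokens needed by `batch` over its remaining lifetime.

    batch: list of (prompt_len, tokens_generated_so_far) pairs.
    Assumes every request's total output is capped at the q-quantile
    of `history` (a stand-in for the paper's Past-Future estimate).
    """
    out_cap = predicted_total_len(history, q)
    # Remaining decode steps per request (0 if already at the cap).
    remaining = [max(out_cap - g, 0) for _, g in batch]
    horizon = max(remaining, default=0)
    peak = 0
    for t in range(horizon + 1):
        # At future step t, a still-running request occupies its prompt
        # plus min(g + t, out_cap) generated tokens; finished requests
        # have released their memory and no longer contribute.
        demand = sum(p + min(g + t, out_cap)
                     for (p, g), r in zip(batch, remaining) if t <= r)
        peak = max(peak, demand)
    return peak

def can_admit(batch, new_req, history, capacity_tokens, q=0.99):
    # Admit the new request only if the predicted peak still fits.
    return peak_token_demand(batch + [new_req], history, q) <= capacity_tokens
```

An aggressive scheduler would compare only the batch's *current* footprint against capacity (risking mid-flight evictions), while a conservative one would reserve the maximum possible output for every request (leaving memory idle); the peak-over-future-steps estimate sits between the two.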