Hierarchical Autoscaling for Large Language Model Serving with Chiron

πŸ“… 2025-01-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing cloud autoscalers for LLM serving overlook the distinct SLO requirements of interactive versus batch inference requests, leading to latency SLO violations and inefficient GPU utilization. This paper proposes a hierarchical adaptive scaling framework: the upper layer implements backpressure control by jointly estimating queue length, GPU utilization, and SLO slack (described as the first explicit integration of SLO modeling into scaling decisions), while the lower layer jointly optimizes GPU instance count and batch size. The approach combines hierarchical control, real-time queue monitoring, and SLO-aware feedback. Experiments show up to 90% higher SLO attainment, up to 70% better GPU efficiency, and significant reductions in latency violations and redundant resource overhead.
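The paper does not publish Chiron's backpressure formula here, but the idea of combining queue state, utilization, and SLO slack into a single scaling signal can be sketched as follows. This is a hypothetical illustration only: `PoolState`, the pressure weighting, and the thresholds are all invented for the example, not Chiron's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class PoolState:
    queue_len: int      # requests waiting in the pool's queue
    gpu_util: float     # average GPU utilization, 0.0 to 1.0
    slo_s: float        # per-request SLO in seconds
    est_wait_s: float   # estimated queueing delay for a newly arriving request

def backpressure(s: PoolState) -> float:
    """Combine SLO slack and utilization into one signal.

    Positive values indicate pressure to scale up; negative, to scale down.
    SLO slack dominates: if the estimated wait approaches the SLO, pressure
    rises even when utilization looks moderate, so interactive pools with
    tight SLOs react faster than batch pools with relaxed SLOs.
    """
    slo_pressure = s.est_wait_s / s.slo_s   # > 1.0 means the SLO is at risk
    util_pressure = s.gpu_util              # busy GPUs add pressure
    return slo_pressure + util_pressure - 1.0  # ~0.0 is the neutral point

def scaling_decision(s: PoolState, up: float = 0.5, down: float = -0.5) -> str:
    """Map the backpressure signal to a scale-up / hold / scale-down action."""
    bp = backpressure(s)
    if bp > up:
        return "scale_up"
    if bp < down:
        return "scale_down"
    return "hold"
```

With these illustrative thresholds, an interactive pool whose estimated wait already exceeds its 2-second SLO scales up, while a lightly loaded batch pool with an hour-scale SLO scales down, which matches the interactive/batch distinction the summary draws.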

πŸ“ Abstract
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs on the order of seconds, and (b) batch requests that have relaxed SLOs on the order of minutes to hours. These SLOs can degrade based on arrival rates, multiplexing, and configuration parameters, thus necessitating resource autoscaling of serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
Problem

Research questions and friction points this paper is trying to address.

Resource Allocation
Large Language Models
Response Time Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Resource Adjustment
Large Language Models
Cloud Service Efficiency