Hierarchical Autoscaling for Large Language Model Serving with Chiron

πŸ“… 2025-01-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing cloud autoscalers for LLM serving overlook the distinct SLO requirements of interactive versus batch inference requests, leading to latency SLO violations and inefficient GPU utilization. This paper proposes a hierarchical adaptive scaling framework: the upper layer implements backpressure control by jointly estimating queue length, GPU utilization, and SLO slack (described as the first explicit integration of SLO modeling into scaling decisions), while the lower layer jointly optimizes GPU instance count and batch size. The approach combines hierarchical control, real-time queue monitoring, and SLO-aware feedback. Experiments show up to 90% higher SLO attainment, up to 70% better GPU efficiency, and significant reductions in latency violations and redundant resource overhead.
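The paper does not publish Chiron's backpressure formula here, but the idea of combining queue state, utilization, and SLO slack into a single scaling signal can be sketched as follows. This is a hypothetical illustration only: `PoolState`, the pressure weighting, and the thresholds are all invented for the example, not Chiron's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class PoolState:
    queue_len: int      # requests waiting in the pool's queue
    gpu_util: float     # average GPU utilization, 0.0 to 1.0
    slo_s: float        # per-request SLO in seconds
    est_wait_s: float   # estimated queueing delay for a newly arriving request

def backpressure(s: PoolState) -> float:
    """Combine SLO slack and utilization into one signal.

    Positive values indicate pressure to scale up; negative, to scale down.
    SLO slack dominates: if the estimated wait approaches the SLO, pressure
    rises even when utilization looks moderate, so interactive pools with
    tight SLOs react faster than batch pools with relaxed SLOs.
    """
    slo_pressure = s.est_wait_s / s.slo_s   # > 1.0 means the SLO is at risk
    util_pressure = s.gpu_util              # busy GPUs add pressure
    return slo_pressure + util_pressure - 1.0  # ~0.0 is the neutral point

def scaling_decision(s: PoolState, up: float = 0.5, down: float = -0.5) -> str:
    """Map the backpressure signal to a scale-up / hold / scale-down action."""
    bp = backpressure(s)
    if bp > up:
        return "scale_up"
    if bp < down:
        return "scale_down"
    return "hold"
```

With these illustrative thresholds, an interactive pool whose estimated wait already exceeds its 2-second SLO scales up, while a lightly loaded batch pool with an hour-scale SLO scales down, which matches the interactive/batch distinction the summary draws.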

πŸ“ Abstract
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs on the order of seconds, and (b) batch requests that have relaxed SLOs on the order of minutes to hours. These SLOs can degrade based on arrival rates, multiplexing, and configuration parameters, thus necessitating resource autoscaling of serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.
Problem

Research questions and friction points this paper is trying to address.

Resource Allocation
Large Language Models
Response Time Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Resource Adjustment
Large Language Models
Cloud Service Efficiency