Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Modern cloud environments host heterogeneous large language model (LLM) inference workloads, comprising both latency-sensitive and latency-tolerant requests, which creates significant challenges for resource efficiency, cost control, and SLA/SLO compliance. Method: The paper proposes SAGESERVE, a serving framework with adaptive control knobs at multiple time scales that couples short-term request routing across data center regions with longer-term model redeployment onto existing VMs and GPU VM scale-out/in. Rather than siloing workload classes, SAGESERVE formulates resource allocation as an Integer Linear Program (ILP) and is evaluated through empirical and simulation studies driven by a production trace of over 8M requests, using four open-source models across three regions. Results: SAGESERVE saves up to 25% in GPU-hours and reduces scaling overhead by up to 80% versus baselines, while maintaining tail latency and satisfying all SLOs; in dollar terms this amounts to up to $2M in savings for a cloud provider over a month. Its core contribution is unified scheduling of GPU resources across multiple models, hardware configurations, and regions, substantially improving utilization of expensive GPU infrastructure.
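To make the long-horizon piece concrete, here is a minimal sketch of the kind of VM-allocation ILP the summary refers to, written with PuLP (a solver choice assumed here; the paper does not prescribe one). Every model name, region, demand figure, throughput, and price below is an illustrative placeholder, not a value from the paper.

```python
from pulp import LpMinimize, LpProblem, LpVariable, lpSum

MODELS = ["model-a", "model-b"]                  # hypothetical model pool
REGIONS = ["region-1", "region-2", "region-3"]   # hypothetical regions

# Illustrative inputs: forecast demand (req/s) per (model, region), the
# throughput (req/s) one GPU VM sustains per model, and VM price per hour.
demand = {
    ("model-a", "region-1"): 120, ("model-a", "region-2"): 40,
    ("model-a", "region-3"): 15,  ("model-b", "region-1"): 60,
    ("model-b", "region-2"): 90,  ("model-b", "region-3"): 30,
}
throughput = {"model-a": 25, "model-b": 40}
vm_price = {"region-1": 3.0, "region-2": 3.2, "region-3": 2.8}

prob = LpProblem("gpu_vm_allocation", LpMinimize)

# Decision variable: number of GPU VMs serving model m in region r.
x = {(m, r): LpVariable(f"x_{m}_{r}", lowBound=0, cat="Integer")
     for m in MODELS for r in REGIONS}

# Objective: minimize the total hourly dollar cost of provisioned VMs.
prob += lpSum(vm_price[r] * x[m, r] for m in MODELS for r in REGIONS)

# Capacity constraint standing in for the paper's SLO constraints:
# provisioned throughput must cover forecast demand with 20% headroom.
for m in MODELS:
    for r in REGIONS:
        prob += throughput[m] * x[m, r] >= 1.2 * demand[m, r]

prob.solve()
for (m, r), var in sorted(x.items()):
    print(f"{m} @ {r}: {int(var.value())} VMs")
```

The solver picks the cheapest integer VM counts that cover forecast demand; SAGESERVE's actual formulation additionally spans time, model redeployment, and scaling decisions.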

📝 Abstract
Large Language Model (LLM) inference workloads handled by global cloud providers can include both latency-sensitive and insensitive tasks, creating a diverse range of Service Level Agreement (SLA) requirements. Managing these mixed workloads is challenging due to the complexity of the inference stack, which includes multiple LLMs, hardware configurations, and geographic distributions. Current optimization strategies often silo these tasks to ensure that SLAs are met for latency-sensitive tasks, but this leads to significant under-utilization of expensive GPU resources despite the availability of spot and on-demand Virtual Machine (VM) provisioning. We propose SAGESERVE, a comprehensive LLM serving framework that employs adaptive control knobs at varying time scales, ensuring SLA compliance while maximizing the utilization of valuable GPU resources. Short-term optimizations include efficient request routing to data center regions, while long-term strategies involve scaling GPU VMs out/in and redeploying models to existing VMs to align with traffic patterns. These strategies are formulated as an optimization problem for resource allocation and solved using Integer Linear Programming (ILP). We perform empirical and simulation studies based on production workload traces with over 8M requests using four open-source models deployed across three regions. SAGESERVE achieves up to 25% savings in GPU-hours while maintaining tail latency and satisfying all SLOs, and it reduces the scaling overhead compared to baselines by up to 80%, confirming the effectiveness of our proposal. In terms of dollar cost, this can save cloud providers up to $2M over the course of a month.
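As a companion to the abstract's short-term knob, the sketch below shows one plausible millisecond-scale routing rule: send each request to the replica with the shortest estimated wait. This is an assumed stand-in, not the paper's routing algorithm, and all regions, service rates, and queue depths are made up.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    region: str
    service_rate: float   # requests/s this replica sustains
    queue_depth: int = 0  # outstanding requests, updated as we route

    def est_wait(self) -> float:
        # Crude latency proxy: outstanding work over service rate.
        return self.queue_depth / self.service_rate

def route(model: str, fleet: dict[str, list[Replica]]) -> Replica:
    """Send the request to the replica with the lowest estimated wait."""
    best = min(fleet[model], key=lambda rep: rep.est_wait())
    best.queue_depth += 1  # account for the request we just placed
    return best

# Illustrative usage: two regions host the same model.
fleet = {"model-a": [Replica("region-1", 25.0, queue_depth=3),
                     Replica("region-2", 25.0, queue_depth=1)]}
print(route("model-a", fleet).region)  # -> region-2 (shorter queue)
```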
Problem

Research questions and friction points this paper is trying to address.

Scheduling mixed latency-sensitive and latency-tolerant LLM workloads
Maximizing GPU resource utilization
Reducing cloud provider costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive control knobs at multiple time scales (see the autoscaling sketch after this list)
Integer Linear Programming
Efficient request routing
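The sketch below illustrates one way an adaptive scaling knob can damp churn: a hysteresis band around the planner's target so the fleet scales out promptly but scales in only on a clear drop. The margins are illustrative assumptions; the paper reports up to 80% lower scaling overhead, but these thresholds are not from it.

```python
def scale_decision(current_vms: int, needed_vms: int,
                   up_margin: float = 0.1, down_margin: float = 0.3) -> int:
    """Hysteresis band around the planner's target VM count to damp
    scaling churn. Margins are illustrative assumptions."""
    if needed_vms > current_vms * (1 + up_margin):
        return needed_vms   # scale out promptly to protect SLOs
    if needed_vms < current_vms * (1 - down_margin):
        return needed_vms   # scale in only on a clear drop in demand
    return current_vms      # hold steady: avoid thrashing

# Illustrative usage: small fluctuations are absorbed, big moves act.
print(scale_decision(current_vms=10, needed_vms=9))   # 10 (hold)
print(scale_decision(current_vms=10, needed_vms=5))   # 5  (scale in)
print(scale_decision(current_vms=10, needed_vms=12))  # 12 (scale out)
```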