🤖 AI Summary
To address the low resource utilization and severe performance fluctuations caused by static provisioning and model-level autoscaling in large language model (LLM) serving, this paper proposes an operator-granularity elastic scaling framework. It refines the unit of scaling from the whole model to the individual operator, showing that operators differ widely in their compute and memory footprints and in their sensitivity to workload characteristics such as batch size, sequence length, and traffic rate. Building on operator-level execution graph analysis, the framework jointly optimizes batch scheduling, resource allocation, and placement to adapt to dynamic traffic. Evaluated on real production traces, it meets SLOs with 40% fewer GPUs and 35% less energy; alternatively, under fixed resources, it delivers 1.6× higher throughput with 5% less energy.
📝 Abstract
Serving large generative models such as LLMs and multimodal transformers requires balancing user-facing SLOs (e.g., time-to-first-token, time-between-tokens) with provider goals of efficiency and cost reduction. Existing solutions rely on static provisioning or model-level autoscaling, both of which treat the model as a monolith. This coarse-grained resource management adapts poorly to the dynamic inference traffic common online, leading to degraded performance or significant resource underutilization. The root cause of this inefficiency lies in the internal structure of generative models: they execute as graphs of interconnected operators. Through detailed characterization and systematic analysis, we find that operators are heterogeneous in their compute and memory footprints and exhibit diverse sensitivity to workload and resource factors such as batch size, sequence length, and traffic rate. This heterogeneity suggests that the operator, rather than the entire model, is the right granularity for scaling decisions. We propose an operator-level autoscaling framework that allocates resources at operator granularity, optimizing scaling, batching, and placement based on individual operator profiles. Evaluated on production-scale traces, our approach preserves SLOs with up to 40% fewer GPUs and 35% less energy, or, under fixed resources, achieves 1.6× higher throughput with 5% less energy. These results show that the operator, rather than the model, is fundamentally a more effective unit for scaling large generative workloads.
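The paper's actual optimizer is not reproduced here, but a minimal sketch can illustrate why operator granularity saves resources. Assume each operator has a profiled per-replica capacity (the operator names and throughput numbers below are hypothetical, standing in for the paper's offline characterization): model-level scaling must replicate the entire graph to the rate of its bottleneck operator, while operator-level scaling provisions each operator independently.

```python
import math
from dataclasses import dataclass

@dataclass
class OperatorProfile:
    name: str
    # Requests/sec one replica sustains at its chosen batch size
    # (hypothetical values; real profiles come from offline characterization).
    capacity_per_replica: float

def operator_level_replicas(profiles, traffic_rps):
    """Scale each operator independently to cover the traffic rate."""
    return {p.name: math.ceil(traffic_rps / p.capacity_per_replica)
            for p in profiles}

def model_level_replicas(profiles, traffic_rps):
    """Monolithic scaling: the whole graph is replicated as one unit,
    so every operator gets as many copies as the slowest one needs."""
    bottleneck = min(p.capacity_per_replica for p in profiles)
    n = math.ceil(traffic_rps / bottleneck)
    return {p.name: n for p in profiles}

profiles = [
    OperatorProfile("attention", capacity_per_replica=40.0),   # memory-bound
    OperatorProfile("mlp",       capacity_per_replica=120.0),  # compute-bound
    OperatorProfile("embedding", capacity_per_replica=500.0),  # lightweight
]

traffic = 400.0  # requests/sec
print(operator_level_replicas(profiles, traffic))
# {'attention': 10, 'mlp': 4, 'embedding': 1}  -> 15 operator copies total
print(model_level_replicas(profiles, traffic))
# {'attention': 10, 'mlp': 10, 'embedding': 10} -> 30 operator copies total
```

In this toy setting the monolithic policy over-provisions every non-bottleneck operator to the attention operator's replica count, which is the kind of waste the operator-level framework is designed to eliminate; the paper's system additionally co-optimizes batching and placement, which this sketch omits.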