🤖 AI Summary
To address response latency, rigid thresholding, and the lack of GPU-level metric awareness in Kubernetes-based autoscaling for GPU inference workloads, this paper proposes KIS-S, the first intelligent autoscaling framework to integrate GPU-aware simulation with reinforcement learning. KIS-S pairs KISim, a high-fidelity GPU resource simulator that models real hardware behavior, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler that learns end-to-end scaling policies entirely in simulation. The framework introduces a composite reward function that jointly considers GPU utilization, GPU memory occupancy, and request latency. Crucially, KIScaler generalizes zero-shot across diverse traffic patterns (bursty, periodic, stepwise, and stochastic) without retraining. Experiments demonstrate a 75.2% average reward improvement over baselines and up to a 6.7× reduction in P95 latency compared to CPU-based scaling, significantly enhancing both resource efficiency and QoS guarantees.
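The composite reward described above can be sketched as follows. This is an illustrative assumption of the general shape (reward high GPU utilization, penalize excessive memory occupancy and latency-SLO violations); the paper's exact weights and functional form are not given here, so `w_util`, `w_mem`, `w_lat`, the 0.9 memory threshold, and the SLO value are all hypothetical.

```python
def composite_reward(gpu_util, mem_occupancy, p95_latency_s,
                     latency_slo_s=0.5,
                     w_util=1.0, w_mem=0.5, w_lat=2.0):
    """Scalar reward for one scaling step (illustrative sketch, not the
    paper's formula).

    gpu_util and mem_occupancy are fractions in [0, 1]; latency in seconds.
    """
    util_term = w_util * gpu_util                       # reward busy GPUs
    mem_term = -w_mem * max(0.0, mem_occupancy - 0.9)   # penalize near-full memory
    lat_term = -w_lat * max(0.0, p95_latency_s / latency_slo_s - 1.0)  # penalize SLO overrun
    return util_term + mem_term + lat_term
```

A PPO agent maximizing a reward of this shape is pushed toward the fewest replicas that keep P95 latency within the SLO, which is the latency/efficiency trade-off the summary describes.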
📝 Abstract
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive, threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation and is deployed directly without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x over CPU baselines, and generalizes to unseen traffic patterns without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.