🤖 AI Summary
To address response latency, rigid thresholding, and the lack of GPU-level metric awareness in Kubernetes-based autoscaling for GPU inference workloads, this paper proposes KIS-S, the first intelligent autoscaling framework to integrate GPU-aware simulation with reinforcement learning. KIS-S pairs KISim, a high-fidelity GPU resource simulator that models real hardware behavior, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler that learns end-to-end scaling policies entirely in simulation. The framework introduces a composite reward function that jointly considers GPU utilization, GPU memory occupancy, and request latency. Crucially, KIScaler generalizes zero-shot across diverse traffic patterns (bursty, periodic, stepwise, and stochastic) without retraining. Experiments demonstrate a 75.2% average reward improvement over baselines and up to a 6.7× reduction in P95 latency compared to CPU-based scaling, significantly enhancing both resource efficiency and QoS guarantees.
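The composite reward described above can be sketched as follows. This is an illustrative assumption of the general shape (reward high GPU utilization, penalize excessive memory occupancy and latency-SLO violations); the paper's exact weights and functional form are not given here, so `w_util`, `w_mem`, `w_lat`, the 0.9 memory threshold, and the SLO value are all hypothetical.

```python
def composite_reward(gpu_util, mem_occupancy, p95_latency_s,
                     latency_slo_s=0.5,
                     w_util=1.0, w_mem=0.5, w_lat=2.0):
    """Scalar reward for one scaling step (illustrative sketch, not the
    paper's formula).

    gpu_util and mem_occupancy are fractions in [0, 1]; latency in seconds.
    """
    util_term = w_util * gpu_util                       # reward busy GPUs
    mem_term = -w_mem * max(0.0, mem_occupancy - 0.9)   # penalize near-full memory
    lat_term = -w_lat * max(0.0, p95_latency_s / latency_slo_s - 1.0)  # penalize SLO overrun
    return util_term + mem_term + lat_term
```

A PPO agent maximizing a reward of this shape is pushed toward the fewest replicas that keep P95 latency within the SLO, which is the latency/efficiency trade-off the summary describes.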
📝 Abstract
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive, threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation and is deployed directly without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x over CPU baselines, and generalizes to unseen traffic patterns without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.