KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

📅 2025-07-10
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address response latency, rigid thresholding, and the lack of GPU-level metric awareness in Kubernetes-based auto-scaling for GPU inference workloads, this paper proposes KIS-S, the first intelligent auto-scaling framework integrating GPU-aware simulation with reinforcement learning. It combines KISim, a high-fidelity GPU resource simulator that models real hardware behavior, with KIScaler, an autoscaler that uses Proximal Policy Optimization (PPO) to learn end-to-end scaling policies. A composite reward function jointly considers GPU utilization, GPU memory occupancy, and request latency. Crucially, KIScaler achieves zero-shot generalization across diverse traffic patterns (bursty, periodic, stepwise, and stochastic) without retraining. Experiments demonstrate a 75.2% average reward improvement over baselines and up to a 6.7× reduction in P95 latency compared to CPU-based scaling, significantly improving both resource efficiency and QoS guarantees.
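The composite reward described above (jointly weighing GPU utilization, GPU memory occupancy, and request latency) can be sketched as a weighted sum. The weights, SLO target, and exact functional form below are illustrative assumptions, not values from the paper:

```python
# Illustrative sketch of a composite scaling reward in the spirit of the
# summary above. The weights, SLO, and term shapes are hypothetical.

def composite_reward(gpu_util: float, gpu_mem: float, p95_latency_ms: float,
                     latency_slo_ms: float = 200.0,
                     w_util: float = 0.4, w_mem: float = 0.2,
                     w_lat: float = 0.4) -> float:
    """Higher is better: reward efficient GPU use, penalize SLO violations."""
    util_term = gpu_util                    # in [0, 1]; fuller GPUs = less waste
    mem_term = gpu_mem                      # in [0, 1]
    # Latency term: positive when under the SLO, negative once it is exceeded.
    lat_term = 1.0 - (p95_latency_ms / latency_slo_ms)
    return w_util * util_term + w_mem * mem_term + w_lat * lat_term
```

A reward like this lets the PPO agent trade packing work onto fewer GPUs against keeping tail latency under the SLO, which is the tension the paper's reward is described as capturing.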

📝 Abstract
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation, and is directly deployed without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency up to 6.7x over CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.
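The abstract's key idea is that the scaling policy is trained entirely against a simulator and then deployed without retraining. The toy environment below illustrates that simulate-then-deploy loop; it is NOT KISim, and its arrival, capacity, and latency dynamics are simple hypothetical stand-ins:

```python
import random

# Toy stand-in for a GPU-aware inference simulator: one step = one control
# interval in which the agent scales replicas in, out, or holds. All dynamics
# below (arrival rate, per-replica capacity, latency model) are illustrative.

class ToyScalingEnv:
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.replicas = 1

    def reset(self):
        self.replicas = 1
        return self._obs(0.0)

    def step(self, action: int):
        # action: -1 scale in, 0 hold, +1 scale out (clamped to [1, 10])
        self.replicas = max(1, min(10, self.replicas + action))
        rps = 50 + 50 * self.rng.random()                # noisy request arrivals
        per_replica_load = rps / (self.replicas * 20.0)  # assume 20 rps capacity each
        latency_ms = 50.0 * max(1.0, per_replica_load)   # latency grows past saturation
        # Reward trades latency against replica (cost) footprint.
        reward = -latency_ms / 100.0 - 0.1 * self.replicas
        return self._obs(latency_ms), reward

    def _obs(self, latency_ms: float):
        return (self.replicas, latency_ms)
```

An RL agent (PPO in the paper's case) would roll out many episodes against such an environment, then the learned policy would be wired to the live cluster's metrics and scaling API, with no further training.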
Problem

Research questions and friction points this paper is trying to address.

Reactive, threshold-based autoscaling of GPU workloads in Kubernetes
Missing GPU-level metrics in current scaling mechanisms
Dynamic, bursty traffic patterns that challenge inference workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-aware Kubernetes Inference Simulator
PPO-based autoscaler for GPU workloads
Simulation-trained latency-aware scaling policies
Guilin Zhang
Department of Engineering Management and Systems Engineering, George Washington University, USA

Wulan Guo
Department of Engineering Management and Systems Engineering, George Washington University, USA

Ziqi Tan
Department of Engineering Management and Systems Engineering, George Washington University, USA

Qiang Guan
Kent State University
Research interests: Dependability and Reliability Analysis · Quantum Computing Systems · High Performance Computing · Failure Detection and Diagnosis

Hailong Jiang
Computer Science, Youngstown State University
Research interests: Fault Tolerance · HPC Systems · Compilers · Code Intelligence