🤖 AI Summary
Cloud data centers face challenges in containerized infrastructure resource orchestration, including low accuracy and poor stability, caused by vast configuration search spaces, high workload variability, and strong environmental noise. To address these challenges, we propose a resource orchestration framework that integrates Contextual Multi-Armed Bandits (CMAB) with the Ksurf variance-minimizing estimator. This work is the first to embed Ksurf in the Drone scheduler and to augment it with attention-based Kalman filtering, which dynamically suppresses nonlinear noise. Implemented on Kubernetes, the approach enables fine-grained, low-overhead, real-time resource configuration optimization. Evaluated on the VarBench benchmark, it reduces p95 and p99 latency variance by 41% and 47%, respectively; decreases CPU utilization by 4%; lowers master-node memory footprint by 7 MB; and reduces the average active Pod count by 7%. These improvements significantly enhance resource efficiency and cloud cost-effectiveness.
📝 Abstract
Resource orchestration and configuration parameter search are key concerns for container-based infrastructure in cloud data centers. The large configuration search space and cloud uncertainties are often mitigated with contextual bandit techniques for resource orchestration, including the state-of-the-art Drone orchestrator. Complexity in the cloud provider environment, caused by varying numbers of virtual machines, introduces variability into workloads and resource metrics, making orchestration decisions less accurate due to increased nonlinearity and noise. Ksurf, a state-of-the-art variance-minimizing estimator well suited to highly variable cloud data, enables optimal resource estimation under high cloud variability. This work evaluates the performance of Ksurf on estimation-based resource orchestration tasks involving highly variable workloads, employing it as the contextual multi-armed bandit objective function model in Drone. Ksurf lowers latency variance by 41% at p95 and 47% at p99, reduces CPU usage by 4% and master-node memory usage by 7 MB on Kubernetes, and yields a 7% cost saving in average worker pod count on the VarBench Kubernetes benchmark.
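To make the bandit-plus-estimator idea concrete, the sketch below shows a minimal multi-armed bandit whose per-arm reward model is a scalar Kalman filter, i.e., a simple variance-minimizing estimator standing in for Ksurf. This is an illustrative toy, not the paper's implementation: the arm count, reward values, noise levels, and the UCB-style exploration bonus are all assumptions, and the real system conditions decisions on workload context and uses attention-based Kalman filtering.

```python
import random

class KalmanArmEstimator:
    """Scalar Kalman filter tracking one arm's mean reward.

    Illustrative stand-in for a variance-minimizing estimator;
    the paper's Ksurf estimator is more sophisticated.
    """
    def __init__(self, process_var=1e-3, obs_var=1.0):
        self.mean = 0.0              # estimated reward for this arm
        self.var = 1.0               # uncertainty of the estimate
        self.process_var = process_var
        self.obs_var = obs_var

    def update(self, reward):
        # Predict step: uncertainty grows by the process noise.
        self.var += self.process_var
        # Correct step: Kalman gain weights the new observation
        # against the current estimate, shrinking the variance.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (reward - self.mean)
        self.var *= (1.0 - gain)

def select_arm(estimators, explore=2.0):
    """UCB-style selection: estimated mean plus an uncertainty bonus."""
    return max(range(len(estimators)),
               key=lambda i: estimators[i].mean
                             + explore * estimators[i].var ** 0.5)

# Toy loop: three hypothetical resource configurations with noisy rewards.
random.seed(0)
true_rewards = [0.3, 0.7, 0.5]       # assumed ground-truth arm values
arms = [KalmanArmEstimator() for _ in true_rewards]
for _ in range(500):
    i = select_arm(arms)
    arms[i].update(true_rewards[i] + random.gauss(0, 0.2))

best = max(range(len(arms)), key=lambda i: arms[i].mean)
print(best)  # the bandit should settle on the highest-reward arm
```

Because the Kalman update discounts noisy observations in proportion to their variance, the arm estimates stay stable under measurement noise, which is the property that motivates pairing a variance-minimizing estimator with the bandit's objective function.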