🤖 AI Summary
In multi-tenant public clouds, performance interference among virtual machines (VMs) contending for shared hardware resources, such as last-level cache and memory bandwidth, is difficult to predict accurately, primarily because workloads are highly dynamic and VMs are black boxes to the provider. This paper proposes CloudFormer, a dual-branch Transformer that jointly encodes 206 high-resolution system metrics, spanning static configuration and dynamic runtime monitoring, and uses self-attention to explicitly capture transient resource-contention patterns. By modeling temporal dynamics and system-level interactions in separate branches, CloudFormer generalizes to previously unseen workloads without scenario-specific tuning. The authors also release a fine-grained cloud-interference dataset that exceeds existing benchmarks in temporal resolution and metric diversity. Experiments show CloudFormer achieves a mean absolute error of only 7.8%, improving prediction accuracy by at least 28% over state-of-the-art methods and substantially enhancing both accuracy and generalizability in interference forecasting.
📝 Abstract
Cloud platforms are increasingly relied upon to host diverse, resource-intensive workloads due to their scalability, flexibility, and cost-efficiency. In multi-tenant cloud environments, virtual machines are consolidated on shared physical servers to improve resource utilization. While virtualization guarantees resource partitioning for CPU, memory, and storage, it cannot ensure performance isolation. Competition for shared resources such as last-level cache, memory bandwidth, and network interfaces often leads to severe performance degradation. Existing management techniques, including VM scheduling and resource provisioning, require accurate performance prediction to mitigate interference. However, this remains challenging in public clouds due to the black-box nature of VMs and the highly dynamic nature of workloads. To address these limitations, we propose CloudFormer, a dual-branch Transformer-based model designed to predict VM performance degradation in black-box environments. CloudFormer jointly models temporal dynamics and system-level interactions, leveraging 206 system metrics at one-second resolution across both static and dynamic scenarios. This design enables the model to capture transient interference effects and adapt to varying workload conditions without scenario-specific tuning. Complementing the methodology, we provide a fine-grained dataset that significantly expands the temporal resolution and metric diversity compared to existing benchmarks. Experimental results demonstrate that CloudFormer consistently outperforms state-of-the-art baselines across multiple evaluation metrics, achieving robust generalization across diverse and previously unseen workloads. Notably, CloudFormer attains a mean absolute error (MAE) of just 7.8%, representing a substantial improvement in predictive accuracy and outperforming existing methods by at least 28%.
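To make the dual-branch idea concrete, the sketch below shows one plausible reading of the architecture: one self-attention pass across time steps (temporal dynamics) and one across the 206 metrics (system-level interactions), with the two branch summaries fused into a scalar degradation score. This is an illustrative NumPy sketch only, not the paper's implementation; the function names, the parameter-free attention (learned query/key/value projections and the regression head are omitted), and the mean-pooling fusion are all assumptions for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # single-head scaled dot-product self-attention over the rows of X;
    # learned Q/K/V projections are omitted in this sketch
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

def predict_degradation(window):
    """window: (T, M) array of T one-second samples of M system metrics."""
    temporal = self_attention(window)          # branch 1: attends across time steps
    interaction = self_attention(window.T).T   # branch 2: attends across metrics
    # fuse both branch summaries; a learned regression head would go here
    fused = np.concatenate([temporal.mean(axis=0), interaction.mean(axis=0)])
    w = np.ones_like(fused) / fused.size       # stand-in for learned weights
    return float(fused @ w)                    # predicted degradation score
```

A 30-second window of the 206 metrics, for example, would be passed as a `(30, 206)` array; in the paper's setting the output would be trained against measured performance degradation.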