Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms

📅 2025-01-19
🤖 AI Summary
Cloud platforms remain underutilized: CPU is the most underutilized resource, but all resources need to be managed holistically. To address this, the paper proposes Coach, an all-resource oversubscription system (with a focus on memory) that exploits temporally complementary usage patterns among virtual machines (VMs). It introduces CoachVM, a new general-purpose VM type that partitions each resource allocation into a guaranteed portion and an oversubscribed portion. Coach combines long-term resource demand predictions with an efficient VM scheduling policy, and it monitors the oversubscribed resources to detect contention and mitigate potential performance degradation. In the authors' experiments, Coach enables platforms to host up to ~26% more VMs with minimal performance degradation.
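The split of each resource into a guaranteed part and an oversubscribed (overcommitted) part can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the class name `CoachVMAllocation`, the helper functions, the memory sizes, and the simple rule that contention occurs when usage above the guaranteed portions exceeds the host memory left over after all guarantees are reserved.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class CoachVMAllocation:
    """Hypothetical memory allocation for one VM, split into two portions."""
    name: str
    guaranteed_gb: int      # always backed by host memory
    oversubscribed_gb: int  # backed only if the shared pool has room


def shared_pool_gb(host_capacity_gb: int, vms: Sequence[CoachVMAllocation]) -> int:
    """Host memory left for oversubscribed portions after reserving all guarantees."""
    guaranteed = sum(vm.guaranteed_gb for vm in vms)
    if guaranteed > host_capacity_gb:
        raise ValueError("guaranteed portions alone exceed host capacity")
    return host_capacity_gb - guaranteed


def is_contended(
    host_capacity_gb: int,
    vms: Sequence[CoachVMAllocation],
    usage_gb: Mapping[str, int],
) -> bool:
    """Assumed contention rule: usage above the guarantees overflows the shared pool."""
    pool = shared_pool_gb(host_capacity_gb, vms)
    spill = sum(max(0, usage_gb[vm.name] - vm.guaranteed_gb) for vm in vms)
    return spill > pool


# Three 16 GB VMs (8 guaranteed + 8 oversubscribed) on a 32 GB host:
# 48 GB is promised in total, but only 24 GB is strictly reserved.
vms = [CoachVMAllocation(n, 8, 8) for n in ("a", "b", "c")]
pool = shared_pool_gb(32, vms)
quiet = is_contended(32, vms, {"a": 10, "b": 9, "c": 8})
busy = is_contended(32, vms, {"a": 14, "b": 13, "c": 10})
print(pool, quiet, busy)
```

In this toy setup the host nominally promises more memory than it has, which is only safe while the VMs do not all exceed their guaranteed portions at the same time; a monitor would need to react when they do.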

📝 Abstract
Cloud platforms remain underutilized despite multiple proposals to improve their utilization (e.g., disaggregation, harvesting, and oversubscription). Our characterization of the resource utilization of virtual machines (VMs) in Azure reveals that, while CPU is the main underutilized resource, we need to provide a solution to manage all resources holistically. We also observe that many VMs exhibit complementary temporal patterns, which can be leveraged to improve the oversubscription of underutilized resources. Based on these insights, we propose Coach: a system that exploits temporal patterns for all-resource oversubscription in cloud platforms. Coach uses long-term predictions and an efficient VM scheduling policy to exploit temporally complementary patterns. We introduce a new general-purpose VM type, called CoachVM, where we partition each resource allocation into a guaranteed and an oversubscribed portion. Coach monitors the oversubscribed resources to detect contention and mitigate any potential performance degradation. We focus on memory management, which is particularly challenging due to memory's sensitivity to contention and the overhead required to reassign it between CoachVMs. Our experiments show that Coach enables platforms to host up to ~26% more VMs with minimal performance degradation.
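To make the idea of temporally complementary patterns concrete, here is a small Python sketch. The demand traces, the function names, and the "sum of peaks versus peak of the sum" metric are illustrative assumptions, not the scheduling policy described in the paper.

```python
from typing import Sequence


def combined_peak(a: Sequence[int], b: Sequence[int]) -> int:
    """Peak of the summed demand when two VMs share a host."""
    if len(a) != len(b):
        raise ValueError("traces must cover the same time slots")
    return max(x + y for x, y in zip(a, b))


def complementarity_savings(a: Sequence[int], b: Sequence[int]) -> int:
    """Capacity saved by sizing for the combined peak instead of each VM's own peak."""
    return max(a) + max(b) - combined_peak(a, b)


# Hypothetical memory demand (GB) over six time slots.
day_vm = [2, 6, 8, 8, 3, 2]    # busy in the middle of the period
night_vm = [8, 3, 2, 2, 7, 8]  # busy at the start and end

savings_complementary = complementarity_savings(day_vm, night_vm)
savings_identical = complementarity_savings(day_vm, day_vm)
print(combined_peak(day_vm, night_vm), savings_complementary, savings_identical)
```

Sizing each VM for its own peak would reserve 16 GB, while the two complementary traces never need more than 10 GB together. Two VMs with identical patterns yield no savings, which is why a scheduler that co-locates VMs with offset peaks can pack more of them onto a host.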
Problem

Research questions and friction points this paper is trying to address.

Cloud Platform Efficiency
CPU Utilization
Resource Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cloud Platform Resource Optimization
Predictive Intelligent Scheduling
Enhanced Virtual Machine Capacity
Benjamin Reidys
University of Illinois at Urbana-Champaign
Systems and Networking
Pantea Zardoshti
Senior Research SDE
Distributed Systems, Cloud Computing, Parallel Computing, Concurrency
Íñigo Goiri
Research Software Developer, Azure Systems Research
Energy management, Virtualization, Green Computing, Distributed Systems
Celine Irvene
Georgia Institute of Technology
BAS security, Internet of Things, CPS security, Networking
Daniel S. Berger
Microsoft
Distributed systems, cloud efficiency, datacenters, systems, performance evaluation
Haoran Ma
PhD Student, University of California, Los Angeles
Computer Systems, Software Engineering
Kapil Arya
NVIDIA
Operating Systems, Virtualization, Checkpointing, High Performance Computing
Eli Cortez
Microsoft
MLSys, Distributed Systems, Databases, World Wide Web
Taylor Stark
Microsoft Redmond, USA
Eugene Bak
Microsoft Redmond, USA
Mehmet Iyigun
Microsoft Redmond, USA
Stanko Novaković
Google Mountain View, USA; Microsoft (former)
Lisa Hsu
Meta Menlo Park, USA; Microsoft (former)
Karel Trueba
Microsoft Redmond, USA
Abhisek Pan
Principal Software Engineer, Azure Compute, Microsoft
Computer Architecture, Cloud Computing, Parallel Programming
Chetan Bansal
Microsoft
AI Agents, Distributed Systems, Software Engineering
Saravan Rajmohan
Microsoft Redmond, USA
Jian Huang
University of Illinois Urbana-Champaign, USA
Ricardo Bianchini
Technical Fellow, Corporate VP at Microsoft Azure
Cloud computing, Datacenters, Efficiency, Power management, Sustainability