A Performance Analyzer for a Public Cloud's ML-Augmented VM Allocator

📅 2025-12-08
🤖 AI Summary
Cloud operators struggle to diagnose end-to-end performance bottlenecks when several ML models are deployed together in one system. This paper introduces SANJESH, a performance analysis tool for ML-augmented systems in public clouds. Its core is a bi-level optimization framework, paired with new techniques for solving it efficiently, that answers performance queries about how each model, and the interaction between models, affects VM allocation. In the VM-placement case study, SANJESH finds scenarios with roughly 4× worse performance than simulation-based approaches detect. It also solves, in minutes, an optimization that prior work could not solve within 24 hours. The resulting insights concern how many servers the operator needs and how often VMs must be live-migrated.

📝 Abstract
Many operational cloud systems use one or more machine learning models that help them achieve better efficiency and performance. But operators do not have tools to help them understand how each model and the interaction between them affect the end-to-end system performance. SANJESH is such a tool. SANJESH supports a diverse set of performance-related queries which we answer through a bi-level optimization. We invent novel mechanisms to solve this optimization more quickly. These techniques allow us to solve an optimization that prior work failed to solve even after $24$ hours. As a proof of concept, we apply SANJESH to an example production system that uses multiple ML models to optimize virtual machine (VM) placement. These models impact how many servers the operator uses to host VMs and the frequency with which it has to live-migrate them because the servers run out of resources. SANJESH finds scenarios where these models cause $\sim 4\times$ worse performance than what simulation-based approaches detect.
Problem

Research questions and friction points this paper is trying to address.

Analyzes ML models' impact on cloud system performance.
Answers performance queries about an ML-augmented VM allocator using bi-level optimization.
Identifies performance issues missed by simulation-based methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SANJESH uses bi-level optimization for performance queries.
Novel mechanisms accelerate solving previously intractable optimization.
Applies to ML-augmented VM allocators to detect performance issues.
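
To give a feel for what a bi-level performance query looks like, here is a toy sketch in Python. It is not SANJESH's formulation or solver. It swaps in brute-force enumeration for a real optimizer, and every name and number in it (`CAPACITY`, `first_fit`, `optimal_servers`, `worst_case_gap`, the size and error ranges) is a hypothetical chosen for illustration. The outer level searches over VM sizes and over-predictions from a size-prediction model. The inner level compares the servers a first-fit allocator uses on the predicted sizes against the minimum number of servers needed for the true sizes.

```python
from itertools import product

CAPACITY = 10  # hypothetical server capacity, arbitrary units


def first_fit(sizes, capacity=CAPACITY):
    """Allocator heuristic: place each VM on the first server with room."""
    free = []  # remaining capacity of each open server
    for s in sizes:
        for i, f in enumerate(free):
            if s <= f:
                free[i] = f - s
                break
        else:
            free.append(capacity - s)  # open a new server
    return len(free)


def optimal_servers(sizes, capacity=CAPACITY):
    """Inner minimization: fewest servers, by exhaustive search (tiny inputs only)."""
    best = len(sizes)  # one server per VM always works

    def place(i, free):
        nonlocal best
        if len(free) >= best:
            return  # cannot improve on the best packing found so far
        if i == len(sizes):
            best = len(free)
            return
        for j in range(len(free)):
            if sizes[i] <= free[j]:
                free[j] -= sizes[i]
                place(i + 1, free)
                free[j] += sizes[i]
        free.append(capacity - sizes[i])
        place(i + 1, free)
        free.pop()

    place(0, [])
    return best


def worst_case_gap(n_vms=4, sizes=(3, 4, 5), errors=(0, 1, 2)):
    """Outer maximization: find true sizes and prediction errors with the largest gap."""
    best = (-1, None, None)
    for true in product(sizes, repeat=n_vms):
        opt = optimal_servers(true)
        for err in product(errors, repeat=n_vms):
            predicted = [t + e for t, e in zip(true, err)]
            gap = first_fit(predicted) - opt
            if gap > best[0]:
                best = (gap, true, err)
    return best


if __name__ == "__main__":
    gap, true_sizes, pred_errors = worst_case_gap()
    print("extra servers:", gap, "sizes:", true_sizes, "errors:", pred_errors)
```

Enumeration like this grows exponentially with the number of VMs, which is why a tool aimed at production-scale systems needs an optimization formulation and faster solving techniques rather than exhaustive search.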