🤖 AI Summary
Co-locating multiple jobs on shared machines in modern datacenters improves resource utilization but causes performance interference when antagonistic jobs overconsume shared resources. Existing antagonist-detection methods either depend on costly, unscalable offline profiling or sample from production, where they suffer from noisy measurements and fail in multi-victim scenarios. This paper proposes PANDA, a lightweight, noise-resilient online antagonist identification framework. First, it introduces a machine-level cycles-per-instruction (CPI) metric that captures shared-resource contention among multiple co-located tasks; second, it leverages global historical knowledge across all machines to suppress sampling noise. Evaluated on a recent Google production trace, PANDA raises the average suspicion percentile of true antagonists from 50–55% to 82.6% and identifies antagonists consistently under multi-victim scenarios, with negligible runtime overhead. The result is a practical approach to performance interference management in large-scale clusters.
📝 Abstract
Modern warehouse-scale datacenters commonly co-locate multiple jobs on shared machines to improve resource utilization. However, such co-location often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods, improving the average suspicion percentile from 50-55% to 82.6%, and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.
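To make the correlation-based idea concrete, the following is a minimal sketch and not PANDA's actual algorithm, which the abstract does not specify. It assumes that a machine-level CPI can be formed as total cycles divided by total instructions over all co-located tasks in a sampling window, and that each suspect is scored by the Pearson correlation between its CPU usage and that machine-level CPI. The function names, the aggregation rule, and the sample data are all hypothetical, and the noise suppression based on global historical knowledge is not modeled here.

```python
from statistics import mean


def machine_level_cpi(task_samples):
    """CPI aggregated over all tasks on one machine for one sampling window.

    task_samples: list of (cycles, instructions) pairs, one per task.
    ASSUMPTION: aggregate = total cycles / total instructions; the paper's
    exact definition is not given in the abstract.
    """
    total_cycles = sum(c for c, _ in task_samples)
    total_instructions = sum(i for _, i in task_samples)
    return total_cycles / total_instructions


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_suspects(cpi_series, usage_by_task):
    """Order tasks from most to least suspicious (hypothetical scoring rule):
    a task whose CPU usage rises together with machine-level CPI ranks higher."""
    scores = {t: pearson(u, cpi_series) for t, u in usage_by_task.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative data only: four sampling windows, two measured tasks per window.
windows = [
    [(100, 100), (200, 100)],
    [(150, 100), (250, 100)],
    [(100, 100), (200, 100)],
    [(200, 100), (300, 100)],
]
cpi_series = [machine_level_cpi(w) for w in windows]

# CPU usage of three hypothetical suspect tasks over the same windows.
usage_by_task = {
    "batch": [0.2, 0.5, 0.2, 0.8],  # rises when machine CPI rises
    "web":   [0.3, 0.3, 0.3, 0.3],  # constant, so uncorrelated
    "cache": [0.6, 0.4, 0.6, 0.2],  # falls when machine CPI rises
}
ranking = rank_suspects(cpi_series, usage_by_task)
print(ranking)
```

In this toy data the task whose usage tracks the machine-level CPI is ranked first. Scoring against a machine-level signal rather than a single victim's CPI is what lets one ranking cover several victims at once, which is the motivation the abstract gives for the metric.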