PANDA: Noise-Resilient Antagonist Identification in Production Datacenters

📅 2025-11-11
🤖 AI Summary
Performance interference caused by co-locating heterogeneous workloads in modern datacenters degrades resource efficiency and system stability. Existing antagonist detection methods either incur high overhead through offline profiling or break down under sampling noise and multi-victim scenarios. This paper proposes PANDA, a lightweight, noise-resilient online detection method. First, it constructs a machine-level cycles-per-instruction (CPI) metric that captures shared-resource contention among co-located tasks. Second, it draws on global historical knowledge across all machines to suppress measurement noise, which keeps antagonist identification consistent when several victims are affected at once. Evaluated on a recent Google production trace, the approach raises the average suspicion percentile of true antagonists from 50–55% to 82.6%, with negligible runtime overhead.
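
To make the headline number easier to read, here is a minimal sketch of one way a suspicion percentile could be computed. The definition used (share of the other suspects scored below the true antagonist, with ties counted as half) and the function name `suspicion_percentile` are assumptions for illustration; this summary does not give the paper's exact definition.

```python
def suspicion_percentile(scores: dict[str, float], true_antagonist: str) -> float:
    """Percentage of the other suspects scored below the true antagonist.

    Assumed definition for illustration only: ties count as half a suspect.
    """
    target = scores[true_antagonist]
    others = [s for job, s in scores.items() if job != true_antagonist]
    if not others:
        raise ValueError("need at least one other suspect to compute a percentile")
    below = sum(1 for s in others if s < target)
    ties = sum(1 for s in others if s == target)
    return 100.0 * (below + 0.5 * ties) / len(others)


# Hypothetical suspicion scores for five co-located jobs.
example_scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1, "e": 0.7}
example_percentile = suspicion_percentile(example_scores, "e")
```

Under this assumed definition, a detector that cannot separate the true antagonist from the other suspects (all scores tied) lands at 50%, which is one way to read the 50–55% baseline reported above.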

📝 Abstract
Modern warehouse-scale datacenters commonly collocate multiple jobs on shared machines to improve resource utilization. However, such collocation often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods -- improving average suspicion percentile from 50-55% to 82.6% -- and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.
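
The abstract refers to correlation-based methods that use CPI as the performance metric. The sketch below illustrates that general idea only: it computes CPI from cycle and instruction counts, then ranks suspects by how strongly their resource usage correlates with a victim's CPI over the same sampling windows. Using CPU usage as the suspect signal, the Pearson coefficient as the score, and the names `cpi`, `pearson`, and `rank_suspects` are all assumptions for illustration, not PANDA's actual scoring.

```python
import math


def cpi(cycles: float, instructions: float) -> float:
    """Cycles per instruction; higher values mean slower progress per instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(
    victim_cpi: list[float], suspect_usage: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Rank suspects from most to least correlated with the victim's CPI."""
    scored = [(job, pearson(victim_cpi, usage)) for job, usage in suspect_usage.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical samples: the victim's CPI rises whenever "hog" is busy.
victim = [cpi(100, 100), cpi(200, 100), cpi(100, 100), cpi(200, 100)]
usage = {
    "hog": [0.1, 0.9, 0.1, 0.9],
    "quiet": [0.5, 0.5, 0.5, 0.5],
    "anti": [0.9, 0.1, 0.9, 0.1],
}
ranking = rank_suspects(victim, usage)
```

A per-victim score like this depends on a handful of noisy samples from one machine, which is the weakness the abstract attributes to sample-from-production approaches.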
Problem

Research questions and friction points this paper is trying to address.

Identifies performance-interfering jobs in shared datacenters
Addresses noisy measurements in multi-victim collocation scenarios
Improves detection accuracy without costly offline profiling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses global historical knowledge to suppress noise
Introduces machine-level CPI for multi-victim scenarios
Achieves accurate antagonist identification with low overhead
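
The two ideas listed above can be sketched roughly as follows. This is a hedged illustration, not the paper's formulation: the aggregate here is assumed to be total cycles divided by total instructions over all tasks on a machine, and the use of history is assumed to be a z-score of an observation against CPI samples collected across all machines. The names `machine_level_cpi` and `deviation_from_history` are hypothetical.

```python
from statistics import mean, pstdev


def machine_level_cpi(task_counters: list[tuple[float, float]]) -> float:
    """Aggregate CPI over all tasks on one machine.

    Each entry is (cycles, instructions) for one task. Assumed definition:
    total cycles divided by total instructions, i.e. an instruction-weighted
    average of the per-task CPI values.
    """
    total_cycles = sum(cycles for cycles, _ in task_counters)
    total_instructions = sum(instructions for _, instructions in task_counters)
    if total_instructions <= 0:
        raise ValueError("total instructions must be positive")
    return total_cycles / total_instructions


def deviation_from_history(observed: float, history: list[float]) -> float:
    """Z-score of an observation against historical samples from all machines."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0
    return (observed - mean(history)) / sigma


# Hypothetical counters for two co-located tasks and a small global history.
current = machine_level_cpi([(100.0, 100.0), (300.0, 100.0)])
z = deviation_from_history(4.0, [1.0, 3.0])
```

Pooling samples across machines gives a much larger reference set than any single machine provides, which is the intuition behind using global history to dampen sampling noise.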
Authors
Sixiang Zhou
Purdue University
Nan Deng
Google Inc.
Krzysiek Rzadca
Google Inc.
Xiaojun Lin
Professor of Information Engineering, CUHK, and Professor of ECE, Purdue University (on leave)
Communication networks, smart grid, learning theory
Y. C. Hu
Purdue University