PANDA: Noise-Resilient Antagonist Identification in Production Datacenters

📅 2025-11-11

🤖 AI Summary

Co-locating heterogeneous workloads on shared machines in modern datacenters causes performance interference, which degrades resource efficiency and system stability. Existing antagonist-detection methods either incur high overhead through offline profiling or sample from production and then fail under measurement noise and multi-victim scenarios. This paper proposes PANDA, a lightweight online detection framework built on two ideas: first, it introduces a machine-level cycles-per-instruction (CPI) metric that captures shared-resource contention among multiple co-located tasks; second, it uses global historical knowledge across all machines to suppress sampling noise. Evaluated on a recent Google production trace, PANDA raises the average suspicion percentile of true antagonists from 50–55% to 82.6% and identifies antagonists consistently under multi-victim scenarios, with negligible runtime overhead. The authors present it as a practical, scalable way to manage performance interference in large-scale clusters.

📝 Abstract

Modern warehouse-scale datacenters commonly collocate multiple jobs on shared machines to improve resource utilization. However, such collocation often leads to performance interference caused by antagonistic jobs that overconsume shared resources. Existing antagonist-detection approaches either rely on offline profiling, which is costly and unscalable, or use a sample-from-production approach, which suffers from noisy measurements and fails under multi-victim scenarios. We present PANDA, a noise-resilient antagonist identification framework for production-scale datacenters. Like prior correlation-based methods, PANDA uses cycles per instruction (CPI) as its performance metric, but it differs by (i) leveraging global historical knowledge across all machines to suppress sampling noise and (ii) introducing a machine-level CPI metric that captures shared-resource contention among multiple co-located tasks. Evaluation on a recent Google production trace shows that PANDA ranks true antagonists far more accurately than prior methods -- improving average suspicion percentile from 50-55% to 82.6% -- and achieves consistent antagonist identification under multi-victim scenarios, all with negligible runtime overhead.
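To make the headline numbers easier to read, here is a minimal sketch of how a suspicion percentile could be computed. It assumes one common definition: the share of other candidate jobs that score lower than the true antagonist, with ties counted as half. The paper's exact definition is not given on this page, and the function name, job names, and scores below are hypothetical. Under this definition a ranking that is no better than chance averages about 50%, which is roughly where the prior methods sit.

```python
def suspicion_percentile(scores, true_antagonist):
    """Percentile rank (0-100) of the true antagonist among all candidates.

    scores: dict mapping candidate job name -> suspicion score
            (higher means more suspect). All names here are hypothetical.
    Assumed definition: fraction of *other* candidates scored below the
    true antagonist, counting ties as half.
    """
    target = scores[true_antagonist]
    others = [s for name, s in scores.items() if name != true_antagonist]
    if not others:
        return 100.0  # a lone candidate is trivially ranked first
    below = sum(1 for s in others if s < target)
    ties = sum(1 for s in others if s == target)
    return 100.0 * (below + 0.5 * ties) / len(others)


# Made-up suspicion scores for five co-located jobs.
scores = {"job_a": 0.10, "job_b": 0.75, "job_c": 0.40, "job_d": 0.90, "job_e": 0.20}
print(suspicion_percentile(scores, "job_b"))  # 75.0: three of the four other jobs score lower
```

Averaging this value over many interference incidents gives a single figure comparable in spirit to the 82.6% reported above.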

Problem

Research questions and friction points this paper is trying to address.

Identifies performance-interfering jobs in shared datacenters

Addresses noisy measurements in multi-victim collocation scenarios

Improves detection accuracy without costly offline profiling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses global historical knowledge to suppress noise

Introduces machine-level CPI for multi-victim scenarios

Achieves accurate antagonist identification with low overhead
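The points above can be illustrated with a small sketch of correlation-based ranking against a machine-level CPI. This is not PANDA's algorithm, which this page does not spell out. It assumes that machine-level CPI is total cycles divided by total instructions over the co-located tasks, and that suspects are ranked by the Pearson correlation between their CPU usage and that CPI. The noise-suppression step based on global history is not shown. All function names and data are hypothetical.

```python
import math


def machine_cpi(samples):
    """Aggregate CPI for one machine in one sampling window.

    samples: list of (cycles, instructions) pairs, one per co-located task.
    Assumption: aggregate as total cycles / total instructions.
    """
    total_cycles = sum(c for c, _ in samples)
    total_instr = sum(i for _, i in samples)
    return total_cycles / total_instr if total_instr else float("nan")


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Four sampling windows; each lists (cycles, instructions) for two victim tasks.
windows = [
    [(1000, 1000), (2000, 2000)],
    [(1500, 1000), (3000, 2000)],
    [(2000, 1000), (4000, 2000)],
    [(1200, 1000), (2400, 2000)],
]
cpi_series = [machine_cpi(w) for w in windows]

# CPU usage of two suspect jobs over the same windows (made-up values).
usage = {
    "job_a": [0.1, 0.5, 0.9, 0.2],  # rises and falls with machine CPI
    "job_b": [0.6, 0.6, 0.5, 0.7],  # roughly flat
}

# Rank suspects by how strongly their usage tracks machine-level CPI.
ranking = sorted(
    ((job, pearson(series, cpi_series)) for job, series in usage.items()),
    key=lambda item: item[1],
    reverse=True,
)
for job, corr in ranking:
    print(f"{job}: correlation with machine CPI = {corr:+.2f}")
```

Using one CPI series per machine, rather than one per victim, means several victims on the same machine contribute to a single contention signal instead of producing separate and possibly conflicting rankings.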
