🤖 AI Summary
Existing root cause analysis (RCA) models exhibit inflated performance on simplified benchmarks and fail to generalize to real-world cloud-native microservice failures. Method: We systematically identify fundamental flaws in mainstream benchmarks along three dimensions (call graph realism, fault diversity, and SLI consistency) and construct the first large-scale, realistic failure dataset of 1,430 SLI-validated cases spanning 25 fault types, with hierarchical root-cause annotations and dynamic workload simulation. Our generative framework integrates realistic call graph design, multi-granularity fault injection, and telemetry pattern modeling. Contribution/Results: Re-evaluating 11 state-of-the-art models and rule-based baselines on this dataset reveals a mean Top@1 accuracy of only 0.21 (0.37 at best), together with significantly increased inference latency, exposing critical, shared bottlenecks in interpretability, robustness, and efficiency.
📝 Abstract
While cloud-native microservice architectures have transformed software development, their complexity makes Root Cause Analysis (RCA) both crucial and challenging. Although many data-driven RCA models have been proposed, we find that existing benchmarks are often oversimplified and fail to capture real-world conditions. Our preliminary study shows that simple rule-based methods can match or even outperform state-of-the-art (SOTA) models on four widely used benchmarks, suggesting performance overestimation due to benchmark simplicity. To address this, we systematically analyze popular RCA benchmarks and identify key limitations in fault injection, call graph design, and telemetry patterns. Based on these insights, we develop an automated framework to generate more realistic benchmarks, yielding a dataset of 1,430 validated failure cases from 9,152 injections, covering 25 fault types under dynamic workloads with hierarchical ground-truth labels and verified SLI impact. Re-evaluation of 11 SOTA models on this dataset shows low Top@1 accuracy (average 0.21, best 0.37) and significantly longer execution times. Our analysis highlights three common failure patterns: scalability issues, observability blind spots, and modeling bottlenecks.
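The Top@1 metric reported above can be made concrete with a small sketch. The function and case structure below are illustrative assumptions, not the paper's actual evaluation code: a prediction counts as correct at k if the annotated root cause appears among the model's top-k ranked candidates.

```python
def top_k_accuracy(cases, k=1):
    """Fraction of cases whose ground-truth root cause appears in the top-k ranking.

    `cases` is a list of (ranked_candidates, true_root_cause) pairs;
    this structure is a hypothetical stand-in for a real RCA benchmark's format.
    """
    if not cases:
        return 0.0
    hits = sum(1 for ranking, truth in cases if truth in ranking[:k])
    return hits / len(cases)

# Toy example: three failure cases over made-up service names.
cases = [
    (["checkout", "payment", "db"], "checkout"),  # hit at rank 1
    (["payment", "db", "checkout"], "db"),        # miss at k=1, hit at k=2
    (["db", "payment", "cart"], "cart"),          # miss at both k=1 and k=2
]
print(top_k_accuracy(cases, k=1))  # 1/3
print(top_k_accuracy(cases, k=2))  # 2/3
```

Averaging this score over all 1,430 cases per model is what yields the reported mean of 0.21; the gap between Top@1 and higher-k variants is one way benchmarks surface ranking quality beyond exact-match accuracy.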