An Empirical Study of SOTA RCA Models: From Oversimplified Benchmarks to Realistic Failures

📅 2025-10-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing root cause analysis (RCA) models exhibit inflated performance on simplified benchmarks and fail to generalize to real-world cloud-native microservice failures. Method: We systematically identify fundamental flaws in mainstream benchmarks across three dimensions—call graph realism, fault diversity, and SLI consistency—and construct the first large-scale, real-world failure dataset comprising 1,430 SLI-validated cases, covering 25 fault types, hierarchical root-cause annotations, and dynamic workload simulation. Our generative framework integrates realistic call graph design, multi-granularity fault injection, and telemetry pattern modeling. Contribution/Results: Re-evaluating 11 state-of-the-art models and rule-based baselines on this dataset reveals a mean Top@1 accuracy of only 0.21 (max 0.37), with significantly increased inference latency—exposing critical, shared bottlenecks in interpretability, robustness, and efficiency.

📝 Abstract
While cloud-native microservice architectures have transformed software development, their complexity makes Root Cause Analysis (RCA) both crucial and challenging. Although many data-driven RCA models have been proposed, we find that existing benchmarks are often oversimplified and fail to capture real-world conditions. Our preliminary study shows that simple rule-based methods can match or even outperform state-of-the-art (SOTA) models on four widely used benchmarks, suggesting performance overestimation due to benchmark simplicity. To address this, we systematically analyze popular RCA benchmarks and identify key limitations in fault injection, call graph design, and telemetry patterns. Based on these insights, we develop an automated framework to generate more realistic benchmarks, yielding a dataset of 1,430 validated failure cases from 9,152 injections, covering 25 fault types under dynamic workloads with hierarchical ground-truth labels and verified SLI impact. Re-evaluation of 11 SOTA models on this dataset shows low Top@1 accuracy (average 0.21, best 0.37) and significantly longer execution times. Our analysis highlights three common failure patterns: scalability issues, observability blind spots, and modeling bottlenecks.
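The headline numbers (average 0.21, best 0.37) are Top@1 accuracy: the fraction of failure cases where the model's single highest-ranked candidate matches the annotated root cause. A minimal sketch of the Top@k metric, with illustrative service names not taken from the paper's code:

```python
def top_k_accuracy(ranked_predictions, ground_truth, k=1):
    """Fraction of cases where the true root cause appears among the top-k ranked candidates."""
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return hits / len(ground_truth)

# Example: three failure cases, each with a ranked list of suspect services
preds = [["svc-a", "svc-b"], ["svc-c", "svc-a"], ["svc-b", "svc-d"]]
truth = ["svc-a", "svc-a", "svc-d"]
print(top_k_accuracy(preds, truth, k=1))  # 1/3, only the first case is ranked correctly
print(top_k_accuracy(preds, truth, k=2))  # 1.0, all true causes appear in the top 2
```

Top@1 is the strictest variant of this metric, which is why it is the headline figure for operator-facing RCA: in practice an on-call engineer acts on the first suggestion.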
Problem

Research questions and friction points this paper is trying to address.

Evaluating SOTA RCA models on oversimplified benchmarks
Identifying limitations in fault injection and telemetry patterns
Developing realistic benchmarks revealing model scalability issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework generates realistic failure benchmarks
Systematically analyzes limitations in fault injection and telemetry
Hierarchical ground-truth labels cover 25 fault types
Aoyang Fang
The Chinese University of Hong Kong, Shenzhen
Software Testing · AIOps · Root Cause Analysis
Songhan Zhang
The Chinese University of Hong Kong, Shenzhen, China
Yifan Yang
The Chinese University of Hong Kong, Shenzhen, China
Haotong Wu
The Chinese University of Hong Kong, Shenzhen, China
Junjielong Xu
The Chinese University of Hong Kong, Shenzhen
AI4SE
Xuyang Wang
Australian National University
Generative Modeling · 3D Vision · Deep Learning
Rui Wang
The Chinese University of Hong Kong, Shenzhen, China
Manyi Wang
The Chinese University of Hong Kong, Shenzhen, China
Qisheng Lu
The Chinese University of Hong Kong, Shenzhen, China
Pinjia He
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
Software Engineering · AI4SE · SE4AI · AIOps