RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic

📅 2025-01-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of root cause localization in complex software systems—particularly susceptibility to spurious correlations and incomplete domain expertise—this paper proposes a causal graph modeling method that integrates partial domain knowledge. We introduce a novel four-stage framework: (1) initial causal structure learning via PC/GES variants; (2) reliability enhancement of causal edges using graph neural networks; (3) redundancy elimination through counterfactual reasoning; and (4) lightweight domain knowledge injection, enabling analysts to initiate analysis with only localized expert priors. Evaluated on both synthetic and real-world industrial datasets, our approach achieves a 27.3% improvement in root cause localization accuracy and reduces average causal path length by 41%, outperforming state-of-the-art causal discovery and correlation-based methods. The framework has been deployed in a cloud platform’s performance operations system.

Technology Category

Application Category

📝 Abstract
Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic. It aims to identify the root causes of system performance anomalies, allowing the resolution or the future prevention of issues that can cause millions of dollars in losses. Common existing approaches relying on data correlation or full domain expert knowledge are inaccurate or infeasible in most industrial cases, since correlation does not imply causation, and domain experts may not have full knowledge of complex and real-time systems. In this work, we define a novel causal domain knowledge model representing causal relations about the underlying system components to allow domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that through the causal graph discovery, enhancement, refinement, and subtraction processes is able to output a root cause causal sub-graph showing the causal relations between the system components affected by the anomaly. We evaluated RADICE with simulated data and reported a real data use case, sharing the lessons we learned. The experiments show that RADICE provides better results than other baseline methods, including causal discovery algorithms and correlation based approaches for root cause analysis.
Problem

Research questions and friction points this paper is trying to address.

Causal Inference
Performance Troubleshooting
Complex Systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

RADICE algorithm
Causal Relationships
Expert Local Knowledge
🔎 Similar Papers
No similar papers found.