MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Root cause analysis (RCA) in cloud-native systems is held back by poor scalability, limited cross-system generalizability, and insufficient integration of domain knowledge. This work proposes MetaRCA, a generalizable RCA framework built on a Meta Causal Graph (MCG): a reusable, metadata-level causal knowledge base constructed offline by fusing knowledge from large language models, historical incident reports, and observability data. When a fault occurs, the framework instantiates the MCG into a localized graph for the current context, then uses real-time data to weight and prune causal links. Evaluated on 252 public and 59 production incidents, it outperforms the strongest baseline by 29 percentage points in service-level accuracy and 48 percentage points in metric-level accuracy. Its overhead scales near-linearly as system complexity grows, and it maintains over 80% accuracy across diverse systems.

📝 Abstract
The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
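The two phases described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the `MetaEdge` type, the averaging of evidence scores, the callee-to-caller propagation rule, the min-based edge weight, and all thresholds and sample data are assumptions made here purely for illustration. It shows an offline step that keeps metadata-level edges supported by fused evidence, and an online step that instantiates those edges over a concrete topology, prunes weak links using current anomaly scores, and reports the nodes with no surviving upstream cause.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetaEdge:
    """Hypothetical metadata-level causal edge between (component type, metric type) pairs."""
    cause: Tuple[str, str]
    effect: Tuple[str, str]


def build_mcg(evidence, min_score=0.5):
    """Offline: keep meta edges whose mean evidence score reaches min_score.

    Averaging the per-source scores is an assumption of this sketch.
    """
    return {
        edge for edge, scores in evidence.items()
        if sum(scores.values()) / len(scores) >= min_score
    }


def instantiate(mcg, instances, calls):
    """Online: expand meta edges into edges between concrete instances.

    instances maps instance name -> component type; calls holds (caller, callee).
    Assumed rule: a fault spreads within one instance or from callee to caller.
    """
    edges = set()
    for edge in mcg:
        (c_type, c_metric), (e_type, e_metric) = edge.cause, edge.effect
        for src, src_type in instances.items():
            for dst, dst_type in instances.items():
                if src_type != c_type or dst_type != e_type:
                    continue
                if src == dst or (dst, src) in calls:
                    edges.add(((src, c_metric), (dst, e_metric)))
    return edges


def prune_and_rank(edges, anomaly, min_weight=0.3):
    """Weight each edge by the weaker endpoint anomaly, prune, and rank root nodes."""
    kept = set()
    for cause, effect in edges:
        weight = min(anomaly.get(cause, 0.0), anomaly.get(effect, 0.0))
        if weight >= min_weight:
            kept.add((cause, effect))
    causes = {c for c, _ in kept}
    effects = {e for _, e in kept}
    roots = causes - effects  # nodes with no surviving upstream cause
    return sorted(roots, key=lambda n: anomaly.get(n, 0.0), reverse=True)


# Illustrative data only; none of these names or scores come from the paper.
evidence = {
    MetaEdge(("database", "cpu_usage"), ("database", "query_latency")):
        {"llm": 0.9, "reports": 0.8, "data": 0.7},
    MetaEdge(("database", "query_latency"), ("service", "response_time")):
        {"llm": 0.8, "reports": 0.9, "data": 0.6},
    MetaEdge(("service", "response_time"), ("database", "cpu_usage")):
        {"llm": 0.2, "reports": 0.1, "data": 0.0},
}
instances = {"orders-db": "database", "users-db": "database", "orders-api": "service"}
calls = {("orders-api", "orders-db"), ("orders-api", "users-db")}
anomaly = {
    ("orders-db", "cpu_usage"): 0.9,
    ("orders-db", "query_latency"): 0.8,
    ("orders-api", "response_time"): 0.7,
    ("users-db", "cpu_usage"): 0.1,
    ("users-db", "query_latency"): 0.05,
}

mcg = build_mcg(evidence)
edges = instantiate(mcg, instances, calls)
ranked = prune_and_rank(edges, anomaly)
print(ranked)  # [('orders-db', 'cpu_usage')]
```

Because the meta edges refer only to component and metric types, the same `mcg` could in principle be reused against a different `instances` and `calls` topology, which is the intuition behind the cross-system generalization the abstract reports.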
Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis
Cloud-Native Systems
Causal Inference
Generalization
Scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta Causal Graph
Root Cause Analysis
Cloud-Native Systems
Causal Inference
Generalizable Framework
Authors

Shuai Liang
College of Environmental Science and Engineering, Beijing Forestry University
Membrane Technology, Nanotechnology
Pengfei Chen
Sun Yat-sen University, Associate Professor
Distributed computing, Cloud computing and Blockchain
Bozhe Tian
China Unicom Software Research Institute, China
Gou Tan
Sun Yat-sen University, China
Maohong Xu
China Unicom Software Research Institute, China
Youjun Qu
China Unicom Software Research Institute, China
Yahui Zhao
China Unicom Software Research Institute, China
Yiduo Shang
China Unicom Software Research Institute, China
Chongkang Tan
Individual Researcher, China