Robust Root Cause Diagnosis using In-Distribution Interventions

📅 2025-05-02
🤖 AI Summary
Problem: Root cause diagnosis in cloud services and industrial systems remains challenging because anomalies are rare and fall outside the training distribution, so counterfactual reasoning with a fitted structural causal model (SCM) must extrapolate and becomes unreliable. Method: The paper proposes In-Distribution Interventions (IDI), which avoids counterfactual estimation on anomalous samples. Instead, IDI probes the fitted SCM only at in-distribution inputs and evaluates each node’s “anomalousness” and “repairability” together to localize root causes. Contribution/Results: Theoretically, the paper compares and bounds the errors of interventional and counterfactual estimates of the fix condition. Experiments that systematically vary the SCM's complexity show where the interventional approach outperforms the counterfactual approach and where it does not. On synthetic data and the PetShop RCD benchmark, IDI identifies true root causes more accurately and robustly than nine state-of-the-art baselines.

📝 Abstract
Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today's cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root causes as the nodes that meet two criteria: 1) **Anomaly:** root cause nodes should take on anomalous values; 2) **Fix:** had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by probing the fitted SCM solely at in-distribution inputs. We present a theoretical analysis comparing and bounding the errors in assessing the fix condition using interventional and counterfactual estimates. We then conduct experiments that systematically vary the SCM's complexity to demonstrate the cases where IDI's interventional approach outperforms the counterfactual approach and vice versa. Experiments on both synthetic and PetShop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. Code is released at https://github.com/nlokeshiisc/IDI_release.
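
The two criteria can be illustrated with a minimal sketch. It is not the authors' implementation (see the released repository for that). It assumes a hypothetical three-node chain `a -> b -> y`, linear mechanisms fitted by least squares, a marginal z-score as the anomaly test, the training mean as the "usual value", and a threshold of 3. All names (`fit_scm`, `z_score`, `intervene`, `diagnose`) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system a -> b -> y, where y is the monitored target.
parents = {"a": [], "b": ["a"], "y": ["b"]}
order = ["a", "b", "y"]  # topological order of the causal graph

# Historical (normal) data used to fit the SCM.
n = 2000
a = rng.normal(0.0, 1.0, n)
b = 2.0 * a + rng.normal(0.0, 0.1, n)
y = 3.0 * b + rng.normal(0.0, 0.1, n)
train = {"a": a, "b": b, "y": y}


def fit_scm(data):
    """Fit one linear mechanism (slopes plus intercept) per non-root node."""
    weights = {}
    for node in order:
        if parents[node]:
            cols = [data[p] for p in parents[node]] + [np.ones(len(data[node]))]
            design = np.column_stack(cols)
            weights[node] = np.linalg.lstsq(design, data[node], rcond=None)[0]
    return weights


def z_score(node, value):
    """Assumed anomaly score: distance from the training mean in std units."""
    return abs(value - train[node].mean()) / train[node].std()


def intervene(weights, sample, node, value):
    """Apply do(node = value) and recompute descendants with fitted mechanisms."""
    out = dict(sample)
    out[node] = value
    changed = {node}
    for child in order[order.index(node) + 1:]:
        if changed & set(parents[child]):
            x = np.array([out[p] for p in parents[child]] + [1.0])
            out[child] = float(x @ weights[child])
            changed.add(child)
    return out


def diagnose(weights, sample, target, threshold=3.0):
    """Return the nodes that satisfy both the Anomaly and the Fix criterion."""
    causes = []
    for node in order:
        if node == target:
            continue
        anomalous = z_score(node, sample[node]) > threshold
        usual = float(train[node].mean())  # an in-distribution value
        repaired = intervene(weights, sample, node, usual)
        fixed = z_score(target, repaired[target]) < threshold
        if anomalous and fixed:
            causes.append(node)
    return causes


weights = fit_scm(train)
# Anomalous sample: a is normal, b received a large shock that propagated to y.
anomaly = {"a": 0.1, "b": 10.2, "y": 30.6}
root_causes = diagnose(weights, anomaly, target="y")
print(root_causes)
```

In this toy run, setting `a` to its usual value would also bring `y` back to normal, but `a` itself is not anomalous, so only `b` satisfies both criteria. The fitted mechanisms are only ever evaluated at values seen during training, which is the property the abstract attributes to the interventional approach.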

Problem

Research questions and friction points this paper is trying to address.

Diagnosing root causes of anomalies in complex systems
Overcoming unreliable counterfactual estimates in SCMs
Improving accuracy and robustness in root cause identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses In-Distribution Interventions for root cause diagnosis
Relies on interventional estimates obtained by probing the fitted SCM only at in-distribution inputs
Identifies root causes more accurately and robustly than nine state-of-the-art RCD baselines