AI Summary
To address inefficiency, poor interpretability, and weak temporal modeling in KPI anomaly detection and root cause analysis (RCA) for large-scale cloud systems, this paper proposes an enhanced causal-similarity fusion framework. Methodologically, it (1) refines Symbolic Aggregate approXimation (SAX) into a differentiable trend encoding scheme to enhance fine-grained anomaly identification; (2) integrates causal graph modeling with a lightweight temporal convolutional network for interpretable, topology-aware RCA; and (3) introduces a dynamic thresholding mechanism to mitigate false negatives inherent in static thresholding. Evaluated on real-world cloud infrastructure, the method achieves F1-score improvements of 2.9% to 35.7% over eight state-of-the-art baselines, while reducing inference latency by 34.7%. It has been deployed in production systems of a leading cloud service provider.
Abstract
To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-based approaches model these dependencies better, they often face challenges such as high computational demands and lack of interpretability. To address these issues, KPIRoot is proposed as an efficient method combining similarity and causality analysis. It uses symbolic aggregate approximation for compact KPI representation, improving analysis efficiency. However, deployment in Cloud H revealed two drawbacks: 1) threshold-based anomaly detection misses some performance anomalies, and 2) SAX representation fails to capture intricate variation trends. KPIRoot+ addresses these limitations, outperforming eight state-of-the-art baselines by 2.9% to 35.7%, while reducing time cost by 34.7%. We also share our experience deploying KPIRoot in a large-scale cloud provider's production environment.
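The abstract credits symbolic aggregate approximation (SAX) with giving KPIRoot a compact KPI representation. As a rough illustration of the general technique only, the sketch below implements textbook SAX: z-normalize the series, average it over equal-width segments, then map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name `sax`, its parameters, and the four-letter alphabet are assumptions chosen for this example; this is not KPIRoot's or KPIRoot+'s actual encoding.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Encode a numeric series as a SAX word.

    Hypothetical helper for illustration; not taken from the KPIRoot code.
    """
    if not 0 < n_segments <= len(series):
        raise ValueError("n_segments must be between 1 and len(series)")

    # Step 1: z-normalize so the breakpoints below apply to any KPI scale.
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Step 2: Piecewise Aggregate Approximation (mean of each segment).
    n = len(z)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]

    # Step 3: breakpoints cut the standard normal into equiprobable bins,
    # one bin per alphabet symbol.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # Step 4: a segment's symbol index is the number of breakpoints it exceeds.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)
```

For instance, `sax([1, 1, 1, 1, 10, 10, 10, 10], 2)` returns `"ad"`: the low first half maps to the lowest symbol and the high second half to the highest. Because each segment collapses to a single letter, variation inside a segment is lost, which is in line with the drawback the abstract reports about SAX failing to capture intricate variation trends.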