KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

📅 2025-06-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address inefficiency, poor interpretability, and weak temporal modeling in KPI anomaly detection and root cause analysis (RCA) for large-scale cloud systems, this paper proposes an enhanced causal-similarity fusion framework. Methodologically, it (1) refines Symbolic Aggregate approXimation (SAX) into a differentiable trend encoding scheme to enhance fine-grained anomaly identification; (2) integrates causal graph modeling with a lightweight temporal convolutional network for interpretable, topology-aware RCA; and (3) introduces a dynamic thresholding mechanism to mitigate false negatives inherent in static thresholding. Evaluated on real-world cloud infrastructure, the method achieves F1-score improvements of 2.9–35.7% over eight state-of-the-art baselines, while reducing inference latency by 34.7%. It has been deployed in production systems of a leading cloud service provider.
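
This summary does not describe how the dynamic thresholding works, so the following is only a minimal sketch of the general idea and not the paper's mechanism. It contrasts a fixed cutoff with a rolling mean-plus-k-sigma cutoff. The function names and the parameters `window`, `k`, and `min_sigma` are illustrative assumptions.

```python
import numpy as np

def static_flags(x, threshold):
    """Flag points that exceed one fixed cutoff (the static baseline)."""
    return np.asarray(x, dtype=float) > threshold

def rolling_flags(x, window=5, k=3.0, min_sigma=0.5):
    """Flag points that exceed mean + k * std of the preceding `window` points.

    Hypothetical stand-in for a dynamic threshold; `min_sigma` keeps a
    perfectly flat history from making the cutoff arbitrarily tight.
    """
    x = np.asarray(x, dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    for t in range(window, len(x)):
        history = x[t - window:t]
        cutoff = history.mean() + k * max(history.std(), min_sigma)
        flags[t] = x[t] > cutoff
    return flags

# A low-traffic KPI with one spike that never reaches a cutoff tuned for peak load.
kpi = [10, 11, 10, 9, 10, 11, 10, 25, 10, 11]
print(static_flags(kpi, threshold=50).any())    # the static cutoff misses the spike
print(np.flatnonzero(rolling_flags(kpi)))       # indices flagged by the rolling cutoff
```

The spike at index 7 is small in absolute terms but large relative to its recent history, which is the kind of false negative a fixed cutoff can produce.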

📝 Abstract
To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-based approaches model these dependencies better, they often face challenges such as high computational demands and lack of interpretability. To address these issues, KPIRoot is proposed as an efficient method combining similarity and causality analysis. It uses symbolic aggregate approximation for compact KPI representation, improving analysis efficiency. However, deployment in Cloud H revealed two drawbacks: 1) threshold-based anomaly detection misses some performance anomalies, and 2) SAX representation fails to capture intricate variation trends. KPIRoot+ addresses these limitations, outperforming eight state-of-the-art baselines by 2.9% to 35.7%, while reducing time cost by 34.7%. We also share our experience deploying KPIRoot in a large-scale cloud provider's production environment.
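
To make the compact representation concrete, here is a minimal sketch of textbook SAX: z-normalize the series, average it over equal segments (piecewise aggregate approximation), then map each segment mean to a letter using equiprobable Gaussian breakpoints. This is the standard formulation, not necessarily the exact variant used in KPIRoot or KPIRoot+. The function name `sax` and its parameters are assumptions made for illustration.

```python
from statistics import NormalDist

import numpy as np

def sax(series, n_segments, alphabet_size=4):
    """Encode a numeric series as a SAX word of length `n_segments`."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    # z-normalize; a constant series maps to all zeros
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    # piecewise aggregate approximation: mean of each near-equal chunk
    paa = np.array([chunk.mean() for chunk in np.array_split(z, n_segments)])
    # breakpoints that split the standard normal into equiprobable bins
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(chr(ord("a") + int(s)) for s in symbols)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], n_segments=2))  # → ad
print(sax(range(8), n_segments=4))                      # → abcd
```

Each segment collapses to a single letter, so a step and a steep ramp inside one segment can map to the same symbol. That loss of shape is the kind of limitation the abstract attributes to the SAX representation.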

Problem

Research questions and friction points this paper is trying to address.

Detecting anomalies efficiently across the KPIs of large-scale cloud systems
Improving root cause analysis accuracy in complex, interdependent cloud environments
Reducing computational cost while maintaining interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

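Combines similarity and causality analysis for root cause localization
Uses symbolic aggregate approximation (SAX) for a compact KPI representation
Outperforms eight state-of-the-art baselines by 2.9% to 35.7% while reducing time cost by 34.7%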
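
As a rough illustration of why similarity and causality are complementary signals, the sketch below computes a Pearson correlation and a simple Granger-style score for a pair of series. The page does not specify how KPIRoot measures or combines the two, so both functions, their names, and the `lag` parameter are assumptions. The Granger-style score is the fractional drop in residual error when lagged values of the candidate KPI are added to an autoregressive model of the target.

```python
import numpy as np

def pearson_similarity(a, b):
    """Same-time linear similarity between two series."""
    return float(np.corrcoef(a, b)[0, 1])

def _rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def granger_gain(cause, effect, lag=2):
    """Fraction of the effect's autoregressive error removed by lagged `cause`."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    n = len(effect)
    y = effect[lag:]
    # column i holds the series shifted back by i steps
    own = np.column_stack([effect[lag - i:n - i] for i in range(1, lag + 1)])
    other = np.column_stack([cause[lag - i:n - i] for i in range(1, lag + 1)])
    ones = np.ones((n - lag, 1))
    rss_restricted = _rss(np.hstack([ones, own]), y)
    rss_full = _rss(np.hstack([ones, own, other]), y)
    return 1.0 - rss_full / rss_restricted

# Synthetic pair: `effect` follows `cause` with a one-step delay plus noise.
rng = np.random.default_rng(0)
cause = rng.normal(size=300)
effect = np.empty(300)
effect[0] = 0.0
effect[1:] = 0.8 * cause[:-1] + 0.1 * rng.normal(size=299)

print(pearson_similarity(cause, effect))  # same-time similarity
print(granger_gain(cause, effect))        # cause -> effect
print(granger_gain(effect, cause))        # effect -> cause
```

On this synthetic pair the delayed dependence shows up in the directional score rather than in the same-time correlation, which is why a method may want both signals.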