🤖 AI Summary
In microservice architectures, anomaly detection and root cause localization face two main challenges: modeling time-varying dependencies and the poor scalability of causal inference. This paper proposes an end-to-end, scalable approach grounded in functional connectivity, a concept adapted from neuroscience. It captures dynamic inter-metric dependencies via sliding-window correlation and integrates microservice topology using a graph attention network to jointly produce anomaly scores and rank root cause candidates. By avoiding expensive causal discovery, the method achieves high accuracy with low computational complexity. Evaluated across diverse fault scenarios, including latency spikes, service crashes, and resource exhaustion, it outperforms state-of-the-art baselines. It also demonstrates strong effectiveness and scalability in Alibaba’s large-scale production environment, handling thousands of services and millions of metrics in real time.
📝 Abstract
Microservices have transformed software architecture through the creation of modular and independent services. However, they introduce operational complexities in service integration and system management that make swift and accurate anomaly detection and localisation challenging. Despite the complex, dynamic, and interconnected nature of microservice architectures, prior works that investigate metrics for anomaly detection rarely include explicit information about time-varying interdependencies. Whilst prior works on fault localisation typically do incorporate information about dependencies between microservices, they scale poorly to large-scale real-world deployments because they rely on computationally expensive causal inference. To address these challenges, we propose FC-ADL, an end-to-end scalable approach for detecting and localising anomalous changes from microservice metrics based on the neuroscientific concept of functional connectivity. We show that by efficiently characterising time-varying changes in dependencies between microservice metrics, we can both detect anomalies and provide root cause candidates without incurring the significant overheads of causal and multivariate approaches. We demonstrate that our approach achieves top detection and localisation performance across a wide range of fault scenarios when compared to state-of-the-art approaches. Furthermore, we illustrate the scalability of our approach by applying it to Alibaba's extremely large real-world microservice deployment.
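To make the idea of tracking time-varying dependencies between metrics concrete, the following is a minimal sketch of sliding-window functional connectivity. It is not the FC-ADL implementation: the function names, the use of Pearson correlation, the window sizes, the scoring by absolute change between consecutive windows, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def sliding_window_connectivity(metrics, window, step):
    """Pearson correlation matrix per window.

    metrics: array of shape (T, M), T time steps for M metrics.
    Returns an array of shape (W, M, M), one matrix per window.
    """
    n_samples = metrics.shape[0]
    mats = []
    for start in range(0, n_samples - window + 1, step):
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.corrcoef(metrics[start:start + window], rowvar=False)
        # A constant metric yields NaN correlations; treat them as 0.
        mats.append(np.nan_to_num(corr))
    return np.stack(mats)

def connectivity_change(conn):
    """Score how much connectivity changes between consecutive windows.

    Returns (window_scores, metric_scores):
      window_scores[k]    total absolute change from window k to k+1
      metric_scores[k, m] share of that change involving metric m
    """
    diff = np.abs(np.diff(conn, axis=0))      # shape (W-1, M, M)
    window_scores = diff.sum(axis=(1, 2))     # illustrative anomaly score
    metric_scores = diff.sum(axis=2)          # illustrative candidate ranking
    return window_scores, metric_scores

# Synthetic data (hypothetical): four metrics follow a shared load signal.
# From t = 300 onwards, metric 2 moves against the load instead of with it.
rng = np.random.default_rng(0)
T, M = 400, 4
shared = rng.normal(size=T)
metrics = shared[:, None] + 0.1 * rng.normal(size=(T, M))
metrics[300:, 2] = -shared[300:] + 0.1 * rng.normal(size=T - 300)

conn = sliding_window_connectivity(metrics, window=50, step=50)
window_scores, metric_scores = connectivity_change(conn)

anomalous_transition = int(np.argmax(window_scores))
top_candidate = int(np.argmax(metric_scores[anomalous_transition]))
print("largest connectivity change at transition", anomalous_transition)
print("top root cause candidate: metric", top_candidate)
```

In this sketch each window costs one M x M correlation computation and no causal graph is ever estimated, which is the kind of saving the abstract attributes to avoiding causal inference. How FC-ADL actually scores and ranks candidates is not specified here.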