GAL-MAD: Towards Explainable Anomaly Detection in Microservice Applications Using Graph Attention Networks

📅 2025-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The dynamic, distributed nature of microservice architectures makes anomaly detection difficult, particularly when modeling high-dimensional, interdependent data and localizing root causes. To address these issues, we introduce RS-Anomic, a dataset generated with the open-source RobotShop application that records multivariate performance metrics and response times under normal and anomalous conditions and covers ten anomaly types. We propose GAL-MAD, a model that combines Graph Attention Networks (GAT) with LSTM to capture both the spatial dependencies between services and their temporal dynamics. We also use SHAP values to localize anomalous services and identify root causes. On RS-Anomic, GAL-MAD achieves higher accuracy and recall than state-of-the-art models across varying anomaly rates, and its explanations give system administrators actionable insight into service anomalies.

📝 Abstract

The transition to microservices has revolutionized software architectures, offering enhanced scalability and modularity. However, the distributed and dynamic nature of microservices introduces complexities in ensuring system reliability, making anomaly detection crucial for maintaining performance and functionality. Anomalies stemming from network and performance issues must be swiftly identified and addressed. Existing anomaly detection techniques often rely on statistical models or machine learning methods that struggle with the high-dimensional, interdependent data inherent in microservice applications. Current techniques and available datasets predominantly focus on system traces and logs, limiting their ability to support advanced detection models. This paper addresses these gaps by introducing the RS-Anomic dataset generated using the open-source RobotShop microservice application. The dataset captures multivariate performance metrics and response times under normal and anomalous conditions, encompassing ten types of anomalies. We propose a novel anomaly detection model called Graph Attention and LSTM-based Microservice Anomaly Detection (GAL-MAD), leveraging Graph Attention and Long Short-Term Memory architectures to capture spatial and temporal dependencies in microservices. We utilize SHAP values to localize anomalous services and identify root causes to enhance explainability. Experimental results demonstrate that GAL-MAD outperforms state-of-the-art models on the RS-Anomic dataset, achieving higher accuracy and recall across varying anomaly rates. The explanations provide actionable insights into service anomalies, which benefits system administrators.
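The spatial side of the model can be illustrated with a minimal, single-head graph attention step. This is only a sketch of the general GAT idea (score each neighbor, softmax the scores, take a weighted sum), not the GAL-MAD implementation: the service names, feature values, weight matrix `W`, and attention vector `a` below are made-up assumptions, and the nonlinearity that a GAT layer usually applies after aggregation is omitted. In a full model, the per-service embeddings produced at each time step would then be passed to an LSTM to capture temporal dependencies.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    # W is a list of rows (out_dim x in_dim); h is a feature vector.
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def graph_attention(features, neighbors, W, a):
    """One single-head graph attention step over a service graph.

    features:  {service: [float, ...]} metric vector per service
    neighbors: {service: [service, ...]} adjacency, including the self-loop
    W:         shared linear projection (out_dim x in_dim)
    a:         attention vector of length 2 * out_dim
    Returns (embeddings, attention weights per service).
    """
    z = {s: matvec(W, h) for s, h in features.items()}
    embeddings, attention = {}, {}
    for i, nbrs in neighbors.items():
        # Unnormalized score for each neighbor j: LeakyReLU(a . [z_i || z_j]).
        # z[i] + z[j] is list concatenation, not element-wise addition.
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(nbrs, alphas))
        # New embedding: attention-weighted sum of projected neighbor features.
        dim = len(z[i])
        embeddings[i] = [
            sum(alpha * z[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(dim)
        ]
    return embeddings, attention


# Hypothetical three-service graph; values are illustrative only.
features = {"web": [0.2, 0.1], "cart": [0.9, 0.8], "catalogue": [0.3, 0.2]}
neighbors = {
    "web": ["web", "cart", "catalogue"],
    "cart": ["cart", "web"],
    "catalogue": ["catalogue", "web"],
}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 1.0]

embeddings, attention = graph_attention(features, neighbors, W, a)
for service, weights in attention.items():
    print(service, {k: round(v, 3) for k, v in weights.items()})
```

The attention weights for each service sum to 1 over its neighborhood, so they can be read as how much each neighboring service contributes to that service's embedding.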

Problem

Research questions and friction points this paper is trying to address.

Detect anomalies in microservices using graph attention networks

Address high-dimensional interdependent data challenges in microservices

Enhance explainability and root cause analysis for anomalies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Graph Attention Networks for anomaly detection

Combines graph attention with LSTM to capture temporal dependencies

Employs SHAP values for explainable root causes
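The idea behind SHAP-based localization can be sketched by computing exact Shapley values over a handful of services. This is a toy illustration under stated assumptions, not the paper's pipeline: `toy_score` stands in for a detector's anomaly score, the service names and numbers are invented, and the exact enumeration below is exponential in the number of inputs, whereas the SHAP library relies on approximations to scale.

```python
from itertools import combinations
from math import factorial


def shapley_values(score_fn, observed, baseline):
    """Exact Shapley attribution of score_fn(observed) across inputs.

    An input outside a coalition is replaced by its baseline value.
    The cost grows as 2^n, so this only suits a small number of inputs.
    """
    names = list(observed)
    n = len(names)

    def value(coalition):
        mixed = {
            k: (observed[k] if k in coalition else baseline[k])
            for k in names
        }
        return score_fn(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


# Hypothetical normalized response times per service (illustrative values).
baseline = {"web": 0.2, "cart": 0.3, "catalogue": 0.25}
observed = {"web": 0.2, "cart": 0.9, "catalogue": 0.35}


def toy_score(x):
    # Stand-in anomaly score: squared deviation from normal behavior.
    return sum((x[k] - baseline[k]) ** 2 for k in x)


phi = shapley_values(toy_score, observed, baseline)
suspect = max(phi, key=phi.get)
print(suspect, {k: round(v, 4) for k, v in phi.items()})
```

The attributions sum to the difference between the observed score and the baseline score, and the service with the largest attribution is the candidate root cause in this toy setup.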

🔎 Similar Papers

Graph Anomaly Detection in Time Series: A Survey