EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems

2026-04-30

AI Summary
This work addresses a limitation of existing anomaly detection and localization approaches for cloud systems: they rely predominantly on metrics and logs and overlook the anomaly signals and root cause clues embedded in event data. To bridge this gap, we propose EventADL, the first open-box Anomaly Detection and Localization (ADL) framework designed specifically for event data. The method characterizes normal system behavior by learning Event Semantic Patterns (ESPs) and Event Frequency Patterns (EFPs) offline from historical event data, then detects online deviations from either pattern. For localization, it constructs an Intervention Graph that supports unsupervised, interpretable root cause inference. Evaluation on three real-world cloud systems and two real-world incidents shows F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.
Abstract
Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.
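To make the two-pattern idea concrete, here is a minimal sketch of offline pattern learning followed by online deviation checks. It is not EventADL's actual algorithm, which the abstract does not specify. Everything in it is an assumption made for illustration: the `(source, action, target)` event shape, the fixed-size windows, the mean and standard deviation frequency model, the threshold `k`, and the function names `learn_patterns` and `detect`.

```python
from collections import Counter
from statistics import mean, pstdev

# Assumed event shape: (source entity, action, target entity).
Event = tuple[str, str, str]


def learn_patterns(windows: list[list[Event]]):
    """Offline phase: collect known interaction patterns (a stand-in for ESPs)
    and per-window count statistics for each of them (a stand-in for EFPs)."""
    counts_per_window = [Counter(w) for w in windows]
    known = set().union(*counts_per_window)
    stats = {}
    for pattern in known:
        series = [c[pattern] for c in counts_per_window]
        stats[pattern] = (mean(series), pstdev(series))
    return known, stats


def detect(window: list[Event], known, stats, k: float = 3.0):
    """Online phase: flag unseen patterns and counts far from the learned mean."""
    observed = Counter(window)
    alerts = []
    # Include known patterns too, so a pattern that vanished can also be flagged.
    for pattern in sorted(known | set(observed)):
        count = observed[pattern]
        if pattern not in known:
            alerts.append(("semantic", pattern, count))
            continue
        mu, sigma = stats[pattern]
        # Floor sigma at 1.0 so perfectly regular training data does not
        # make every small fluctuation look anomalous (an arbitrary choice).
        if abs(count - mu) > k * max(sigma, 1.0):
            alerts.append(("frequency", pattern, count))
    return alerts


# Hypothetical data: three identical training windows, then one noisy window.
normal = [("api", "calls", "db")] * 5 + [("scheduler", "places", "pod")] * 2
known, stats = learn_patterns([normal, normal, normal])
live = ([("api", "calls", "db")] * 20
        + [("scheduler", "places", "pod")] * 2
        + [("kubelet", "evicts", "pod")])
alerts = detect(live, known, stats)
```

In this toy run, the burst of `api calls db` events is reported as a frequency deviation and the never-before-seen `kubelet evicts pod` event as a semantic deviation. Because each alert names the pattern that triggered it, the output stays interpretable without labeled data, which is the property the abstract emphasizes.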
Problem

Research questions and friction points this paper is trying to address.

anomaly detection
event data
cloud systems
root cause localization
unlabeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-based Anomaly Detection
Semantic Pattern Learning
Intervention Graph
Root Cause Localization
Unsupervised ADL
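The Intervention Graph idea can be illustrated with a small sketch. The abstract says only that the graph relates recent system interactions to the detected anomalies, so the code below is a simplified stand-in rather than the paper's method. It assumes interactions are directed `(actor, target)` pairs and ranks each actor by how many anomalous entities it can reach, preferring shorter paths. The function name `rank_root_causes` and the scoring rule are hypothetical.

```python
from collections import defaultdict, deque


def rank_root_causes(interactions, anomalous, top_k=3):
    """Rank actors by reachable anomalous entities (more is better),
    breaking ties by smaller total hop distance. Illustrative scoring only."""
    graph = defaultdict(set)
    for actor, target in interactions:
        graph[actor].add(target)

    scores = {}
    for candidate in list(graph):
        # Breadth-first search gives the hop distance to every reachable entity.
        dist = {candidate: 0}
        queue = deque([candidate])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        reached = [dist[a] for a in anomalous if a in dist and a != candidate]
        if reached:
            scores[candidate] = (len(reached), -sum(reached))

    ranked = sorted(scores, key=lambda c: (scores[c], c), reverse=True)
    return ranked[:top_k]


# Hypothetical recent interactions and detected anomalies.
interactions = [
    ("deploy-controller", "node-1"),
    ("node-1", "pod-a"),
    ("node-1", "pod-b"),
    ("autoscaler", "pod-c"),
]
candidates = rank_root_causes(interactions, {"pod-a", "pod-b"})
```

Here `node-1` ranks first because it touches both anomalous pods directly, `deploy-controller` ranks second because it reaches them one hop further away, and `autoscaler` is excluded because it never interacted with an anomalous entity. The ranked list, together with the paths behind each score, is the kind of evidence an interpretable localization step could surface.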