🤖 AI Summary
Anomaly detection in large-scale distributed systems faces three core challenges: (1) real-time processing of high-throughput time-series data, (2) adaptive modeling of heterogeneous metrics across engineering, business, and operational domains, and (3) explainable root-cause localization lacking rigorous causal inference. This paper proposes a real-time, scalable end-to-end framework addressing these challenges. Methodologically, it introduces (1) an inference-driven hierarchical detection architecture integrating stream processing with multidimensional time-series modeling; (2) mSelect, an automated technique jointly optimizing algorithm selection and hyperparameter tuning; and (3) an integrated causal inference module enabling efficient, interpretable root-cause attribution. Evaluated on nine public benchmarks, the framework achieves state-of-the-art performance across five key metrics and attains AUC > 0.85 on seven—consistently outperforming mainstream approaches.
📝 Abstract
Detecting anomalies in large, distributed systems presents several challenges. The first challenge arises from the sheer volume of data that needs to be processed. Flagging anomalies in a high-throughput environment calls for a careful consideration of both algorithm and system design. The second challenge comes from the heterogeneity of time-series datasets that leverage such a system in production. In practice, anomaly detection systems are rarely deployed for a single use case. Typically, there are several metrics to monitor, often across several domains (e.g. engineering, business and operations). A one-size-fits-all approach rarely works, so these systems need to be fine-tuned for every application - this is often done manually. The third challenge comes from the fact that determining the root-cause of anomalies in such settings is akin to finding a needle in a haystack. Identifying (in real time) a time-series dataset that is associated causally with the anomalous time-series data is a very difficult problem. In this paper, we describe a unified framework that addresses these challenges. Reasoning based Anomaly Detection Framework (RADF) is designed to perform real time anomaly detection on very large datasets. This framework employs a novel technique (mSelect) that automates the process of algorithm selection and hyper-parameter tuning for each use case. Finally, it incorporates a post-detection capability that allows for faster triaging and root-cause determination. Our extensive experiments demonstrate that RADF, powered by mSelect, surpasses state-of-the-art anomaly detection models in AUC performance for 5 out of 9 public benchmarking datasets. RADF achieved an AUC of over 0.85 for 7 out of 9 datasets, a distinction unmatched by any other state-of-the-art model.