LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience

📅 2025-01-28
🤖 AI Summary
To address delayed response and poor generalization in cloud-system anomaly detection, this paper proposes a scalable anomaly detection service aimed at Site Reliability Engineers (SREs). It uses large language models (LLMs) to model anomalies in cloud infrastructure by capturing key components, their failure modes, and their behaviors. A unified API exposes regression-based, Gaussian Mixture Model (GMM), and semi-supervised detectors for both univariate and multivariate time series. The service is evaluated on public anomaly benchmarks and has been applied in industrial settings, including IoT-based AI applications; it has served over 500 users and 200,000 API calls in a year. The authors argue that it helps SREs catch issues before they escalate, reducing downtime and incident response time and improving customer experience. Zero-shot detection via time-series foundation models is planned as future work.

📝 Abstract
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service enables efficient anomaly detection in complex data streams, supporting proactive identification and resolution of issues. Furthermore, it presents an innovative approach to anomaly modeling in cloud infrastructure by utilizing Large Language Models (LLMs) to understand key components, their failure modes, and behaviors. A suite of algorithms for detecting anomalies is offered in univariate and multivariate time series data, including regression-based, mixture-model-based, and semi-supervised approaches. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. The service has been successfully applied in various industrial settings, including IoT-based AI applications. We have also evaluated our system on public anomaly benchmarks to show its effectiveness. By leveraging it, SREs can proactively identify potential issues before they escalate, reducing downtime and improving response times to incidents, ultimately enhancing the overall customer experience. We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
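
As a rough illustration of what a regression-based detector for a univariate series can look like, the sketch below fits a least-squares linear trend and flags points whose residual is extreme relative to a robust scale (median absolute deviation). It is not the paper's algorithm or API: the function name `detect_anomalies`, the `threshold` default, and the sample data are all assumptions made for this example.

```python
from statistics import median


def detect_anomalies(values, threshold=3.5):
    """Return indices whose residual from a linear trend is extreme.

    Hypothetical helper for illustration only; not the service's API.
    """
    n = len(values)
    if n < 3:
        return []

    # Ordinary least-squares fit of values against their index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals of each point from the fitted trend line.
    residuals = [y - (intercept + slope * x) for x, y in enumerate(values)]

    # Robust scale: median absolute deviation, rescaled to be comparable
    # to a standard deviation under normally distributed residuals.
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return []  # no spread to compare against
    scale = 1.4826 * mad

    return [i for i, r in enumerate(residuals)
            if abs(r - med) / scale > threshold]


# Synthetic metric: linear growth with small alternating noise and one spike.
series = [10.0 + 0.5 * i + (0.1 if i % 2 else -0.1) for i in range(20)]
series[12] += 8.0
anomalies = detect_anomalies(series)
```

Using the median absolute deviation rather than the standard deviation keeps a single large spike from inflating the scale it is judged against, which is one common design choice for this kind of detector.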
Problem

Research questions and friction points this paper is trying to address.

Internet Service Stability
Anomaly Detection
Cloud Systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Model
Anomaly Detection
Cloud Systems