From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems

📅 2025-10-28
🤖 AI Summary
In large-scale cloud systems, manual incident management is labor-intensive and error-prone, while existing automated approaches generalize poorly, offer limited interpretability, and are costly to deploy. To address these challenges, the paper proposes OpsAgent, a lightweight multi-agent system for automated root-cause analysis. The method introduces three key innovations: (1) a training-free data processor that converts heterogeneous observability data (e.g., logs, metrics, traces) into structured natural-language descriptions; (2) an interpretable and auditable diagnostic reasoning framework built on multi-agent collaboration; and (3) a dual self-evolution mechanism that combines internal model updates with external experience accumulation. Together these enable transfer across systems and low-cost, closed-loop deployment. Evaluated on the OPENRCA benchmark, OpsAgent achieves state-of-the-art performance, with results supporting its generalizability, diagnostic transparency, cost efficiency, and suitability for long-term operation.

📝 Abstract
Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
Problem

Research questions and friction points this paper is trying to address.

Automating incident diagnosis from heterogeneous cloud observability data
Overcoming limited generalization and interpretability in existing solutions
Reducing deployment costs while enabling continuous system evolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free data processor converts observability data to text
Multi-agent collaboration framework ensures transparent diagnostic inference
Dual self-evolution mechanism enables continuous capability growth
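To make the first point concrete, here is a minimal sketch of what a training-free conversion from observability records to text could look like. It is not OpsAgent's actual processor: the `MetricSeries` type, the `describe_metric` and `describe_log` helpers, the z-score rule, and the output wording are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class MetricSeries:
    """Hypothetical container for one metric; not a type from the paper."""
    service: str
    name: str
    values: list[float]  # evenly spaced samples, oldest first, at least two


def describe_metric(series: MetricSeries, z_threshold: float = 3.0) -> str:
    """Render a metric series as one sentence, flagging the latest sample.

    Assumption: a simple z-score against the earlier samples stands in for
    whatever rule a real processor would use. No model training is involved.
    """
    history, latest = series.values[:-1], series.values[-1]
    baseline = mean(history)
    spread = pstdev(history)
    # A flat history has zero spread, so the z-score is undefined; treat as 0.
    z = 0.0 if spread == 0 else (latest - baseline) / spread
    status = "anomalous" if abs(z) >= z_threshold else "normal"
    return (
        f"Metric {series.name} on {series.service}: latest={latest:.1f}, "
        f"baseline={baseline:.1f}, status={status}."
    )


def describe_log(service: str, level: str, template: str, count: int) -> str:
    """Render an aggregated log template as one sentence."""
    return f"Log on {service}: [{level}] '{template}' seen {count} times."


if __name__ == "__main__":
    cpu = MetricSeries("checkout", "cpu_usage", [10.0, 12.0, 11.0, 9.0, 10.0, 95.0])
    print(describe_metric(cpu))
    print(describe_log("checkout", "ERROR", "connection refused", 42))
```

Text produced this way can be passed to language-model agents directly, which is the property the "training-free" claim relies on; the thresholds and phrasing above would need tuning for any real system.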
Yu Luo
Nankai University, Tianjin, China
Jiamin Jiang
Nankai University, Tianjin, China
Jingfei Feng
Nankai University, Tianjin, China
Lei Tao
Nankai University
AIOps, LLM4Ops, HPC
Qingliang Zhang
Nankai University, Tianjin, China
Xidao Wen
Bizseer, Beijing, China
Yongqian Sun
Nankai University
AIOps, Anomaly Detection, Failure Localization, Microservices Fault Diagnosis, Root Cause Analysis
Shenglin Zhang
Nankai University
AI Operations in general
Jielong Huang
Kuaishou Technology, Hangzhou, China
Nan Qi
IEEE Senior Member
Anti-jamming, Game theory, Optimization theory, UAV Communications, RIS
Dan Pei
Associate Professor of Computer Science, Tsinghua University
AIOps, Time Series Intelligence