From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In large-scale cloud systems, manual incident management is labor-intensive and error-prone, while existing automated approaches generalize poorly across systems, offer limited interpretability, and incur high deployment costs. To address these challenges, the paper proposes OpsAgent, a lightweight multi-agent system for automated root-cause analysis. The method makes three contributions: (1) a training-free data processor that converts heterogeneous observability data (e.g., logs, metrics, traces) into structured textual descriptions; (2) an interpretable and auditable diagnostic reasoning framework built on multi-agent collaboration; and (3) a dual self-evolution mechanism that combines internal model updates with external experience accumulation, closing the deployment loop. Evaluated on the OPENRCA benchmark, OpsAgent achieves state-of-the-art performance, and the authors report that it is generalizable, interpretable, cost-efficient, and suited to long-term operation.

📝 Abstract

Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.

Problem

Research questions and friction points this paper is trying to address.

Automating incident diagnosis from heterogeneous cloud observability data

Overcoming limited generalization and interpretability in existing solutions

Reducing deployment costs while enabling continuous system evolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free data processor converts observability data to text

Multi-agent collaboration framework ensures transparent diagnostic inference

Dual self-evolution mechanism enables continuous capability growth
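
To make the first innovation concrete, here is a minimal sketch of what a training-free step from observability data to text could look like. It is not OpsAgent's actual processor, which this summary does not describe in detail. All names (`MetricSeries`, `LogLine`, `Span`, `to_text`), the z-score rule, and the sentence templates are hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


# Hypothetical record types; a real system would map its own telemetry schema onto these.
@dataclass
class MetricSeries:
    service: str
    name: str
    values: List[float]  # oldest first, latest sample last


@dataclass
class LogLine:
    service: str
    level: str
    message: str


@dataclass
class Span:
    caller: str
    callee: str
    duration_ms: float
    error: bool


def describe_metric(m: MetricSeries, z_threshold: float = 3.0) -> Optional[str]:
    """Return one sentence if the latest sample deviates strongly from the earlier baseline."""
    baseline, latest = m.values[:-1], m.values[-1]
    if len(baseline) < 2:
        return None  # not enough history to form a baseline
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        deviates = latest != mu  # flat baseline: any change counts
    else:
        deviates = abs(latest - mu) / sigma >= z_threshold
    if not deviates:
        return None
    direction = "above" if latest > mu else "below"
    return (f"Metric {m.name} on {m.service} is {latest:.1f}, "
            f"{direction} its baseline mean of {mu:.1f}.")


def describe_logs(logs: List[LogLine]) -> List[str]:
    """Summarize ERROR-level log volume per service."""
    counts = Counter(line.service for line in logs if line.level == "ERROR")
    return [f"Service {svc} logged {n} ERROR line(s)." for svc, n in sorted(counts.items())]


def describe_traces(spans: List[Span]) -> List[str]:
    """Report failed calls between services."""
    return [f"Call {s.caller} -> {s.callee} failed after {s.duration_ms:.0f} ms."
            for s in spans if s.error]


def to_text(metrics: List[MetricSeries], logs: List[LogLine], spans: List[Span]) -> str:
    """Merge all three sources into one plain-text description, using rules only (no training)."""
    sentences = [s for s in (describe_metric(m) for m in metrics) if s is not None]
    sentences += describe_logs(logs)
    sentences += describe_traces(spans)
    return "\n".join(sentences)


if __name__ == "__main__":
    report = to_text(
        metrics=[MetricSeries("checkout", "cpu_pct", [10.0, 11.0, 9.0, 10.0, 50.0])],
        logs=[LogLine("checkout", "ERROR", "db timeout"), LogLine("cart", "INFO", "ok")],
        spans=[Span("frontend", "checkout", 1200.0, True)],
    )
    print(report)
```

Because the conversion relies on fixed rules and templates rather than a learned model, it needs no training data from the target system, which is one plausible reading of why such a step would be cheap to move between platforms. The resulting text could then be handed to the diagnostic agents as ordinary natural-language context.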
