🤖 AI Summary
In large-scale cloud systems, manual incident management is labor-intensive and error-prone, while existing automation approaches suffer from poor generalizability, limited interpretability, and high deployment overhead. To address these challenges, this paper proposes OpsAgent, a lightweight multi-agent system for automated incident management and root-cause analysis. The method introduces three key innovations: (1) a training-free data processor that converts heterogeneous observability data (e.g., logs, metrics, traces) into structured textual descriptions; (2) an interpretable and auditable diagnostic reasoning framework grounded in multi-agent collaboration; and (3) a dual self-evolution mechanism that integrates internal model updates with external experience accumulation. OpsAgent is designed for cross-system transferability and low-cost closed-loop deployment. Evaluated on the OPENRCA benchmark, it achieves state-of-the-art performance, and the results indicate that it is generalizable, interpretable, cost-efficient, and sustainable for long-term operation.
📝 Abstract
Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
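To make the idea of a training-free data processor concrete, the following is a minimal sketch of how metrics and logs might be rendered as structured text for an LLM-based agent. It is not the paper's implementation: the record types `MetricSeries` and `LogLine`, the functions `describe_metric`, `describe_log`, and `to_text`, the z-score rule, and the output format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class MetricSeries:
    """Hypothetical metric record: one time series for one service."""
    service: str
    name: str
    values: List[float]


@dataclass
class LogLine:
    """Hypothetical log record."""
    service: str
    level: str
    message: str


def describe_metric(series: MetricSeries, z_threshold: float = 3.0) -> str:
    """Render a metric series as one line of text.

    Assumption: the latest point is flagged as anomalous when it lies more
    than z_threshold population standard deviations from the mean of the
    earlier points. No model training is involved.
    """
    history, latest = series.values[:-1], series.values[-1]
    mu, sigma = mean(history), pstdev(history)
    deviates = sigma > 0 and abs(latest - mu) / sigma > z_threshold
    status = "anomalous" if deviates else "normal"
    return (
        f"[metric] {series.service}.{series.name}: "
        f"latest={latest:.1f}, baseline={mu:.1f}, status={status}"
    )


def describe_log(line: LogLine) -> str:
    """Render a log record as one line of text."""
    return f"[log] {line.service} {line.level}: {line.message}"


def to_text(metrics: List[MetricSeries], logs: List[LogLine]) -> str:
    """Combine metrics and warning/error logs into one textual report."""
    parts = [describe_metric(m) for m in metrics]
    parts += [describe_log(l) for l in logs if l.level in {"ERROR", "WARN"}]
    return "\n".join(parts)


if __name__ == "__main__":
    report = to_text(
        [MetricSeries("checkout", "latency_ms", [10, 11, 9, 10, 50])],
        [LogLine("checkout", "ERROR", "timeout"), LogLine("checkout", "INFO", "ok")],
    )
    print(report)
```

The design choice illustrated here is that every modality is reduced to short, uniformly tagged lines, so a downstream agent can read metrics and logs in one prompt without a learned encoder. Traces are omitted for brevity and could be handled with a third renderer of the same shape.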