🤖 AI Summary
The growing complexity of modern software systems places an unsustainable operational burden on Site Reliability Engineering (SRE) teams, while existing AI-based solutions either lack deep causal reasoning or are not tailored to the investigative workflows specific to SRE. To address this gap, the paper introduces OpenDerisk, an open-source multi-agent framework designed specifically for SRE. It combines four components: (1) a diagnostic-native collaboration model that coordinates specialist agents; (2) a pluggable reasoning engine; (3) a knowledge engine; and (4) the standardized Model Context Protocol (MCP) for communication between components. According to the authors' evaluation, OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. It is deployed at scale in Ant Group's production environment, where it serves over 3,000 daily users across diverse scenarios, which supports its claims of industrial-grade scalability and practical utility.
📝 Abstract
The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/