OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

📅 2025-10-15

🤖 AI Summary

The growing complexity of modern software systems places an unsustainable operational burden on Site Reliability Engineering (SRE) teams, while existing AI-based solutions either lack deep causal reasoning or fit poorly with the investigative workflows of SRE. To address this, the paper introduces OpenDerisk, an open-source multi-agent framework designed specifically for SRE. It combines four components: (1) a diagnostic-native collaboration model that coordinates reasoning among specialist agents; (2) a pluggable reasoning engine; (3) a knowledge engine; and (4) a standardized protocol, the Model Context Protocol (MCP). The evaluation reports that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. It is deployed at scale in production at Ant Group, where it serves over 3,000 daily users across diverse scenarios.

📝 Abstract

The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/

Problem

Research questions and friction points this paper is trying to address.

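SRE teams face an unsustainable operational burden as software complexity grows, and need AI-driven automation that emulates expert diagnostic reasoning

Existing approaches, from traditional AI methods to general-purpose multi-agent systems, either lack deep causal reasoning or are not tailored to SRE's investigative workflows

SRE lacks an industrial framework built around expert diagnostic reasoning workflows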

Innovation

Methods, ideas, or system contributions that make the work stand out.

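OpenDerisk is a specialized, open-source multi-agent framework for SRE

It integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP)

The framework enables specialist agents to collectively solve complex, multi-domain problems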
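
To make the idea of specialist agents with a pluggable reasoning engine more concrete, here is a minimal Python sketch. It is not OpenDerisk's API: the names `ReasoningEngine`, `KeywordEngine`, `SpecialistAgent`, and `collaborate`, along with the sample evidence strings, are hypothetical and chosen only for illustration. The knowledge engine and MCP transport are left out, and the keyword matching stands in for whatever reasoning a real engine would perform.

```python
from dataclasses import dataclass
from typing import Protocol


class ReasoningEngine(Protocol):
    """Hypothetical plug-in interface; an assumption, not OpenDerisk's actual API."""

    def infer(self, evidence: list[str]) -> str: ...


class KeywordEngine:
    """Toy engine: flags the first evidence line that contains a keyword."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    def infer(self, evidence: list[str]) -> str:
        for line in evidence:
            if self.keyword in line:
                return f"suspect: {line}"
        return "no finding"


@dataclass
class SpecialistAgent:
    """An agent bound to one domain; its engine can be swapped freely."""

    domain: str
    engine: ReasoningEngine

    def diagnose(self, evidence: list[str]) -> str:
        return f"[{self.domain}] {self.engine.infer(evidence)}"


def collaborate(agents: list[SpecialistAgent], evidence: list[str]) -> list[str]:
    """Collect one finding per specialist, in the order the agents are given."""
    return [agent.diagnose(evidence) for agent in agents]


# Illustrative evidence and agents (made up for this sketch).
evidence = ["db: connection pool exhausted", "net: latency normal"]
agents = [
    SpecialistAgent("database", KeywordEngine("pool")),
    SpecialistAgent("network", KeywordEngine("packet loss")),
]
findings = collaborate(agents, evidence)
```

Because each agent depends only on the `infer` method, a different engine can be substituted without changing the agent or the collaboration loop, which is the property the word "pluggable" suggests.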
