OpenDerisk: An Industrial Framework for AI-Driven SRE, with Design, Implementation, and Case Studies

📅 2025-10-15
🤖 AI Summary
The growing complexity of modern software systems places a heavy operational burden on Site Reliability Engineering (SRE) teams, while existing AI-based solutions either lack deep causal reasoning or fit poorly with the investigative workflows specific to SRE. To address these challenges, the paper introduces OpenDerisk, an open-source multi-agent framework designed specifically for SRE. It combines four components: (1) a diagnostic-native collaboration model that coordinates specialist agents; (2) a pluggable reasoning engine; (3) a knowledge engine; and (4) a standardized protocol, the Model Context Protocol (MCP). The authors report that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Deployed at scale in Ant Group's production environment, it serves over 3,000 daily users across diverse scenarios.

📝 Abstract
The escalating complexity of modern software imposes an unsustainable operational burden on Site Reliability Engineering (SRE) teams, demanding AI-driven automation that can emulate expert diagnostic reasoning. Existing solutions, from traditional AI methods to general-purpose multi-agent systems, fall short: they either lack deep causal reasoning or are not tailored for the specialized, investigative workflows unique to SRE. To address this gap, we present OpenDerisk, a specialized, open-source multi-agent framework architected for SRE. OpenDerisk integrates a diagnostic-native collaboration model, a pluggable reasoning engine, a knowledge engine, and a standardized protocol (MCP) to enable specialist agents to collectively solve complex, multi-domain problems. Our comprehensive evaluation demonstrates that OpenDerisk significantly outperforms state-of-the-art baselines in both accuracy and efficiency. This effectiveness is validated by its large-scale production deployment at Ant Group, where it serves over 3,000 daily users across diverse scenarios, confirming its industrial-grade scalability and practical impact. OpenDerisk is open source and available at https://github.com/derisk-ai/OpenDerisk/
Problem

Research questions and friction points this paper is trying to address.

Addresses AI-driven automation for complex SRE operations
Overcomes limitations of existing non-specialized multi-agent systems
Provides industrial framework for expert diagnostic reasoning workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenDerisk is a specialized multi-agent framework for SRE
It integrates diagnostic-native collaboration and pluggable reasoning engines
The framework enables specialist agents to collectively solve complex, multi-domain problems
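
To make the idea of specialist agents sharing a pluggable reasoning engine more concrete, here is a minimal Python sketch. It is not OpenDerisk's API: every name in it (`ReasoningEngine`, `ThresholdEngine`, `SpecialistAgent`, `diagnose_incident`) is hypothetical, the threshold rule is a toy stand-in for real reasoning, and MCP and the knowledge engine are left out entirely. It only illustrates the general pattern of swapping the engine behind domain-specific agents and ranking their findings.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Finding:
    agent: str
    hypothesis: str
    confidence: float  # 0.0 to 1.0, as reported by the reasoning engine


class ReasoningEngine(Protocol):
    """Hypothetical plug-in interface: any object with this method can be swapped in."""

    def infer(self, domain: str, evidence: dict[str, float]) -> tuple[str, float]: ...


class ThresholdEngine:
    """Toy engine (an assumption for illustration): flags the highest metric if it crosses a threshold."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold

    def infer(self, domain: str, evidence: dict[str, float]) -> tuple[str, float]:
        if not evidence:
            return (f"no {domain} evidence", 0.0)
        metric, value = max(evidence.items(), key=lambda kv: kv[1])
        if value < self.threshold:
            return (f"{domain} looks healthy", 0.0)
        return (f"{domain}: {metric} is saturated", min(value, 1.0))


class SpecialistAgent:
    """A domain-specific agent that delegates its reasoning to whichever engine it is given."""

    def __init__(self, domain: str, engine: ReasoningEngine) -> None:
        self.domain = domain
        self.engine = engine

    def diagnose(self, evidence: dict[str, float]) -> Finding:
        hypothesis, confidence = self.engine.infer(self.domain, evidence)
        return Finding(agent=self.domain, hypothesis=hypothesis, confidence=confidence)


def diagnose_incident(
    agents: list[SpecialistAgent],
    evidence_by_domain: dict[str, dict[str, float]],
) -> list[Finding]:
    """Ask each specialist for a finding and rank the results by confidence."""
    findings = [
        agent.diagnose(evidence_by_domain.get(agent.domain, {})) for agent in agents
    ]
    return sorted(findings, key=lambda f: f.confidence, reverse=True)


# Example usage with made-up utilisation ratios.
engine = ThresholdEngine(threshold=0.8)
agents = [SpecialistAgent("network", engine), SpecialistAgent("database", engine)]
ranked = diagnose_incident(
    agents,
    {"database": {"cpu": 0.95, "connections": 0.5}, "network": {"packet_loss": 0.1}},
)
for finding in ranked:
    print(finding)
```

Because `ReasoningEngine` is a structural `Protocol`, a different engine (for example, one backed by an LLM) could replace `ThresholdEngine` without changing `SpecialistAgent` or the coordinator.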
Authors

Peng Di: Senior Staff Engineer at Ant Group; Adjunct Associate Professor at UNSW Sydney (Parallel Computing, Programming Language, Compiler, Software Engineering)
Faqiang Chen: Ant Group, China
Xiao Bai: Professor of Computer Science, Beihang University (pattern recognition, computer vision)
Hongjun Yang: Ant Group, China
Qingfeng Li: Ant Group, China
Ganglin Wei: Ant Group, China
Jian Mou: Ant Group, China
Feng Shi: Ant Group, China
Keting Chen: Ant Group, China
Peng Tang: Meta (Multi-modal LLM, Vision Language, Computer Vision)
Zhitao Shen: Ant Group (database, data storage)
Zheng Li: Ant Group, China
Wenhui Shi: Ant Group, China
Junwei Guo: Ant Group, China
Hang Yu: Ant Group, China