STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Frequent failures in cloud environments overwhelm traditional human-led Site Reliability Engineering (SRE), especially as scale and system complexity grow. Method: The paper proposes STRATUS, an end-to-end autonomous reliability engineering framework for cloud services. It uses large language models (LLMs) to drive a multi-agent system in which specialized SRE agents are organized in a state machine, and it formalizes a safety specification, Transactional No-Regression (TNR), that enables safe exploration and iteration during diagnosis and mitigation. Contribution/Results: On the AIOpsLab and ITBench benchmarks, STRATUS achieves at least 1.5× the failure mitigation success rate of state-of-the-art SRE agents across various models, pointing toward practical deployment of agentic systems for cloud reliability.

📝 Abstract
In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate of failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.
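The abstract does not spell out how TNR is defined, so the following is only a rough illustration of what a "transactional, no-regression" guard might look like, not the paper's formalization. It assumes, purely for the sake of the sketch, that system state is a dictionary of service statuses, that health can be summarized by a hypothetical `severity` count, and that a mitigation action is a plain function that mutates the state. The names `State`, `severity`, `apply_transactionally`, `bad_fix`, and `good_fix` are all invented here.

```python
import copy
from typing import Callable, Dict

# Hypothetical model of system state: service name -> "healthy" or "failing".
State = Dict[str, str]


def severity(state: State) -> int:
    """Hypothetical health metric: the number of services that are not healthy."""
    return sum(1 for status in state.values() if status != "healthy")


def apply_transactionally(state: State, action: Callable[[State], None]) -> bool:
    """Apply `action`, keeping its effects only if severity did not increase.

    Returns True if the action was committed, False if it was rolled back.
    """
    snapshot = copy.deepcopy(state)  # checkpoint taken before the attempt
    before = severity(state)
    try:
        action(state)
        regressed = severity(state) > before
    except Exception:
        regressed = True  # treat a crashing action as a regression
    if regressed:
        # Roll back in place so callers holding `state` see the old values.
        state.clear()
        state.update(snapshot)
        return False
    return True


def bad_fix(state: State) -> None:
    """Illustrative action that breaks a service that was healthy."""
    state["db"] = "failing"


def good_fix(state: State) -> None:
    """Illustrative action that repairs the failing service."""
    state["frontend"] = "healthy"


cluster: State = {"frontend": "failing", "db": "healthy"}
committed_bad = apply_transactionally(cluster, bad_fix)    # rolled back
after_bad = dict(cluster)
committed_good = apply_transactionally(cluster, good_fix)  # committed
```

In this toy model, an agent can try a risky action and still leave the system no worse than it found it, which is the intuition behind "safe exploration and iteration." A real deployment would need an undo mechanism for actual cluster changes rather than an in-memory copy.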
Problem

Research questions and friction points this paper is trying to address.

Autonomous reliability engineering for cloud-scale systems
Mitigating frequent failures in distributed computing clusters
Improving failure detection, diagnosis, and mitigation with AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based multi-agent system for SRE
Transactional No-Regression safety specification
Specialized agents in state machine
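
The "specialized agents in a state machine" idea can be pictured with a minimal sketch like the one below. It is an assumption-laden toy, not the STRATUS architecture: the phases (`DETECT`, `DIAGNOSE`, `MITIGATE`, `DONE`), the stub agents, and the rule that a failed mitigation returns to diagnosis are all hypothetical, and the real agents would call an LLM rather than return fixed values.

```python
from enum import Enum, auto
from typing import Any, Callable, Dict, List


class Phase(Enum):
    """Hypothetical phases; the paper's actual states may differ."""
    DETECT = auto()
    DIAGNOSE = auto()
    MITIGATE = auto()
    DONE = auto()


Context = Dict[str, Any]
Agent = Callable[[Context], Phase]  # each agent returns the next phase


def detect(ctx: Context) -> Phase:
    ctx["alert"] = "frontend latency high"  # stubbed detection result
    return Phase.DIAGNOSE


def diagnose(ctx: Context) -> Phase:
    ctx["root_cause"] = "bad config"  # stubbed diagnosis
    return Phase.MITIGATE


def mitigate(ctx: Context) -> Phase:
    ctx["attempts"] = ctx.get("attempts", 0) + 1
    # Pretend the first attempt fails, so control returns to diagnosis.
    return Phase.DONE if ctx["attempts"] >= 2 else Phase.DIAGNOSE


AGENTS: Dict[Phase, Agent] = {
    Phase.DETECT: detect,
    Phase.DIAGNOSE: diagnose,
    Phase.MITIGATE: mitigate,
}


def run(ctx: Context, max_steps: int = 10) -> List[Phase]:
    """Drive the agents from DETECT until DONE or the step budget runs out."""
    phase = Phase.DETECT
    trace = [phase]
    for _ in range(max_steps):
        if phase is Phase.DONE:
            break
        phase = AGENTS[phase](ctx)
        trace.append(phase)
    return trace


context: Context = {}
trace = run(context)
```

Routing every handoff through one explicit transition loop gives a single place to enforce step budgets or safety checks, which is one plausible reading of why a state machine would "assist system-level safety reasoning and enforcement."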