Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems

📅 2026-02-28
🤖 AI Summary
Existing root cause analysis (RCA) benchmarks struggle to balance ecological validity with reproducibility, which limits how well they can evaluate agents’ active reasoning in cloud systems. To address this gap, this work proposes Cloud-OpsBench, a reproducible benchmark that uses a State Snapshot Paradigm to build a deterministic digital twin of the cloud, covering 452 fault cases and 40 root cause types across the full Kubernetes stack. The benchmark serves three roles: a data engine that collects reasoning trajectories for supervised fine-tuning (SFT) of small language models, a reinforcement learning (RL) environment that turns high-risk operations into a safe, low-latency sandbox for policy optimization, and a diagnostic standard whose process-centric protocol exposes architectural bottlenecks in multi-agent RCA systems for Site Reliability Engineering (SRE).

📝 Abstract
The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinct fault cases across 40 root cause types spanning the full Kubernetes stack. Crucially, Cloud-OpsBench serves as enabling infrastructure for next-generation SRE research: (1) as a Data Engine, it harvests high-quality reasoning trajectories to bootstrap Supervised Fine-Tuning (SFT) for Small Language Models; (2) as a Reinforcement Learning (RL) environment, it transforms high-risk operations into a safe, low-latency sandbox for training policy optimization agents; and (3) as a Diagnostic Standard, its process-centric protocol uncovers architectural bottlenecks, guiding the design of robust, specialized multi-agent systems for RCA.
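The abstract does not describe how the State Snapshot Paradigm is implemented, so the following is only a minimal sketch of the general idea: diagnostic queries are answered from a frozen record of system state rather than from a live cluster, which makes every episode deterministic and safe. All names here (`FaultCase`, `SnapshotEnv`, `query`, `submit`, the `image_pull_failure` label, and the binary reward) are hypothetical illustrations and are not taken from Cloud-OpsBench.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultCase:
    """Hypothetical fault case: a frozen state snapshot plus its ground-truth label."""
    snapshot: dict   # maps a read-only diagnostic query to its recorded output
    root_cause: str  # assumed label format, chosen for illustration only


@dataclass
class SnapshotEnv:
    """Toy episodic environment that replays recorded state instead of a live cluster."""
    case: FaultCase
    max_steps: int = 5
    steps: int = field(default=0, init=False)

    def query(self, command: str) -> str:
        """Answer a query from the snapshot; identical input always yields identical output."""
        self.steps += 1
        return self.case.snapshot.get(command, "error: no recorded output for this query")

    def submit(self, diagnosis: str) -> float:
        """Assumed terminal reward: 1.0 for the correct root cause, otherwise 0.0."""
        return 1.0 if diagnosis == self.case.root_cause else 0.0

    @property
    def done(self) -> bool:
        """The episode ends once the step budget is used up."""
        return self.steps >= self.max_steps


if __name__ == "__main__":
    case = FaultCase(
        snapshot={"get pods": "web-0  0/1  ImagePullBackOff"},
        root_cause="image_pull_failure",
    )
    env = SnapshotEnv(case)
    print(env.query("get pods"))
    print(env.submit("image_pull_failure"))
```

Because nothing is executed against real infrastructure, the same fault case can be replayed any number of times, which is the property that both reproducible evaluation and low-latency RL rollouts depend on.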
Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis
Cloud Systems
Benchmark
Reproducibility
Ecological Validity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Root Cause Analysis
State Snapshot Paradigm
Digital Twin
Supervised Fine-Tuning
Reinforcement Learning Environment