🤖 AI Summary
Existing root cause analysis benchmarks struggle to balance ecological validity with reproducibility, limiting their ability to evaluate agents' proactive reasoning capabilities in cloud systems. To address this gap, this work proposes a reproducible digital-twin benchmark for cloud systems built on a state-snapshot paradigm, comprising 452 fault cases and 40 root cause categories across the full Kubernetes stack. The benchmark integrates a data engine, a reinforcement learning environment, and diagnostic criteria into a unified framework, supporting both supervised fine-tuning and policy optimization. It further provides a process-oriented diagnostic protocol and a secure sandbox, offering high-quality infrastructure for training and evaluating small language models and multi-agent systems in Site Reliability Engineering (SRE) scenarios.
📝 Abstract
The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinct fault cases across 40 root cause types spanning the full Kubernetes stack. Crucially, Cloud-OpsBench serves as enabling infrastructure for next-generation SRE research: (1) as a Data Engine, it harvests high-quality reasoning trajectories to bootstrap Supervised Fine-Tuning (SFT) for Small Language Models; (2) as a Reinforcement Learning (RL) environment, it transforms high-risk operations into a safe, low-latency sandbox for training policy-optimization agents; and (3) as a Diagnostic Standard, its process-centric protocol uncovers architectural bottlenecks, guiding the design of robust, specialized multi-agent systems for RCA.