🤖 AI Summary
Existing root cause analysis benchmarks struggle to balance ecological validity with reproducibility, limiting their ability to evaluate agents' proactive reasoning capabilities in cloud systems. To address this gap, this work proposes a reproducible digital-twin benchmark for cloud systems built on a state-snapshot paradigm, comprising 452 fault cases and 40 root cause categories across the full Kubernetes stack. The benchmark integrates a data engine, a reinforcement learning environment, and diagnostic criteria into a unified framework, supporting both supervised fine-tuning and policy optimization. It further provides a process-oriented diagnostic protocol and a secure sandbox, offering high-quality infrastructure for training and evaluating small language models and multi-agent systems in Site Reliability Engineering (SRE) scenarios.
📝 Abstract
The transition to agentic Root Cause Analysis (RCA) necessitates benchmarks that evaluate active reasoning rather than passive classification. However, current frameworks fail to reconcile ecological validity with reproducibility. We introduce Cloud-OpsBench, a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud, featuring 452 distinct fault cases across 40 root cause types spanning the full Kubernetes stack. Crucially, Cloud-OpsBench serves as enabling infrastructure for next-generation SRE research: (1) as a Data Engine, it harvests high-quality reasoning trajectories to bootstrap Supervised Fine-Tuning (SFT) for Small Language Models; (2) as a Reinforcement Learning (RL) environment, it transforms high-risk operations into a safe, low-latency sandbox for training policy-optimization agents; and (3) as a Diagnostic Standard, its process-centric protocol uncovers architectural bottlenecks, guiding the design of robust, specialized multi-agent systems for RCA.