SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing SRE benchmark tasks are overly simplified and fail to capture the complexity of fault diagnosis and mitigation in real-world production environments. This work proposes the first high-fidelity, scalable evaluation benchmark for SRE agents, built upon a realistic cloud-native system stack that dynamically simulates operational conditions. The benchmark incorporates a fault injector and noise simulator to support diverse failure modes—including metastable and correlated failures—and provides 90 realistic, challenging tasks. Designed with a modular architecture, it enables continuous extension and adaptation. Experimental results demonstrate significant performance disparities among state-of-the-art AI agents across different fault types, with end-to-end success rates varying by up to 40%, thereby validating the benchmark’s effectiveness and inherent difficulty.

📝 Abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, a practice known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplified SRE tasks and are hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various kinds of ambient noise, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.
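
To make the "modular framework that orchestrates fault and noise injectors" idea concrete, here is a minimal Python sketch. It is not SREGym's actual API: every name in it (`Injector`, `FaultInjector`, `NoiseInjector`, `Problem`, and the `state` dictionary standing in for a live system) is a hypothetical illustration of how one problem might compose several injectors behind a shared interface.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Injector(Protocol):
    """Hypothetical interface: perturb a system and later restore it."""

    def inject(self, state: dict) -> None: ...
    def recover(self, state: dict) -> None: ...


@dataclass
class FaultInjector:
    """Hypothetical fault: marks one component at one layer as failed."""

    layer: str
    component: str

    def inject(self, state: dict) -> None:
        state.setdefault("faults", []).append((self.layer, self.component))

    def recover(self, state: dict) -> None:
        state["faults"].remove((self.layer, self.component))


@dataclass
class NoiseInjector:
    """Hypothetical ambient noise: activity unrelated to the root cause."""

    kind: str

    def inject(self, state: dict) -> None:
        state.setdefault("noise", []).append(self.kind)

    def recover(self, state: dict) -> None:
        state["noise"].remove(self.kind)


@dataclass
class Problem:
    """Hypothetical benchmark problem: a named bundle of injectors."""

    name: str
    injectors: list[Injector] = field(default_factory=list)

    def setup(self, state: dict) -> None:
        # Apply every injector in order to build the failure scenario.
        for injector in self.injectors:
            injector.inject(state)

    def teardown(self, state: dict) -> None:
        # Undo injectors in reverse order to restore the system.
        for injector in reversed(self.injectors):
            injector.recover(state)


if __name__ == "__main__":
    state: dict = {}  # stand-in for a live system; assumption for illustration
    problem = Problem(
        "correlated-failure-demo",
        [
            FaultInjector("application", "cache"),
            FaultInjector("network", "dns"),
            NoiseInjector("unrelated-log-burst"),
        ],
    )
    problem.setup(state)
    print(state)
    problem.teardown(state)
    print(state)
```

In this sketch a correlated failure is simply two fault injectors in the same problem, and a new fault or noise type only needs to implement `inject` and `recover`, which is one plausible reading of "modular, extensible".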

Problem

Research questions and friction points this paper is trying to address.

SRE

AI agents

benchmark

failure scenarios

cloud-native systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

SREGym

AI SRE agents

high-fidelity failure simulation