🤖 AI Summary
Existing AI agent evaluation frameworks lack realistic benchmarks for production IT automation tasks, particularly in critical domains such as Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps).
Method: We introduce ITBench, the first domain-specific benchmark framework covering these three areas, with an initial set of 94 reproducible, scalable real-world scenarios. It establishes a structured evaluation taxonomy spanning reliability, security compliance, and financial operations efficiency, and it supports community-driven extension and end-to-end automated assessment. Our evaluation infrastructure for LLM-based agents integrates task orchestration, sandboxed execution, and quantitative multi-dimensional metrics (correctness, security, timeliness).
Contribution/Results: Empirical evaluation reveals severe capability gaps: agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. ITBench provides the first systematic diagnosis of AI agents’ limitations in mission-critical IT operations, delivering a reproducible benchmark and actionable insights for future research.
📝 Abstract
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents on real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.