ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

📅 2025-02-07
🤖 AI Summary
Existing AI agent evaluation frameworks lack realistic benchmarks for production IT automation tasks, particularly in critical domains such as Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). Method: We introduce ITBench, the first domain-specific benchmark framework covering these three pillars, comprising 94 reproducible, scalable real-world scenarios. It establishes a structured evaluation taxonomy spanning reliability, security compliance, and financial operational efficiency, and supports community-driven extension and end-to-end automated assessment. The LLM-based evaluation infrastructure integrates task orchestration, sandboxed execution, and quantitative multi-dimensional metrics (correctness, security, timeliness). Contribution/Results: Empirical evaluation reveals severe capability gaps: agents powered by state-of-the-art models achieve success rates of only 13.8%, 25.2%, and 0% on SRE, CISO, and FinOps tasks, respectively. ITBench provides the first systematic diagnosis of AI agents' limitations in mission-critical IT operations, delivering a reproducible benchmark and actionable insights for future research.
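The headline numbers above are per-pillar success rates, i.e., the fraction of scenarios an agent resolves in each domain. A minimal sketch of that aggregation is shown below; the data layout and scenario counts are illustrative assumptions, not ITBench's actual API or dataset.

```python
# Hypothetical sketch: aggregate per-scenario pass/fail outcomes into
# per-pillar success rates, as reported for SRE, CISO, and FinOps.
from collections import defaultdict

def success_rates(results):
    """results: list of (pillar, passed) tuples -> {pillar: percent passed}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for pillar, passed in results:
        totals[pillar] += 1
        if passed:
            passes[pillar] += 1
    return {p: round(100.0 * passes[p] / totals[p], 1) for p in totals}

# Illustrative run with made-up outcomes over a handful of scenarios.
demo = [("SRE", True)] + [("SRE", False)] * 6 + \
       [("CISO", True)] + [("CISO", False)] * 3 + \
       [("FinOps", False)] * 4
print(success_rates(demo))  # {'SRE': 14.3, 'CISO': 25.0, 'FinOps': 0.0}
```

The real benchmark evaluates many more scenarios per pillar (94 in total), but the aggregation into an interpretable per-domain percentage follows this shape.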

📝 Abstract
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents in IT automation
Benchmarking AI for real-world IT tasks
Assessing AI effectiveness in SRE, CISO, FinOps
Innovation

Methods, ideas, or system contributions that make the work stand out.

ITBench benchmarking AI agents
Push-button workflows for automation
94 real-world IT scenarios
👥 Authors
Saurabh Jha — Sr. Research Scientist, IBM (ML for Systems, Systems for ML, Reliability)
Rohan Arora — IBM
Yuji Watanabe — IBM Research - Tokyo (Security, Compliance, Cloud)
Takumi Yanagawa — IBM
Yinfang Chen — University of Illinois Urbana-Champaign (System Reliability, ML for Systems, Systems for ML, System Security)
Jackson Clark — PhD Student (AIOps, Site Reliability Engineering, Chaos Engineering, System Reliability)
Bhavya Bhavya — IBM
Mudit Verma — IBM
Harshit Kumar — Whiterabbit.ai, Inc. (Deep Learning, Security, Hardware Security and Trust)
Hirokuni Kitahara — IBM
Noah Zheutlin — IBM
Saki Takano — IBM
Divya Pathak — Indian Institute of Technology Hyderabad (Software Defined Networking, In-Network Systems and Security)
Felix George — IBM
Xinbo Wu — University of Illinois at Urbana-Champaign
Bekir O. Turkkan — IBM
Gerard Vanloo — IBM
Michael Nidd — IBM Research (Computer Networks, Computer Security, Machine Learning, Cloud Operations)
Ting Dai — IBM
Oishik Chatterjee — IBM
Pranjal Gupta — IBM
Suranjana Samanta — Research Scientist (NLP, Machine Learning, Computer Vision)
Pooja Aggarwal — IBM Research
Rong Lee — IBM
Pavankumar Murali — IBM
Jae-wook Ahn — Research Staff Member, IBM Research (Personalization, Adaptive E-learning, Human-Computer Interaction, Adaptive Visualization, Personalized/Exploratory Search/RecSys)
Debanjana Kar — IBM
Ameet Rahane — IBM
Carlos Fonseca — IBM
Amit Paradkar — IBM
Yu Deng — IBM
Pratibha Moogi — IBM
Prateeti Mohapatra — IBM Research, Bangalore
Naoki Abe — IBM
Chandrasekhar Narayanaswami — IBM
Tianyin Xu — University of Illinois at Urbana-Champaign (Software/System Reliability, Operating Systems, Distributed Systems, Software Engineering)
Lav R. Varshney — Stony Brook University (Artificial Intelligence, Information Theory, Signal Processing, Neuroscience, Network Science)
Ruchi Mahindru — IBM T. J. Watson Research Center
Anca Sailer — IBM
Laura Shwartz — IBM
Daby Sow — Director, IBM T. J. Watson Research Center (AI for IT Automation)
Nicholas C. M. Fuller — IBM
Ruchir Puri — IBM