DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

πŸ“… 2026-01-27
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the lack of systematic evaluation of AI agents’ capabilities across the full DevOps lifecycle. To this end, it introduces the first end-to-end DevOps benchmark, encompassing four core workflows: build configuration, monitoring, bug repair, and test generation. The benchmark is grounded in a large-scale open-source dataset comprising over 700 tasks derived from more than 30 real-world Java and Go projects. It employs a semi-automated data collection pipeline augmented with expert validation, integrates domain-specific tool interfaces and dynamic program analysis, and supports multi-language execution environments. Evaluation results reveal that current state-of-the-art AI models exhibit limited performance in bug repair and test generation and struggle significantly with emerging tasks such as monitoring and build configuration, thereby exposing fundamental limitations in their ability to automate comprehensive DevOps processes.

πŸ“ Abstract
Despite demonstrating extraordinary capabilities in code generation and software issue resolving, AI agents' capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, which spans developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack the environments and tool interfaces needed for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism with rigorous expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle newer tasks such as monitoring and build and configuration. These results highlight the need for foundational research on automating the full DevOps cycle with AI agents.
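The benchmark's shape described above, four workflow categories over multi-language projects with pass/fail validation, can be sketched as a minimal task schema. Everything here (the `Task` record, `Workflow` enum, `resolve_rate` metric, and the example repository names) is an illustrative assumption, not DevOps-Gym's actual data format:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical taxonomy mirroring the four DevOps-Gym workflow categories.
class Workflow(Enum):
    BUILD_CONFIG = "build and configuration"
    MONITORING = "monitoring"
    ISSUE_RESOLVING = "issue resolving"
    TEST_GENERATION = "test generation"

@dataclass
class Task:
    task_id: str
    workflow: Workflow
    project: str          # e.g. one of the 30+ real-world repositories
    language: str         # "java" or "go"
    passed: bool = False  # set after the execution environment runs its checks

def resolve_rate(tasks):
    """Fraction of tasks whose validation checks passed."""
    if not tasks:
        return 0.0
    return sum(t.passed for t in tasks) / len(tasks)

# Toy example: two tasks, one resolved.
tasks = [
    Task("t1", Workflow.ISSUE_RESOLVING, "example/java-proj", "java", passed=True),
    Task("t2", Workflow.MONITORING, "example/go-proj", "go", passed=False),
]
print(f"resolve rate: {resolve_rate(tasks):.2f}")
```

A per-workflow breakdown (grouping tasks by `Workflow` before computing the rate) would reproduce the kind of comparison the paper reports, where monitoring and build-configuration scores lag behind issue resolving and test generation.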
Problem

Research questions and friction points this paper is trying to address.

DevOps
AI agents
benchmark
software engineering
end-to-end evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DevOps-Gym
AI agents
end-to-end benchmark
software DevOps cycle
real-world tasks
πŸ”Ž Similar Papers
No similar papers found.