DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

πŸ“… 2026-01-27
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the lack of systematic evaluation of AI agents’ capabilities across the full DevOps lifecycle. To this end, it introduces the first end-to-end DevOps benchmark, encompassing four core workflows: build configuration, monitoring, bug repair, and test generation. The benchmark is grounded in a large-scale open-source dataset comprising over 700 tasks derived from more than 30 real-world Java and Go projects. It employs a semi-automated data collection pipeline augmented with expert validation, integrates domain-specific tool interfaces and dynamic program analysis, and supports multi-language execution environments. Evaluation results reveal that current state-of-the-art AI models exhibit limited performance in bug repair and test generation and struggle significantly with emerging tasks such as monitoring and build configuration, thereby exposing fundamental limitations in their ability to automate comprehensive DevOps processes.

πŸ“ Abstract
Despite demonstrating extraordinary capabilities in code generation and software issue resolving, AI agents' capabilities across the full software DevOps cycle remain unknown. Unlike pure code generation, handling the DevOps cycle of real-world software, which spans developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack the environments and tool interfaces needed for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism with rigorous expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle newer tasks such as monitoring and build and configuration. These results highlight the need for foundational research on automating the full DevOps cycle with AI agents.
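The benchmark's shape described above, four workflow categories over multi-language projects with pass/fail validation, can be sketched as a minimal task schema. Everything here (the `Task` record, `Workflow` enum, `resolve_rate` metric, and the example repository names) is an illustrative assumption, not DevOps-Gym's actual data format:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical taxonomy mirroring the four DevOps-Gym workflow categories.
class Workflow(Enum):
    BUILD_CONFIG = "build and configuration"
    MONITORING = "monitoring"
    ISSUE_RESOLVING = "issue resolving"
    TEST_GENERATION = "test generation"

@dataclass
class Task:
    task_id: str
    workflow: Workflow
    project: str          # e.g. one of the 30+ real-world repositories
    language: str         # "java" or "go"
    passed: bool = False  # set after the execution environment runs its checks

def resolve_rate(tasks):
    """Fraction of tasks whose validation checks passed."""
    if not tasks:
        return 0.0
    return sum(t.passed for t in tasks) / len(tasks)

# Toy example: two tasks, one resolved.
tasks = [
    Task("t1", Workflow.ISSUE_RESOLVING, "example/java-proj", "java", passed=True),
    Task("t2", Workflow.MONITORING, "example/go-proj", "go", passed=False),
]
print(f"resolve rate: {resolve_rate(tasks):.2f}")
```

A per-workflow breakdown (grouping tasks by `Workflow` before computing the rate) would reproduce the kind of comparison the paper reports, where monitoring and build-configuration scores lag behind issue resolving and test generation.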
Problem

Research questions and friction points this paper is trying to address.

DevOps
AI agents
benchmark
software engineering
end-to-end evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DevOps-Gym
AI agents
end-to-end benchmark
software DevOps cycle
real-world tasks
πŸ”Ž Similar Papers
No similar papers found.