🤖 AI Summary
This work addresses the lack of empirical evaluation of large language models' (LLMs) ability to generate executable IT automation scripts, particularly for Ansible. We introduce ITAB, the first real-world benchmark of its kind, comprising 126 tasks that each account for state reconciliation, and we make state reconciliation a core evaluation dimension. Using dynamic execution validation in sandboxed Ansible environments together with failure-attribution analysis, we evaluate 14 open-source LLMs side by side. No model exceeds a pass@10 of 12%, and two error categories dominate: state-reasoning failures (44.87%) and deficits in Ansible module-specific knowledge (24.37%). The study shows that state tracking and domain-specific execution understanding are critical bottlenecks for LLMs in IT automation. Our findings provide empirically grounded insights and methodological foundations for both LLM improvement and future benchmark development.
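To make the state-reconciliation dimension concrete, below is a minimal, hypothetical Ansible task (not drawn from ITAB) showing the behavior a generated script must respect: the playbook declares a desired end state, and Ansible converges the host toward it, reporting "changed" only when it actually had to act.

```yaml
# Hypothetical illustration (not an ITAB task): Ansible reconciles the
# declared state with the host's actual state. If /etc/app.conf already
# has this content, owner, and mode, the task reports "ok" and changes
# nothing; otherwise it applies only the changes needed to converge.
- name: Ensure application config matches the declared state
  hosts: all
  become: true
  tasks:
    - name: Declare the desired state of /etc/app.conf
      ansible.builtin.copy:
        dest: /etc/app.conf
        content: "max_connections=100\n"
        owner: root
        group: root
        mode: "0644"
```

Generating code that respects this convergence property (for example, not blindly appending a line that may already be present) is exactly the kind of state reasoning the error analysis identifies as a bottleneck.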
📝 Abstract
LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs' ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which achieves a pass@10 rate above 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state-reconciliation-related reasoning (44.87%, combining variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37%, combining attribute/parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs' ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.
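The abstract does not spell out how pass@10 is computed; assuming the standard unbiased estimator from code-generation benchmarks (with $n$ samples drawn per task, of which $c$ pass the dynamic execution check), it would be:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
$$

Under that reading, a pass@10 of at most 12% means that even with ten attempts per task, the best model produces at least one working, state-correct playbook for only about one task in eight.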