Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows

📅 2026-04-20
🤖 AI Summary
This study presents the first systematic evaluation of reliability differences among five AI agents (Claude, Devin, Cursor, Copilot, and Codex) in GitHub Actions CI/CD workflows. Leveraging the AIDev dataset, the authors collected 61,837 workflow runs via the GitHub Actions API and integrated CI logs, pull request metadata, and commit data to construct a taxonomy of 13 categories for pull requests whose workflows failed. Their analysis shows that Copilot and Codex achieve the highest success rates (~93% and ~94%, respectively), while, at the repository level, the frequency of AI contributions is negatively correlated with workflow success. Moreover, high-frequency AI involvement is associated with an increased likelihood of specific failure types. These findings provide empirical evidence and a practical framework for integrating AI-generated code into CI/CD pipelines, particularly in high-stakes development scenarios.

📝 Abstract
Continuous Integration and Deployment (CI/CD) workflows are central to modern software delivery, yet the reliability of agentic AI bots operating within these workflows remains underexplored. Using pull requests (PRs), commits, and repositories from the AIDev dataset, we retrieved the associated CI/CD workflow runs via the GitHub Actions API and analyzed 61,837 runs from 2,355 repositories, all triggered by PRs generated by five AI bots: Claude, Devin, Cursor, Copilot, and Codex. We observed substantial agent-dependent differences in workflow reliability, with Copilot and Codex achieving the highest success rates, at ~93% and ~94% respectively. At the repository level, we find a negative correlation between AI agent contribution frequency and workflow success rate, suggesting that a higher frequency of agentic PRs may hinder CI/CD workflow reliability. We defined a taxonomy of 13 categories covering 3,067 agentic PRs whose associated workflows failed. A trend analysis shows visually observable shifts from functional to non-functional PR categories over time, although these trends are not statistically significant. Our findings motivate the need for actionable guidance on integrating AI agents into CI/CD workflows and on prioritizing safeguards in workflows where failures are most likely to occur.
Problem

Research questions and friction points this paper is trying to address.

AI Bots
CI/CD Workflows
Reliability
GitHub Actions
Agentic PRs
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agents
CI/CD reliability
GitHub Actions
failure taxonomy
agentic software development