Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of systematic comparison of AI programming assistants across different task types and temporal dimensions. Leveraging 7,156 real-world pull requests from the AIDev dataset, we conduct a hierarchical evaluation of five leading AI coding agents, integrating task categorization, time-series trend modeling, and chi-square tests. Our analysis reveals, for the first time, that task type exerts a substantially greater influence on pull request acceptance rates than model differences. Documentation tasks achieve a significantly higher acceptance rate (82.1%) compared to new feature development (66.1%). No single agent dominates across all tasks; instead, performance is complementary. Notably, Devin is the only agent exhibiting a statistically significant positive trend in acceptance rates over time, while Claude Code and Cursor lead in documentation/new features and bug-fixing tasks, respectively.
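The summary's headline finding, that documentation PRs (82.1%) are accepted far more often than new-feature PRs (66.1%), rests on a chi-square test of independence between task type and acceptance. A minimal sketch of that test, using a hand-rolled statistic and illustrative counts chosen to mirror the reported rates (the paper's raw per-task counts are not given here):

```python
# Chi-square test of independence: is PR acceptance associated with
# task type? The counts below are hypothetical (1,000 PRs per task,
# scaled to the reported 82.1% and 66.1% acceptance rates).

def chi2_stat(table):
    """Pearson chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: task type; columns: (accepted, rejected).
table = [
    [821, 179],  # documentation: 82.1% accepted
    [661, 339],  # new feature:   66.1% accepted
]

stat = chi2_stat(table)
# Critical value for df = (2-1)*(2-1) = 1 at alpha = 0.05 is 3.841.
print(f"chi2 = {stat:.1f}, significant: {stat > 3.841}")
```

At these (assumed) sample sizes the statistic lands far above the 3.841 critical value, which is consistent with the paper's claim that the task-type effect is statistically significant.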

📝 Abstract
The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across task types and over time remain limited. This paper presents an empirical study of five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas the other agents remain largely stable. Our analysis suggests that PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance versus 66.1% for new features, a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified chi-square tests confirming statistically significant advantages over other agents in several categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and new features (72.6%), while Cursor excels in bug-fix tasks (80.4%).
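The abstract's +0.77% per week figure for Devin comes from fitting a linear trend to weekly acceptance rates. A minimal sketch of that fit via ordinary least squares, applied to a synthetic 32-week series generated with exactly that drift (the real per-week rates are not reproduced here):

```python
# OLS slope of acceptance rate vs. week, as in a temporal trend
# analysis. The rate series below is synthetic: a 55% baseline
# (assumed) plus a +0.77 percentage-point drift per week.

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

weeks = list(range(32))
rates = [55.0 + 0.77 * w for w in weeks]  # acceptance rate in %

print(round(ols_slope(weeks, rates), 2))  # → 0.77
```

On real data the fit would also come with a significance test on the slope, which is how the paper distinguishes Devin's positive trend from the other agents' statistically flat series.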
Problem

Research questions and friction points this paper is trying to address.

AI coding agents
pull request acceptance
task stratification
empirical comparison
software development
Innovation

Methods, ideas, or system contributions that make the work stand out.

task-stratified analysis
AI coding agents
pull request acceptance
temporal trend analysis
empirical evaluation