Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study

📅 2026-05-21

🤖 AI Summary

Evaluations of AI coding agents commonly judge performance by whether their pull requests (PRs) are merged or rejected, overlooking how code review interactions shape those outcomes. Starting from 11,048 closed agentic PRs, refined to 9,799 human-reviewed PRs, the study manually inspects 717 representative cases and finds that PR outcomes poorly reflect agents’ capabilities: only 35.7% of rejected PRs stem from clear agent failures, while 31.2% are driven by workflow constraints and 33.1% lack an observable decision rationale (64.3% combined); among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits. The work argues for interaction-aware evaluation grounded in review behavior and reports systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows and Codex and Cursor PRs typically merged with minimal interaction.

📝 Abstract

AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover decision rationale from interaction artifacts. We found that rejection outcomes substantially overstate agent error: only 35.7% of rejected PRs reflected clear agentic failures, while 31.2% were driven by workflow constraints and 33.1% lacked observable decision rationale. Among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits, and 5.5% showed no visible interaction trace. We further observed systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows, while Codex and Cursor PRs were typically merged with minimal interaction. These results reject the assumption that PR outcomes alone capture agent performance and demonstrate the need for interaction-aware evaluation grounded in review behavior.
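The rejection breakdown reported above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis pipeline: the category keys (`clear_agent_failure`, `workflow_constraint`, `no_observable_rationale`) and both helper functions are hypothetical names introduced here for illustration, and only the three percentages come from the abstract.

```python
from collections import Counter

# Shares of rejected PRs by decision rationale, in percent, as reported
# in the abstract. The key names are hypothetical labels, not the paper's.
REJECTED_SHARES = {
    "clear_agent_failure": 35.7,
    "workflow_constraint": 31.2,
    "no_observable_rationale": 33.1,
}


def non_agent_share(shares):
    """Percent of rejected PRs not attributed to a clear agent failure."""
    total = sum(v for k, v in shares.items() if k != "clear_agent_failure")
    return round(total, 1)


def rationale_shares(labels):
    """Convert a list of per-PR rationale labels into percentage shares."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: round(100 * n / len(labels), 1) for label, n in counts.items()}


if __name__ == "__main__":
    # Sanity check: the three reported categories cover all rejected PRs.
    print(round(sum(REJECTED_SHARES.values()), 1))
    # Share of rejections that an outcome-only metric would misattribute.
    print(non_agent_share(REJECTED_SHARES))
```

Given a list of manually assigned labels (one per inspected PR), `rationale_shares` would produce the same kind of breakdown; treating a rejection as an agent failure only when its label says so is the interaction-aware reading the abstract argues for.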

Problem

Research questions and friction points this paper is trying to address.

Agentic Pull Requests

Code Review

AI Coding Agents

Merge Decision

Interaction Artifacts

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic pull requests

code review interaction

empirical study