🤖 AI Summary
Repository-level retrieval-augmented automated program repair (RAG-APR) improves with stronger fault localization, yet it is unclear which post-localization levers still offer recoverable gains and how large a residual frontier remains. This work proposes a unified evaluation protocol on SWE-bench Lite to analyze factors beyond localization that influence repair effectiveness: candidate diversity, the quality of added contextual evidence, and interface design. The protocol combines Oracle Localization, within-pool Best-of-K, fixed-interface added-context probes with same-token filler controls and same-repository hard negatives, and a common-wrapper oracle check. Experiments on Agentless, KGCompass, and ExpeRepair show that even with Oracle Localization, success stays below 50% for all three systems; gains from candidate diversity within the sampled 10-patch pools saturate quickly; most informative added-context conditions outperform their matched controls; under a common wrapper, gains remain large for KGCompass and ExpeRepair, while Agentless is more sensitive to builder choice; and the best fixed probe adds only 6 solved instances beyond the native three-system Solved@10 union.
📝 Abstract
Repository-level automated program repair (APR) increasingly treats stronger localization as the main path to better repair. We ask a more targeted question: once localization is strengthened, which post-localization levers still provide recoverable gains, which are bounded within our protocol, and what residual frontier remains? We study this question on SWE-bench Lite with three representative repository-level RAG-APR paradigms: Agentless, KGCompass, and ExpeRepair. Our protocol combines Oracle Localization, within-pool Best-of-K, fixed-interface added-context probes with per-condition same-token filler controls and same-repository hard negatives, and a common-wrapper oracle check. Oracle Localization improves all three systems, but Oracle success still stays below 50%. Extra candidate diversity still helps inside the sampled 10-patch pools, but that headroom saturates quickly. Under the two fixed interfaces, most informative added-context conditions still outperform their own matched controls. The common-wrapper check shows different system responses: under a common wrapper, gains remain large for KGCompass and ExpeRepair, while Agentless changes more with builder choice. Prompt-level fusion still leaves a large residual frontier: the best fixed probe adds only 6 solved instances beyond the native three-system Solved@10 union. Overall, stronger localization, bounded search, evidence quality, and interface design all shape repository-level repair outcomes.
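To make the within-pool Best-of-K and Solved@10 union quantities concrete, here is a minimal sketch. It assumes each system keeps a fixed pool of n sampled patches per instance and that Best-of-K is scored as the chance that at least one of K patches drawn without replacement from that pool is correct (the standard unbiased pass@k form). The paper's exact estimator is not given in the abstract, and the names `solved_at_k`, `union_solved`, and the sample data are hypothetical.

```python
from math import comb


def solved_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k patches drawn without replacement
    from a pool of n patches (c of them correct) is correct.

    Assumed estimator, not necessarily the one used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect patches exist,
    # so the result is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def union_solved(pools: dict[str, dict[str, int]]) -> set[str]:
    """Instances for which at least one system's pool holds a correct patch.

    `pools` maps system name -> {instance id -> number of correct patches}.
    """
    return {
        instance
        for per_system in pools.values()
        for instance, correct in per_system.items()
        if correct > 0
    }


# Hypothetical counts of correct patches in each system's 10-patch pool.
pools = {
    "system_a": {"inst-1": 3, "inst-2": 0, "inst-3": 0},
    "system_b": {"inst-1": 0, "inst-2": 1, "inst-3": 0},
}

# Best-of-K curve for one instance with 3 correct patches out of 10.
curve = [solved_at_k(10, 3, k) for k in range(1, 11)]
solved_union = union_solved(pools)
```

With K equal to the pool size, `solved_at_k` reduces to a 0/1 indicator of whether the pool contains any correct patch, which is the per-instance quantity the union aggregates. An instance such as `inst-3`, with no correct patch in any pool, lies outside the union no matter how K is chosen within the pool.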