🤖 AI Summary
This study addresses the poorly understood failure modes of large language models (LLMs) in real-world GitHub issue repair, a gap that hinders their reliable deployment. The authors propose the first unified failure taxonomy spanning all five stages of the repair pipeline and manually analyze 243 failed cases on the SWE-bench Verified dataset, produced by Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5. Their systematic diagnosis reveals that strategy formulation and logic synthesis is the weakest stage, and that existing evaluation frameworks may misclassify correct patches as failures. Notably, LLMs demonstrate stronger fault localization capabilities than traditionally assumed. The work further quantifies how failures are distributed across repair stages for each model, offering actionable insights for targeted improvements.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues, yet their specific failure modes in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, namely Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. First, our investigation yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Second, we find that for all evaluated LLMs, strategy formulation and logic synthesis is the most error-prone stage, followed by problem understanding, whereas localization exhibits the lowest failure rate. This suggests that LLMs may excel at fault localization, a task traditionally regarded as one of the most formidable challenges in automated program repair. Furthermore, we observe that robustness and operational costs (particularly in failure scenarios) vary significantly across models. Finally, we uncover the root causes of these failures and propose actionable strategies to mitigate them. A particularly notable finding is that existing evaluation harnesses occasionally misjudge correct patches due to superficial discrepancies or hidden constraints. Collectively, our insights may provide promising directions for enhancing the effectiveness and reliability of LLM-based issue resolution.