🤖 AI Summary
This study investigates fundamental bottlenecks in AI agents' ability to autonomously complete long-horizon, multi-file, enterprise-scale software engineering tasks. Method: the authors introduce SWE-Bench Pro, a rigorously constructed, high-difficulty benchmark of 1,865 real-world software engineering problems drawn from 41 actively maintained repositories, designed to evaluate long-horizon enterprise development. Tasks are partitioned into public, held-out, and commercial subsets to balance contamination resistance against real-world representativeness. All tasks are human-verified and augmented with sufficient context to ensure solvability, and leading coding models are evaluated under a unified agent scaffold, complemented by a clustering analysis of failure modes. Contribution/Results: the results reveal severe limitations: even the strongest model evaluated (GPT-5) achieves only 23.3% Pass@1, and all mainstream models score below 25%. These findings expose critical gaps in current AI systems' capacity for autonomous, complex engineering reasoning and execution.
📝 Abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories, and a commercial set of 18 proprietary repositories covered by formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may take a professional software engineer hours to days to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. Evaluating widely used coding models under a unified scaffold, we observe that their performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories to characterize the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous, professional-level software engineering agents.
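The Pass@1 figures above follow the convention common to coding benchmarks: the fraction of tasks resolved, typically computed with the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch of that estimator, assuming per-task attempt counts; the function name and the sample data are illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    passes. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one attempt per task (n=1), pass@1 reduces to the resolve rate.
outcomes = [1, 0, 0, 0, 1]  # hypothetical per-task results (1 = resolved)
pass1 = sum(pass_at_k(1, c, 1) for c in outcomes) / len(outcomes)  # → 0.4
```

With multiple attempts per task, the estimator averages `pass_at_k(n, c, k)` over tasks instead of raw success counts, which removes the bias of naively subsampling k of the n attempts.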