🤖 AI Summary
This study investigates fundamental bottlenecks in AI agents' ability to autonomously complete long-horizon, multi-file, enterprise-scale software engineering tasks. Method: the authors introduce SWE-Bench Pro, a rigorously constructed, high-difficulty benchmark of 1,865 real-world software engineering problems drawn from 41 actively maintained repositories, designed to evaluate long-horizon enterprise development. Tasks are partitioned into public, held-out, and commercial subsets to balance contamination resistance against real-world representativeness. All tasks are human-verified and augmented with sufficient context to ensure solvability, and leading coding models are evaluated under a unified agent scaffold, complemented by a clustering analysis of failure modes. Contribution/Results: the results reveal severe limitations: even the strongest model evaluated (GPT-5) achieves only 23.3% Pass@1, and all mainstream models score below 25%. These findings expose critical gaps in current AI systems' capacity for autonomous, complex engineering reasoning and execution.
📝 Abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-Bench [25] but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-Bench. SWE-Bench Pro contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories, and a commercial set of 18 proprietary repositories covered by formal partnership agreements with early-stage startups. Problems in the held-out and commercial sets are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may take a professional software engineer hours to days to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. Evaluating widely used coding models under a unified scaffold, we observe that their performance on SWE-Bench Pro remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories to characterize the error patterns exhibited by current models. Overall, SWE-Bench Pro provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous, professional-level software engineering agents.
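The Pass@1 figures above follow the convention common to coding benchmarks: the fraction of tasks resolved, typically computed with the unbiased pass@k estimator of Chen et al. (2021). A minimal sketch of that estimator, assuming per-task attempt counts; the function name and the sample data are illustrative, not from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    passes. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one attempt per task (n=1), pass@1 reduces to the resolve rate.
outcomes = [1, 0, 0, 0, 1]  # hypothetical per-task results (1 = resolved)
pass1 = sum(pass_at_k(1, c, 1) for c in outcomes) / len(outcomes)  # → 0.4
```

With multiple attempts per task, the estimator averages `pass_at_k(n, c, k)` over tasks instead of raw success counts, which removes the bias of naively subsampling k of the n attempts.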