AI Summary
Existing benchmarks for code-capable agents predominantly emphasize bug fixing, overlook other critical software engineering tasks such as code-related question answering, test generation, and code refactoring, and often lack comprehensive assessment of code quality. This work proposes a benchmark suite that systematically covers these three task categories, using under-specified instructions to simulate real-world development scenarios. It combines programmatic validation with rubric-based assessment and analyzes runtime-driven reasoning and repository exploration to assess agents' engineering capabilities beyond mere functional correctness. Experimental results indicate that GPT-5.4 and Opus 4.7 achieve the strongest performance, while open-weight models generally underperform, and even state-of-the-art systems struggle with subtle edge cases, complex runtime analysis, and software engineering best practices.
Abstract
We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.