SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of existing code agent benchmarks: they rely on preconfigured environments, and their static evaluation pipelines are prone to parsing errors, so they fail to assess end-to-end autonomous problem-solving authentically. To bridge this gap, the authors introduce SWE-Cycle, the first end-to-end benchmark to include a FullCycle task, which integrates environment reconstruction, code implementation, and test generation. It comprises 489 rigorously curated instances and uses a novel bare-repository setting that eliminates manual scaffolding. The authors also develop SWE-Judge, a hybrid evaluation framework that combines static analysis with dynamic execution for more precise assessment. Experiments show a significant performance drop among state-of-the-art large language model agents on FullCycle tasks, highlighting critical bottlenecks in cross-phase coordination and sustained code quality.
📝 Abstract
As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide the friction of environment setup by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents on three isolated tasks (environment reconstruction, code implementation, and verification test generation) and on an end-to-end FullCycle task that integrates all three. The FullCycle task requires agents to work autonomously in a bare repository without human scaffolding. To reliably assess these complex execution paths, we developed SWE-Judge, an execution-capable evaluation agent. By combining static code review with dynamic testing, it accurately verifies functional correctness and eliminates the systematic measurement errors of traditional static parsers. We evaluate code agents powered by six state-of-the-art LLMs across these four tasks. The results reveal a sharp drop in solve rates when moving from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality. Together, SWE-Cycle and SWE-Judge provide a comprehensive framework for accurately measuring the end-to-end capabilities of autonomous software agents.
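The comparison described in the abstract, solve rates on the three isolated tasks against the FullCycle task, can be illustrated with a minimal sketch. The record layout, the function names, and the use of a simple mean over the isolated tasks are assumptions made for illustration, not the paper's method. The outcomes are placeholder values, not reported results.

```python
from collections import defaultdict

# Task names follow the abstract; the identifiers themselves are hypothetical.
ISOLATED_TASKS = ("environment", "implementation", "test_generation")
FULL_CYCLE = "full_cycle"


def solve_rates(records):
    """Return {task: fraction solved} from an iterable of (task, solved) pairs."""
    attempted = defaultdict(int)
    solved = defaultdict(int)
    for task, ok in records:
        attempted[task] += 1
        solved[task] += int(ok)
    return {task: solved[task] / attempted[task] for task in attempted}


def full_cycle_gap(rates):
    """Mean isolated-task solve rate minus the FullCycle solve rate.

    Averaging the isolated tasks is an assumption chosen for simplicity.
    """
    isolated = [rates[task] for task in ISOLATED_TASKS]
    return sum(isolated) / len(isolated) - rates[FULL_CYCLE]


# Placeholder outcomes for illustration only; these are not results from the paper.
toy_records = [
    ("environment", True), ("environment", False),
    ("implementation", True), ("implementation", True),
    ("test_generation", True), ("test_generation", False),
    ("full_cycle", False), ("full_cycle", False),
    ("full_cycle", True), ("full_cycle", False),
]

rates = solve_rates(toy_records)
gap = full_cycle_gap(rates)
print(rates)
print(f"isolated-to-FullCycle gap: {gap:.3f}")
```

A positive gap means agents solve isolated tasks more often than the integrated one, which is the pattern the abstract reports qualitatively.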
Problem

Research questions and friction points this paper is trying to address.

autonomous code agents
software development
benchmarking
end-to-end evaluation
issue resolution cycle
Innovation

Methods, ideas, or system contributions that make the work stand out.

SWE-Cycle
autonomous code agents
end-to-end evaluation
SWE-Judge
FullCycle task