🤖 AI Summary
Current evaluations of code-generating agents are confined to isolated, single-PR tasks, failing to capture the complexities of real-world software development, such as code evolution, technical debt accumulation, and growing test suites. This work proposes the first multidimensional evaluation framework tailored for continuous software evolution, introducing two realistic scenarios: conversational iterative development and requirement-document-driven development. The framework automatically generates interdependent, long-horizon coding tasks and incorporates dependency-chain pull request simulation, regression validation, and quantitative analysis of cognitive complexity and technical debt. Experimental results reveal that while existing agents can complete isolated tasks, they significantly degrade codebase health, producing code with higher cognitive complexity and greater technical debt than human developers. Moreover, conventional evaluation methods overestimate agent performance by up to 20 percentage points.
📝 Abstract
Existing datasets for coding agents evaluate performance on isolated, single pull request (PR) tasks in a stateless manner, failing to capture the reality of real-world software development, where code changes accumulate, technical debt accrues, and test suites grow over time. To bridge this gap, we introduce an automated coding task generation framework, which we use to construct our dataset SWE-STEPS, for evaluating coding agents on long-horizon tasks through two realistic settings that mirror actual developer workflows: conversational coding with iterative requests, and single-shot project requirement document (PRD)-based coding. Unlike existing datasets that evaluate agents on disjointed PRs, our framework assesses performance across chains of dependent PRs, enabling evaluation of sequential execution, regression verification, and long-term repository health. We discover that widely used isolated-PR evaluations yield inflated success rates relative to our settings, overstating performance by as much as 20 percentage points, because they ignore the "spillover" effects of previously introduced inefficient or buggy code. Furthermore, our analysis reveals that even when agents successfully resolve issues, they degrade repository health by generating code with higher cognitive complexity and technical debt than human developers, underscoring the necessity of multidimensional evaluation.