🤖 AI Summary
Existing code generation benchmarks focus primarily on single-turn tasks, which do not adequately assess an agent’s ability to keep a codebase correct as requirements evolve. This work proposes a multi-turn, state-preserving evaluation paradigm for code generation and introduces EvoCode-Bench, a benchmark comprising 26 stateful programming tasks and 227 evaluated rounds, released together with the Harbor multi-turn infrastructure. EvoCode-Bench features persistent workspaces, requirements specified through observable behavior, and cumulative executable tests. Using two metrics, MT@4 (a four-attempt, fail-stop multi-round score) and SR (a single-round score measured from a reference-completed prior state), the study evaluates 13 coding agents and finds large gaps (22-40 points for most agents) and ranking reversals between the single-round and multi-round settings. Even the strongest agents achieve only about 50% multi-turn success, and the aggregate pass rate falls below half of its round-1 level by round 5. Failure patterns differ by tier: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures.
📝 Abstract
Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.
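To make the difference between the two metrics concrete, here is a minimal Python sketch of a fail-stop multi-round score next to a single-round score. It is an illustration under stated assumptions, not the paper's definition: the abstract does not give the exact formulas for MT@4 or SR, so the function names, the callback signatures, and the choice to score a task as the fraction of rounds passed are all hypothetical.

```python
from typing import Callable

# Hypothetical callback: attempt(round_index, attempt_index) returns True if the
# cumulative test suite for that round passes after this attempt.
Attempt = Callable[[int, int], bool]


def fail_stop_score(num_rounds: int, attempt: Attempt, max_attempts: int = 4) -> float:
    """Fraction of rounds passed before the first round that exhausts its attempts.

    Assumption: a task's score is rounds_passed / num_rounds. The benchmark's
    actual MT@4 aggregation may differ.
    """
    if num_rounds <= 0:
        raise ValueError("num_rounds must be positive")
    passed = 0
    for r in range(num_rounds):
        # any() short-circuits, so attempts stop at the first passing one.
        if not any(attempt(r, a) for a in range(max_attempts)):
            break  # fail-stop: later rounds are never attempted
        passed += 1
    return passed / num_rounds


def single_round_score(num_rounds: int, solve_from_reference: Callable[[int], bool]) -> float:
    """Fraction of rounds solved when each round starts from a reference state.

    Assumption: every round is scored independently, so one failure does not
    prevent later rounds from counting.
    """
    if num_rounds <= 0:
        raise ValueError("num_rounds must be positive")
    return sum(1 for r in range(num_rounds) if solve_from_reference(r)) / num_rounds


# Toy agent: solves every round except the third, regardless of retries.
outcomes = [True, True, False, True, True]
multi = fail_stop_score(5, lambda r, a: outcomes[r])
single = single_round_score(5, lambda r: outcomes[r])
print(multi, single)
```

In this toy case the same per-round ability yields 2 of 5 rounds under fail-stop scoring but 4 of 5 under independent scoring, which is the kind of gap the abstract reports between SR and MT@4.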