SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

📅 2026-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing evaluations for code-generating agents, which predominantly measure single-shot pass rates while neglecting structural degradation and extensibility over long-term iterative development. To bridge this gap, we propose SlopCodeBench, a language-agnostic, long-horizon iterative coding benchmark comprising 20 problems and 93 checkpoints, and introduce two trajectory-level metrics: "verbosity" (the fraction of redundant or duplicated code) and "structural erosion" (the share of complexity concentrated in high-complexity functions). Experiments show that none of 11 state-of-the-art models solves any problem end-to-end, with a peak checkpoint pass rate of only 17.2%. Moreover, verbosity increases in 89.8% of agent trajectories and structural erosion in 80%; agent-generated code is 2.2× as verbose as human-written code and degrades continuously with each iteration, indicating that while prompt interventions can improve initial quality, they fail to prevent long-term deterioration.

📝 Abstract
Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
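The abstract describes the two trajectory-level signals only informally: verbosity as the fraction of redundant or duplicated code, and structural erosion as the share of complexity mass concentrated in high-complexity functions. The paper's exact definitions are not reproduced here; the following is a hypothetical stdlib-only sketch of how such metrics might be approximated, using exact duplicate lines as a proxy for redundancy and a rough cyclomatic count (1 + branching constructs) as a proxy for complexity. The `threshold` parameter and all function names are illustrative assumptions, not the paper's implementation.

```python
import ast

# Illustrative assumption: count these AST nodes as branching constructs.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)


def cyclomatic(fn: ast.AST) -> int:
    """Rough cyclomatic complexity: 1 + number of branching constructs."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(fn))


def verbosity(source: str) -> float:
    """Fraction of non-blank lines that exactly duplicate an earlier line.

    A crude stand-in for the paper's redundancy/duplication measure.
    """
    seen, dup, total = set(), 0, 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        total += 1
        if stripped in seen:
            dup += 1
        seen.add(stripped)
    return dup / total if total else 0.0


def structural_erosion(source: str, threshold: int = 5) -> float:
    """Share of total complexity mass held by high-complexity functions.

    `threshold` is an assumed cutoff separating "high-complexity" functions.
    """
    tree = ast.parse(source)
    fns = [n for n in ast.walk(tree)
           if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    masses = [cyclomatic(f) for f in fns]
    total = sum(masses)
    return sum(m for m in masses if m > threshold) / total if total else 0.0
```

Tracked per checkpoint over an agent's trajectory, rising values of either score would correspond to the degradation trends the paper reports (verbosity rising in 89.8% of trajectories, erosion in 80%).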
Problem

Research questions and friction points this paper is trying to address.

iterative coding
code quality degradation
long-horizon tasks
software extensibility
agent-based code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

iterative coding
code quality degradation
structural erosion
long-horizon software development
agent benchmarking