When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

📅 2026-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical gap in current code-agent evaluation frameworks, which typically assume that the full task specification is provided upfront, neglecting the iterative, emergent specification refinement inherent in real-world research programming. To bridge this gap, the authors introduce the first benchmark tailored to specification emergence, comprising 371 atomic verifiable components and interaction scripts of approximately 60 coding requests. They propose SLUMP (faithfulneSs Loss Under eMergent sPecification), a metric capturing the drop in final implementation faithfulness relative to a single-shot specification control, alongside ProjectGuard, an external project-state layer for specification tracking. Using a five-level fidelity rubric and an exposure audit, experiments on Claude Code and Codex show that incremental specification disclosure substantially degrades implementation fidelity. Notably, ProjectGuard recovers 90% of the fidelity gap on Claude Code, increasing fully faithful components from 118 to 181 and reducing severe failures from 72 to 49.

📝 Abstract
Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulneSs Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while semantic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an external project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faithful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct evaluation target for long-horizon coding agents.
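The abstract defines SLUMP as the reduction in implementation faithfulness under emergent specification relative to a single-shot control, scored per component on a five-level rubric. A minimal sketch of that comparison, assuming rubric scores of 0 (unimplemented) through 4 (fully faithful) and averaging over components (the function names and scoring scale here are illustrative, not taken from the paper's artifact):

```python
# Hypothetical sketch of the SLUMP comparison: faithfulness loss is the
# drop in mean per-component rubric score under emergent specification
# relative to the single-shot specification control.

def mean_faithfulness(scores):
    """Mean of per-component rubric scores (0..4) for one final repository."""
    return sum(scores) / len(scores)

def slump(single_shot_scores, emergent_scores):
    """Faithfulness loss: single-shot fidelity minus emergent-spec fidelity."""
    return mean_faithfulness(single_shot_scores) - mean_faithfulness(emergent_scores)

# Example: five components scored under both conditions.
single_shot = [4, 4, 3, 4, 2]   # control: full spec disclosed upfront
emergent    = [4, 2, 3, 1, 2]   # spec disclosed across ~60 interactive requests
loss = slump(single_shot, emergent)  # positive value = fidelity lost
```

A positive value indicates the agent lost fidelity when the specification emerged incrementally; zero or negative means it tracked the emerging design as well as (or better than) the single-shot control.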
Problem

Research questions and friction points this paper is trying to address.

emergent specification
faithfulness loss
long-horizon coding agents
specification tracking
coding agent benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

emergent specification
faithfulness loss
long-horizon coding agents
ProjectGuard
SLUMP