A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the unreliability of existing repository-level software engineering evaluations, which are often compromised by synthetic tasks, prompt leakage, and temporal contamination. To mitigate these issues, the authors propose a time-consistent evaluation benchmark: a code repository is snapshotted at time T₀, and repository-derived knowledge is constructed exclusively from artifacts available before T₀. Natural-language tasks are derived from real pull requests merged during the interval (T₀, T₁], enabling a matched A/B comparison of the same agent with and without repository knowledge. The approach couples, for the first time, temporal consistency constraints with multi-granularity prompt control, rigorously decoupling knowledge sources from task timing. Experiments on the DragonFly and React repositories show that the strongest tested model achieves file-level F1 scores of 0.8081 and 0.8078, respectively, highlighting the decisive impact of prompt construction on evaluation outcomes.
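The temporal split described above can be sketched as a simple partition of merged pull requests around the snapshot time. This is an illustrative reconstruction, not the authors' pipeline; the `PullRequest` type and `split_by_snapshot` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PullRequest:
    """Minimal stand-in for a merged PR record (hypothetical)."""
    number: int
    merged_at: datetime


def split_by_snapshot(prs, t0, t1):
    """Partition merged PRs around the snapshot time T0.

    PRs merged at or before T0 may contribute to repository-derived
    knowledge; PRs merged in the future interval (T0, T1] become
    evaluation tasks, so no task information leaks into the knowledge base.
    """
    knowledge = [p for p in prs if p.merged_at <= t0]
    tasks = [p for p in prs if t0 < p.merged_at <= t1]
    return knowledge, tasks
```

Under this scheme, anything merged after T₁ is simply discarded, which keeps the evaluation window fixed even as the live repository moves on.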
📝 Abstract
Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
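The file-level F1 metric reported in the abstract compares the set of files an agent touches against the files changed in the reference pull request. A minimal sketch, assuming set-based scoring (the exact scoring rules are not specified in the abstract):

```python
def file_level_f1(predicted, reference):
    """F1 over sets of file paths: predicted edits vs. gold PR changes.

    Both arguments are iterables of file paths; duplicates are ignored.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # trivially perfect: nothing to change, nothing changed
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: one correct file, one spurious, one missed -> P = R = F1 = 0.5
score = file_level_f1({"src/a.py", "src/b.py"}, {"src/b.py", "src/c.py"})
```

A set-level metric like this rewards locating the right files regardless of how the edits within them are phrased, which matches the paper's emphasis on prompt construction rather than patch-exactness.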
Problem

Research questions and friction points this paper is trying to address.

repository-level evaluation
temporal contamination
prompt leakage
software engineering benchmark
time-consistent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

time-consistent benchmark
repository-level evaluation
temporal contamination
LLM-assisted prompt generation
A/B comparison