Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing code-editing benchmarks lack ecological validity in their programming-language distribution, edit intents, and application scenarios, limiting how accurately they reflect model capabilities. This work presents the first systematic audit of CanItEdit and EDIT-Bench, grounded in authentic developer data from Copilot Arena and GitHub Octoverse. Through quantitative test counts, statement-coverage analysis, fail-before/pass-after validation, and cross-problem codebase-duplication detection, the study reveals an overemphasis on Python, under-representation of TypeScript, and neglect of front-end and back-end development contexts, alongside validity flaws in several test suites. Grounded in these empirical findings, the paper proposes six principles for constructing more representative code-editing benchmarks and releases all audit artifacts to support community efforts toward ecologically valid evaluation.
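The fail-before/pass-after validation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual harness: `suite_passes`, `classify_problem`, the test command, and the directory layout are all hypothetical names introduced here.

```python
import subprocess

def suite_passes(test_cmd, workdir):
    """Run a problem's test command in `workdir`; True iff it exits 0.
    (Illustrative runner -- real harnesses also handle timeouts, sandboxing, etc.)"""
    return subprocess.run(test_cmd, cwd=workdir, capture_output=True).returncode == 0

def classify_problem(pre_edit_pass: bool, post_edit_pass: bool) -> str:
    """Fail-before/pass-after: a valid problem's tests must fail on the
    unedited code and pass on the reference edit."""
    if not pre_edit_pass and post_edit_pass:
        return "valid"
    if pre_edit_pass:
        # Suite already passes before editing, so it cannot detect
        # whether the requested edit was actually made.
        return "vacuous"
    # Suite rejects even the reference solution.
    return "broken"
```

A problem only yields a meaningful score when it classifies as "valid"; "vacuous" suites inflate pass rates and "broken" ones deflate them regardless of model quality.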
πŸ“ Abstract
Instructed code editing, where an LLM modifies existing code based on a natural language instruction, accounts for roughly 19% of real-world coding assistant interactions. Yet very few benchmarks directly evaluate this capability. From a survey of over 150 code-related benchmarks, we find that only two, CanItEdit and EDIT-Bench, target instructed code editing with human-authored instructions and test-based evaluation. We audit both by comparing their programming languages, edit intents, and application domains against distributions observed in the wild (Copilot Arena, AIDev, GitHub Octoverse), and by measuring test counts, statement coverage, and test scope across all 213 problems. Both benchmarks concentrate over 90% of evaluation on Python, while TypeScript, GitHub's most-used language, is absent. Backend and frontend development, which together constitute 46% of real-world editing activity, are largely missing, and documentation, testing, and maintenance edits (31.4% of human PRs) have zero representation. Both benchmarks have modest test counts (CanItEdit median 13, EDIT-Bench median 4), though CanItEdit compensates with near-complete whole-file coverage and fail-before/pass-after validation. 59% of EDIT-Bench's low-coverage suites would not detect modifications outside the edit region. EDIT-Bench contains 15 problems unsolved by any of 40 LLMs, and for 11 of them the failures trace to poor benchmark artifacts rather than model limitations. Further, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem within the same benchmark. In summary, these benchmarks measure a narrower construct than deployment decisions require. We therefore propose six empirically grounded desiderata and release all audit artifacts so the community can build instructed code-editing benchmarks whose scores reliably reflect real-world editing capability.
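The cross-problem codebase-duplication check described in the abstract could be approximated by fingerprinting each problem's starter files. This is a hedged sketch, not the authors' tooling: `codebase_fingerprint`, `find_shared_codebases`, and the in-memory problem representation are illustrative assumptions.

```python
import hashlib
from collections import defaultdict

def codebase_fingerprint(files):
    """Hash a codebase given as {filename: contents}, after
    whitespace normalization so trivially reformatted copies match."""
    h = hashlib.sha256()
    for name, text in sorted(files.items()):
        h.update(name.encode())
        h.update(" ".join(text.split()).encode())
    return h.hexdigest()

def find_shared_codebases(problems):
    """Group problem ids whose starter codebases fingerprint identically;
    return only groups with more than one problem."""
    groups = defaultdict(list)
    for pid, files in problems.items():
        groups[codebase_fingerprint(files)].append(pid)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Exact-match fingerprinting only catches verbatim or whitespace-level duplicates; a fuller audit might add near-duplicate detection (e.g. token-shingle similarity) to flag lightly edited copies as well.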
Problem

Research questions and friction points this paper is trying to address.

instructed code editing
benchmark evaluation
code generation
large language models
software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

code editing
benchmark audit
large language models
test coverage
empirical evaluation