Excision Score: Evaluating Edits with Surgical Precision

📅 2025-10-24
🤖 AI Summary
Existing evaluation metrics such as BLEU rely heavily on surface-level n-gram overlap and fail to capture semantic differences between document revisions, especially in code, diverging substantially from human judgment. To address this, the paper proposes Excision Score (ES), an edit-oriented static measure that isolates semantic differences by excising the subsequences shared with the original document from both the ground-truth and predicted revisions, then comparing only the remaining divergent regions, correctly handling moves, insertions, and deletions. ES is the first measure to formalize five adequacy criteria for revision similarity, and it uses an approximation to reduce the standard cubic Longest Common Subsequence (LCS) computation to quadratic time. On HumanEvalFix, ES achieves a 12% higher Pearson correlation with test execution than SARI and improves by more than 21% over BLEU and related measures; under perturbations that increase shared context, its advantage widens to 20-30%, demonstrating markedly better robustness.

📝 Abstract
Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria, because their scores are dominated by the shared content. They report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which computes longest common subsequence (LCS) to remove content shared by an existing document with the ground truth and predicted revisions, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to speed the standard cubic LCS computation to quadratic. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by >21% over standard measures like BLEU. The key criterion is invariance to shared context; when we perturb HumanEvalFix with increased shared context, ES' improvement over SARI increases to 20% and >30% over standard measures. ES also handles other corner cases that other measures do not, such as correctly aligning moved code blocks, and appropriately rewarding matching insertions or deletions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating document edits with precision beyond shared content
Addressing limitations of existing similarity measures like BLEU
Proposing Excision Score to focus on divergent regions only
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses longest common subsequence to remove shared content
Compares only divergent regions after excision process
Employs approximation for quadratic time complexity
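The excise-then-compare idea in the bullets above can be sketched in a few lines of Python. This is a toy illustration, not the paper's algorithm: it approximates shared-subsequence removal with `difflib.SequenceMatcher` rather than the paper's quadratic LCS approximation, and the function names are invented for the example.

```python
from difflib import SequenceMatcher

def excise_shared(original, revision):
    """Keep only the tokens of `revision` that are NOT part of a
    matching block shared with `original` (difflib approximation
    of LCS-based excision)."""
    sm = SequenceMatcher(a=original, b=revision, autojunk=False)
    residual, pos = [], 0
    for block in sm.get_matching_blocks():
        residual.extend(revision[pos:block.b])  # tokens before the shared block
        pos = block.b + block.size              # skip over the shared block
    return residual

def excision_score_sketch(original, ground_truth, prediction):
    """Toy revision-similarity score: excise content shared with the
    original from both revisions, then compare only the residuals."""
    gt_res = excise_shared(original, ground_truth)
    pred_res = excise_shared(original, prediction)
    if not gt_res and not pred_res:
        return 1.0  # both revisions left the original unchanged
    return SequenceMatcher(a=gt_res, b=pred_res, autojunk=False).ratio()

# Example: a one-token bug fix, where full-sequence overlap would
# dominate an n-gram measure like BLEU.
original     = "def add(a, b): return a - b".split()
ground_truth = "def add(a, b): return a + b".split()
prediction   = "def add(a, b): return a + b".split()
print(excision_score_sketch(original, ground_truth, prediction))  # → 1.0
```

Note how a wrong one-token fix (e.g. `a * b`) would score 0.0 here, because only the divergent residuals (`+` vs `*`) are compared, whereas whole-sequence overlap measures would still report the two revisions as highly similar.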
Nikolai Gruzinov
JetBrains Research
Ksenia Sycheva
JetBrains Research
Earl T. Barr
Professor, University College London
software engineering, computer security, programming languages
Alex Bezzubov
JetBrains Research