Consistent or Sensitive? Automated Code Revision Tools Against Semantics-Preserving Perturbations

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the inconsistency of Automated Code Revision (ACR) tools when confronted with semantically equivalent yet syntactically diverse code variants, a critical reliability issue that had not previously been quantified. The authors formally define and systematically measure this "consistency" problem by constructing a large-scale benchmark comprising over 10,000 variants of 2,032 Java methods, spanning nine categories of semantics-preserving perturbations (SPPs). A comprehensive evaluation of five state-of-the-art Transformer-based ACR tools reveals that such perturbations can reduce correct revision rates by up to 45.3%, with failures more pronounced when perturbations occur near the target revision location. Existing mitigation strategies show limited effectiveness, underscoring that ensuring consistent behavior across semantically equivalent inputs remains an open challenge for automated code revision.

📝 Abstract
Automated Code Revision (ACR) tools aim to reduce manual effort by automatically generating code revisions based on reviewer feedback. While ACR tools have shown promising performance on historical data, their real-world utility depends on their ability to handle similar code variants expressing the same issue, a property we define as consistency. However, the probabilistic nature of ACR tools often compromises consistency, which may lead to divergent revisions even for semantically equivalent code variants. In this paper, we investigate the extent to which ACR tools maintain consistency when presented with semantically equivalent code variants. To do so, we first designed nine types of semantics-preserving perturbations (SPPs) and applied them to 2,032 Java methods from real-world GitHub projects, generating over 10K perturbed variants for evaluation. Then we used these perturbations to evaluate the consistency of five state-of-the-art Transformer-based ACR tools. We found that the ACR tools' ability to generate correct revisions can drop by up to 45.3% when presented with semantically equivalent code. The closer the perturbation is to the targeted region, the more likely an ACR tool is to fail to generate the correct revision. We explored potential mitigation strategies that modify the input representation, but found that these attention-guiding heuristics yielded only marginal improvements, thus leaving the solution to this problem as an open research question.
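To make the notion of a semantics-preserving perturbation concrete, here is a minimal hypothetical sketch in Java (the method and perturbation choices are illustrative, not taken from the paper's benchmark). The perturbed variant applies two common SPP categories, variable renaming and dead-code insertion, and computes exactly the same result as the original; a consistent ACR tool should produce equivalent revisions for both.

```java
public class SppDemo {
    // Original method.
    static int sumPositive(int[] values) {
        int total = 0;
        for (int v : values) {
            if (v > 0) total += v;
        }
        return total;
    }

    // Semantically equivalent variant: local variables renamed
    // ('total' -> 'acc', 'v' -> 'x') and a dead statement inserted.
    static int sumPositivePerturbed(int[] values) {
        int acc = 0;
        int unused = values.length; // dead code: assigned but never read
        for (int x : values) {
            if (x > 0) acc += x;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = {3, -1, 4, -1, 5};
        // Both versions yield the same output for any input.
        System.out.println(sumPositive(data));          // 12
        System.out.println(sumPositivePerturbed(data)); // 12
    }
}
```

The paper's finding is that even such trivially equivalent inputs can flip an ACR tool from a correct revision to an incorrect one, especially when the perturbation sits close to the region the reviewer's feedback targets.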
Problem

Research questions and friction points this paper is trying to address.

Automated Code Revision
Consistency
Semantics-Preserving Perturbations
Code Variants
Transformer-based Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Code Revision
Consistency
Semantics-Preserving Perturbations
Transformer-based Models
Code Robustness
Shirin Pirouzkhah
University of Zurich
Souhaila Serbout
University of Zurich
Alberto Bacchelli
Associate Professor, Head of ZEST @ University of Zurich
empirical software engineering · code review