🤖 AI Summary
During software evolution, developers often need to map a specific code region (e.g., the condition of an if-statement or a function definition) from one commit to another. Existing diff tools such as git diff show all changes made to a file rather than tracking a chosen region, and specialized approaches are tied to particular programming languages or code element types. This paper proposes CodeMapper, a method for cross-commit code region mapping that is independent of programming language and code element type. It first computes candidate regions by analyzing diffs, detecting code movements, and searching for specific code fragments, and then selects the most likely target region by computing similarities. Evaluated on four datasets, including two new hand-annotated datasets covering ten popular programming languages, the method identifies the expected target region in 71.0%–94.5% of cases, outperforming the best available baselines by 1.5–58.8 percentage points.
📝 Abstract
During software evolution, developers commonly face the problem of mapping a specific code region from one commit to another. For example, they may want to determine how the condition of an if-statement, a specific line in a configuration file, or the definition of a function changes. We call this the code mapping problem. Existing techniques, such as git diff, address this problem only partially because they show all changes made to a file instead of focusing on a code region of the developer's choice. Other techniques focus on specific code elements and programming languages (e.g., methods in Java), limiting their applicability. This paper introduces CodeMapper, an approach to address the code mapping problem in a way that is independent of specific program elements and programming languages. Given a code region in one commit, CodeMapper finds the corresponding region in another commit. The approach consists of two phases: (i) computing candidate regions by analyzing diffs, detecting code movements, and searching for specific code fragments, and (ii) selecting the most likely target region by calculating similarities. Our evaluation applies CodeMapper to four datasets, including two new hand-annotated datasets containing code region pairs in ten popular programming languages. CodeMapper correctly identifies the expected target region in 71.0%–94.5% of all cases, improving over the best available baselines by 1.5–58.8 percentage points.
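The two-phase idea described in the abstract can be illustrated with a small sketch. This is not CodeMapper's implementation: it is a simplified, line-level approximation built on Python's standard `difflib`, and the function name `map_region`, the line-range representation of a region, and the use of `SequenceMatcher.ratio()` as the similarity measure are all assumptions made for illustration.

```python
import difflib


def map_region(old_lines, new_lines, start, end):
    """Map old_lines[start:end] to a (start, end) slice of new_lines.

    Hypothetical helper: returns None when no candidate region is found.
    """
    if not 0 <= start < end <= len(old_lines):
        raise ValueError("region must be a non-empty slice of old_lines")
    source = old_lines[start:end]
    size = end - start
    candidates = set()

    # Phase (i), part 1: project the region through a line-level diff.
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    lo = hi = None
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if i2 <= start or i1 >= end:
            continue  # this opcode does not overlap the source region
        if tag == "equal":
            # Unchanged lines keep their offset inside the matching block.
            b_lo = j1 + max(start, i1) - i1
            b_hi = j1 + min(end, i2) - i1
        else:
            # Changed lines map to the whole replacement range.
            b_lo, b_hi = j1, j2
        lo = b_lo if lo is None else min(lo, b_lo)
        hi = b_hi if hi is None else max(hi, b_hi)
    if lo is not None and hi > lo:
        candidates.add((lo, hi))

    # Phase (i), part 2: search for the exact fragment elsewhere,
    # a crude stand-in for detecting moved code.
    for j in range(len(new_lines) - size + 1):
        if new_lines[j:j + size] == source:
            candidates.add((j, j + size))

    # Phase (ii): select the candidate most similar to the source region.
    def similarity(candidate):
        text = "\n".join(new_lines[candidate[0]:candidate[1]])
        return difflib.SequenceMatcher(
            a="\n".join(source), b=text, autojunk=False
        ).ratio()

    return max(candidates, key=similarity, default=None)


old = ["def f(x):", "    if x > 0:", "        return x", "    return 0"]
new = ["import os", "", "def f(x):", "    if x >= 0:",
       "        return x", "    return 0"]
edited = map_region(old, new, 1, 2)   # region edited in place and shifted
moved = map_region(["a = 1", "b = 2", "c = 3"],
                   ["c = 3", "a = 1", "b = 2"], 2, 3)  # region moved up
```

In this sketch the diff projection handles regions that were edited in place, while the exact-fragment search recovers regions that moved. Working at line granularity is a simplification, since the regions the abstract mentions (such as the condition of an if-statement) can be smaller than a line.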