🤖 AI Summary
This work addresses the lack of systematic evaluation benchmarks for code-diff understanding. We propose Diff-XYZ, the first compact, standardized benchmark built on real-world commits that covers three core tasks: diff application, anti-application, and diff generation. Constructed from code-change triples in CommitPackFT, Diff-XYZ supports multiple diff representations, including the unified diff and search-replace formats, and enables the first systematic evaluation of how representation choice interacts with task type and model scale. Experiments show that large language models perform markedly better on diff generation with the search-replace format, whereas smaller models and analytical tasks (apply and anti-apply) benefit more from the conventional unified diff; the optimal diff representation therefore depends jointly on the task and the model scale. Diff-XYZ provides empirical evidence and practical guidance for diff representation design and model adaptation in program synthesis and software engineering.
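For intuition, the same one-line edit is shown below in both representations. The unified diff is the standard `git`-style format; the search-replace markers are one common convention for that family of formats, and the exact syntax used in the benchmark may differ.

```diff
--- a/greet.py
+++ b/greet.py
@@ -1,2 +1,2 @@
 def greet(name):
-    print('Hello, ' + name)
+    print(f'Hello, {name}!')
```

```text
<<<<<<< SEARCH
    print('Hello, ' + name)
=======
    print(f'Hello, {name}!')
>>>>>>> REPLACE
```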
📝 Abstract
Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code $+$ diff $\rightarrow$ new code), anti-apply (new code $-$ diff $\rightarrow$ old code), and diff generation (new code $-$ old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark for a focused empirical study of the unified diff format and a cross-format comparison of different diff representations. Our findings show that the right format depends on the use case and model size: for example, the search-replace format works well for larger models in the diff generation scenario, but is poorly suited to diff analysis and to smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs and can aid the future development of diff formats and code-editing models. The dataset is published on the HuggingFace Hub: https://huggingface.co/datasets/JetBrains-Research/diff-xyz.
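As a minimal sketch of the diff generation task (new code $-$ old code $\rightarrow$ diff), the snippet below renders a change between a hypothetical old/new code pair as a unified diff using Python's `difflib`; the example pair and variable names are illustrative assumptions, not the dataset's actual schema.

```python
import difflib

# Illustrative old/new code pair; these values are assumptions for the sketch,
# not an actual instance from the Diff-XYZ dataset.
old_code = "def greet(name):\n    print('Hello, ' + name)\n"
new_code = "def greet(name):\n    print(f'Hello, {name}!')\n"

# Diff generation: new code - old code -> diff, rendered as a unified diff.
# The apply and anti-apply tasks consume this diff in the forward and
# reverse directions, respectively.
diff = "".join(
    difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile="a/greet.py",
        tofile="b/greet.py",
    )
)
print(diff)

# The benchmark itself can be fetched from the HuggingFace Hub, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("JetBrains-Research/diff-xyz")  # split/field names not verified here
```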