🤖 AI Summary
This work addresses the lack of robustness evaluation against realistic syntactic variations in current large language models (LLMs) for automated program repair. The study applies eight semantics-preserving code transformations to HumanEval-Java-Bug to construct HEJ-Robust, a benchmark of 1,450 transformed instances. Evaluating five fine-tuned LLMs on this benchmark shows that repair performance drops by over 50% under several of the transformations, even though each perturbation is minor and leaves program behavior unchanged. These findings point to a clear robustness gap in current LLM-based repair models and position HEJ-Robust as a basis for evaluating future program repair approaches.
📝 Abstract
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
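To make "semantics-preserving transformation" concrete, here is a minimal sketch of one such rewrite. The abstract does not list the eight transformations, so local-variable renaming is used only as a common example of this class and is an assumption, not necessarily one of the eight. The Java snippet, the function `rename_identifier`, and the names `result` and `acc` are all hypothetical and are not taken from HEJ-Robust.

```python
import re

# Hypothetical buggy Java method (not from HEJ-Robust). The bug is the
# off-by-one loop bound `i <= n`, which the transformation must leave intact.
BUGGY_JAVA = """\
public static int sumBelow(int n) {
    int result = 0;
    for (int i = 0; i <= n; i++) {
        result += i;
    }
    return result;
}
"""


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename every whole-word occurrence of `old` to `new`.

    Assumption: `old` does not occur inside string literals or comments.
    A real tool would work on a parsed syntax tree instead of a regex.
    """
    # Refuse to rename if `new` is already in use, since that could
    # change which variable an expression refers to.
    if re.search(rf"\b{re.escape(new)}\b", source):
        raise ValueError(f"identifier {new!r} already appears in the source")
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


variant = rename_identifier(BUGGY_JAVA, "result", "acc")
print(variant)
```

The variant computes the same values as the original and still contains the same bug, so a repair model that is robust to surface syntax should fix both forms equally well.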