HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

📅 2026-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of robustness evaluation for large language models (LLMs) used in automated program repair: existing benchmarks test a single canonical form of buggy code and ignore the syntactic variation found in real-world software. The study applies eight semantics-preserving code transformations to HumanEval-Java-Bug to construct HEJ-Robust, a benchmark of 1,450 transformed instances. Evaluating five fine-tuned LLMs on it shows that repair performance drops by over 50% under several of the transformations, even though the perturbations are minor and leave program behavior unchanged. These findings point to a clear robustness gap and position HEJ-Robust as a basis for evaluating and improving future program repair approaches.
📝 Abstract
Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
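To make the idea of a semantics-preserving transformation concrete, here is a minimal Java sketch. The abstract does not list the eight transformations, so the two used here (renaming local variables and rewriting a `for` loop as a `while` loop) are illustrative assumptions, and the class and method names are hypothetical. Both methods contain the same off-by-one bug and return the same value for every input, so a robust repair model would be expected to fix either form.

```java
// Illustrative sketch only: the transformations shown are assumed examples,
// not necessarily among the eight used to build HEJ-Robust.
class TransformationSketch {

    // Canonical buggy form: intended to sum 1..n, but the loop stops one short.
    static int sumToOriginal(int n) {
        int total = 0;
        for (int i = 1; i < n; i++) { // bug: condition should be i <= n
            total += i;
        }
        return total;
    }

    // Transformed variant: locals renamed, for loop rewritten as a while loop.
    // The bug and the observable behavior are unchanged.
    static int sumToVariant(int n) {
        int acc = 0;
        int k = 1;
        while (k < n) { // same bug, different surface syntax
            acc += k;
            k++;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Check that the variant preserves the (buggy) semantics on a small range.
        for (int n = 0; n <= 10; n++) {
            if (sumToOriginal(n) != sumToVariant(n)) {
                throw new AssertionError("variants disagree at n = " + n);
            }
        }
        System.out.println("both forms agree for n = 0..10");
    }
}
```

A robustness benchmark of this kind pairs each original bug with such variants and compares repair success on the two forms; any gap is attributable to surface syntax rather than to the bug itself.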
Problem

Research questions and friction points this paper is trying to address.

automated program repair
Large Language Models
robustness
code transformations
syntactic variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

robustness benchmark
automated program repair
Large Language Models
semantics-preserving transformation
code robustness