Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited generalization capability of current large language model (LLM)-based automated program repair (APR) approaches. We conduct the first comprehensive empirical study spanning multiple programming languages (Java, JavaScript, Python, PHP), diverse LLMs (13 models, including Llama 3.3, Qwen 2.5 Coder, GPT-4o, and Claude 3.7), and realistic fault localization (FL) errors—using four major benchmarks, including Defects4J. Key findings reveal strong language dependence in model performance, poor cross-language generalization for individual models, and a substantial drop in repair accuracy under realistic FL noise—exposing critical biases in prior evaluations. To mitigate these limitations, we propose a novel “multi-model expert committee” paradigm that orchestrates complementary LLMs, significantly increasing the number of unique correct repairs. Our study establishes a new standard for fair, realistic APR evaluation and informs the design of practical, robust repair frameworks.

📝 Abstract
Automated Program Repair (APR) proposes bug fixes to aid developers in maintaining software. The state of the art in this domain focuses on using LLMs, leveraging their strong capabilities to comprehend specifications in natural language and to generate program code. Recent works have shown that LLMs can be used to generate repairs. However, despite the APR community's research achievements and several industry deployments in the last decade, APR still lacks the capability to generalize broadly. In this work, we present an intensive empirical evaluation of LLMs for generating patches. We evaluate a diverse set of 13 recent models, including open ones (e.g., Llama 3.3, Qwen 2.5 Coder, and DeepSeek R1 (dist.)) and closed ones (e.g., o3-mini, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash). In particular, we explore language-agnostic repairs by utilizing benchmarks for Java (e.g., Defects4J), JavaScript (e.g., BugsJS), Python (e.g., BugsInPy), and PHP (e.g., BugsPHP). Besides generalization across different languages and levels of patch complexity, we also investigate the effects of fault localization (FL) as a preprocessing step and compare the progress for open vs. closed models. Our evaluation represents a snapshot of the current repair capabilities of the latest LLMs. Key results include: (1) Different LLMs tend to perform best for different languages, which makes it hard to develop cross-platform repair techniques with a single LLM. (2) Combinations of models add value with respect to uniquely fixed bugs, so a committee of expert models should be considered. (3) Under realistic assumptions of imperfect FL, we observe significant drops in accuracy from the usual practice of using perfect FL. Our findings and insights will help both researchers and practitioners develop reliable and generalizable APR techniques and evaluate them in realistic and fair environments.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for generalizable automated program repair across languages
Assessing impact of fault localization on repair accuracy in APR
Comparing performance of open vs closed LLMs for bug fixing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes diverse LLMs for program repair
Explores language-agnostic repair benchmarks
Investigates fault localization preprocessing impact