Empirical Evaluation of Large Language Models in Automated Program Repair

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the automatic program repair (APR) capabilities of four open-source large language models—CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder—across programming languages (Java, C, C++, Python) and defect types (industrial vs. algorithmic). Using six benchmark datasets, four prompting strategies, and multi-dimensional patch evaluation, we conduct a large-scale empirical analysis over 600,000 generated patches. Our findings reveal: (1) code-specialized models significantly outperform general-purpose models of comparable scale; (2) repair performance exhibits nonlinear scaling with model size, and optimal patches are disproportionately concentrated in early-generation outputs; and (3) prompt engineering improves repair rates by over 40%. This work advances beyond traditional APR evaluations—typically limited to small models and single-language settings—by uncovering critical principles governing model specialization, prompt sensitivity, and solution-space distribution in large-language-model-based APR.

📝 Abstract
The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct a comprehensive empirical study of four open-source LLMs, CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder, spanning 7B to 33B parameters and diverse architectures and purposes. We evaluate them across two bug scenarios (enterprise-grade and algorithmic), three languages (Java, C/C++, Python), and four prompting strategies, analyzing over 600K generated patches on six benchmarks. Key findings include: (1) model specialization (e.g., CodeLlama) can outperform larger general-purpose models (e.g., LLaMA); (2) repair performance does not scale linearly with model size; (3) correct patches often appear early in generation; and (4) prompts significantly affect results. These insights offer practical guidance for designing effective and efficient LLM-based APR systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating modern LLMs for automated program repair across diverse languages.
Assessing the impact of model size and specialization on bug-fixing performance.
Analyzing prompt strategies and patch generation efficiency in APR systems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates four open-source LLMs for program repair
Tests diverse languages, scenarios, and prompting strategies
Analyzes over 600K patches across six benchmarks
Jiajun Sun
College of Intelligence and Computing, Tianjin University, China

Fengjie Li
Tianjin University
Software Engineering · Program Repair

Xinzhu Qi
School of Information and Software Engineering, University of Electronic Science and Technology of China, China

Hongyu Zhang
Chongqing University
Software Engineering · Mining Software Repositories · Data-driven Software Engineering · Software Analytics

Jiajun Jiang
College of Intelligence and Computing, Tianjin University, China