Empirical Evaluation of Large Language Models in Automated Program Repair

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the automatic program repair (APR) capabilities of four open-source large language models—CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder—across programming languages (Java, C, C++, Python) and defect types (industrial vs. algorithmic). Using six benchmark datasets, four prompting strategies, and multi-dimensional patch evaluation, we conduct a large-scale empirical analysis over 600,000 generated patches. Our findings reveal: (1) code-specialized models significantly outperform general-purpose models of comparable scale; (2) repair performance exhibits nonlinear scaling with model size, and optimal patches are disproportionately concentrated in early-generation outputs; and (3) prompt engineering improves repair rates by over 40%. This work advances beyond traditional APR evaluations—typically limited to small models and single-language settings—by uncovering critical principles governing model specialization, prompt sensitivity, and solution-space distribution in large-language-model-based APR.

📝 Abstract
The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct a comprehensive empirical study of four open-source LLMs, CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder, spanning 7B to 33B parameters and diverse architectures and purposes. We evaluate them across two bug scenarios (enterprise-grade and algorithmic), three languages (Java, C/C++, Python), and four prompting strategies, analyzing over 600K generated patches on six benchmarks. Key findings include: (1) model specialization (e.g., CodeLlama) can outperform larger general-purpose models (e.g., LLaMA); (2) repair performance does not scale linearly with model size; (3) correct patches often appear early in generation; and (4) prompts significantly affect results. These insights offer practical guidance for designing effective and efficient LLM-based APR systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating modern LLMs for automated program repair across diverse languages.
Assessing the impact of model size and specialization on bug-fixing performance.
Analyzing prompt strategies and patch generation efficiency in APR systems.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates four open-source LLMs for program repair
Tests diverse languages, scenarios, and prompting strategies
Analyzes over 600K patches across six benchmarks
Jiajun Sun
College of Intelligence and Computing, Tianjin University, China

Fengjie Li
Tianjin University
Software Engineering · Program Repair

Xinzhu Qi
School of Information and Software Engineering, University of Electronic Science and Technology of China, China

Hongyu Zhang
Chongqing University
Software Engineering · Mining Software Repositories · Data-driven Software Engineering · Software Analytics

Jiajun Jiang
College of Intelligence and Computing, Tianjin University, China