🤖 AI Summary
This study systematically evaluates the automatic program repair (APR) capabilities of four open-source large language models (CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder) across four programming languages (Java, C, C++, Python) and two defect types (industrial vs. algorithmic). Using six benchmark datasets, four prompting strategies, and multi-dimensional patch evaluation, we conduct a large-scale empirical analysis of over 600,000 generated patches. Our findings reveal that: (1) code-specialized models significantly outperform general-purpose models of comparable scale; (2) repair performance scales nonlinearly with model size, and optimal patches are disproportionately concentrated in early-generation outputs; and (3) prompt engineering improves repair rates by over 40%. This work advances beyond traditional APR evaluations, which are typically limited to small models and single-language settings, by uncovering critical principles governing model specialization, prompt sensitivity, and solution-space distribution in LLM-based APR.
📝 Abstract
The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct a comprehensive empirical study of four open-source LLMs (CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder) spanning 7B to 33B parameters and covering diverse architectures and design goals. We evaluate them across two bug scenarios (enterprise-grade and algorithmic), three languages (Java, C/C++, Python), and four prompting strategies, analyzing over 600K generated patches on six benchmarks. Key findings include: (1) model specialization (e.g., CodeLlama) can outperform larger general-purpose models (e.g., LLaMA); (2) repair performance does not scale linearly with model size; (3) correct patches often appear early in generation; and (4) prompt design significantly affects results. These insights offer practical guidance for designing effective and efficient LLM-based APR systems.