The Impact of Fine-tuning Large Language Models on Automated Program Repair

📅 2025-07-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the impact of parameter-efficient fine-tuning (PEFT) on large language models (LLMs) for automated program repair (APR). Motivated by the limitations of full fine-tuning (overfitting, poor generalization, and high computational cost), the work evaluates LoRA, IA3, and full fine-tuning across six code-specific LLMs (CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2) on three APR benchmarks: QuixBugs, Defects4J, and HumanEval-Java. Results show that PEFT substantially improves model generalization, achieving comparable or better repair accuracy at significantly lower computational cost, especially under few-shot and cross-project settings. The authors position this as the first empirical study to validate the broad effectiveness of PEFT across multiple LLMs and diverse APR benchmarks within a unified experimental framework. The findings establish a reproducible, resource-efficient methodology for APR and offer practical guidance for deploying LLMs in constrained environments.

📝 Abstract
Automated Program Repair (APR) uses various tools and techniques to help developers produce functional, error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components of APR tool chains because of their performance and flexibility. However, training such models requires significant resources. Fine-tuning techniques adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational cost than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation covers three popular APR benchmarks (QuixBugs, Defects4J, and HumanEval-Java) and six LLMs with varying parameter sizes (CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning decreases the benchmark performance of several models due to differing data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict the number of trainable parameters and achieve better results.

Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE.
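The PEFT methods evaluated here (LoRA and IA3) differ from full fine-tuning mainly in how few parameters they train: LoRA adds a trainable low-rank update to a frozen weight matrix, while IA3 learns only a per-dimension rescaling vector. A minimal NumPy sketch of both update rules (the dimensions and names below are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

d_in, d_out, r = 768, 768, 8        # illustrative hidden sizes and LoRA rank

# Frozen pre-trained projection weight (never updated during PEFT)
W = np.random.randn(d_in, d_out) * 0.02

# LoRA: trainable low-rank factors A (d_in x r) and B (r x d_out)
A = np.random.randn(d_in, r) * 0.01
B = np.zeros((r, d_out))            # B starts at zero, so the update is a no-op initially
alpha = 16.0                        # LoRA scaling hyperparameter

def lora_forward(x):
    """h = x W + (alpha/r) * x A B  -- only A and B receive gradients."""
    return x @ W + (alpha / r) * (x @ A) @ B

# IA3: a single trainable rescaling vector per projection, initialized to ones
ell = np.ones(d_out)

def ia3_forward(x):
    """h = (x W) * ell  -- only the vector ell is trained."""
    return (x @ W) * ell

x = np.random.randn(4, d_in)
assert lora_forward(x).shape == (4, d_out)
assert ia3_forward(x).shape == (4, d_out)

full_params = W.size                # 589,824 for a single 768x768 matrix
lora_params = A.size + B.size       # r * (d_in + d_out) = 12,288
ia3_params = ell.size               # 768
print(f"full FT: {full_params:,}  LoRA: {lora_params:,}  IA3: {ia3_params:,}")
```

With the zero-initialized `B` and ones-initialized `ell`, both adapted layers start out computing exactly the frozen layer's output, which is one reason these methods are less prone to the catastrophic drift the abstract attributes to full fine-tuning.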
Problem

Research questions and friction points this paper is trying to address.

Impact of fine-tuning LLMs on Automated Program Repair performance
Comparing no, full, and parameter-efficient fine-tuning for APR
Evaluating six LLMs on three APR benchmarks with different tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning pre-trained LLMs for APR tasks
Parameter-efficient fine-tuning with LoRA and IA3
Evaluating LLMs on multiple APR benchmarks