The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic Program Repair (APR) faces stringent practical constraints, here a budget of at most ten patches per defect, which demands highly efficient and precise patch generation. Method: We propose a test-feedback-driven self-iterative patch generation framework and systematically evaluate full-parameter fine-tuning versus LoRA adaptation across three instruction-tuned LLMs (DeepSeekCoder-Instruct, CodeLlama-Instruct, Llama3.1-Instruct) on fine-tuning datasets of 1K, 30K, and 65K samples. Contribution/Results: Our empirical study is the first to demonstrate that fine-tuning on less than 1% of the training data improves the plausible-patch rate by up to 78%. Iterative refinement significantly benefits base models and remains valuable for fine-tuned models on complex defects. We also identify a diminishing-returns inflection point in fine-tuning: excessive adaptation induces overfitting and degrades performance. On HumanEval-Java and Defects4J, our approach substantially improves repair rates, challenging the prevailing claim that full-parameter fine-tuning is ineffective for APR.
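The test-feedback loop described above can be sketched as follows. This is a minimal illustration of the idea (sample a few candidate patches, run the tests, feed failures back as context for the next round, never exceed the 10-patch budget); the function and type names (`iterative_repair`, `RepairResult`, `generate_patch`, `run_tests`) are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RepairResult:
    plausible: bool      # did any patch pass the test suite?
    patches_used: int    # patches consumed out of the budget
    history: list = field(default_factory=list)

def iterative_repair(bug, generate_patch, run_tests,
                     budget=10, outputs_per_round=2):
    """Balance multi-output generation and iterative refinement under a
    fixed budget of `budget` total patches for one bug."""
    feedback = None
    used = 0
    history = []
    while used < budget:
        # Multi-output generation: sample several candidates per round,
        # clipped so we never exceed the overall budget.
        n = min(outputs_per_round, budget - used)
        candidates = [generate_patch(bug, feedback) for _ in range(n)]
        used += n
        for patch in candidates:
            passed, failure_log = run_tests(bug, patch)
            history.append((patch, passed))
            if passed:
                return RepairResult(True, used, history)
            # Iterative refinement: failing test output becomes the
            # feedback for the next generation round.
            feedback = failure_log
    return RepairResult(False, used, history)
```

With `outputs_per_round=2` this yields at most five feedback-driven rounds per bug; setting `outputs_per_round=budget` degenerates to one-shot multi-output generation, which is the baseline the paper compares against.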

📝 Abstract
Automatic program repair (APR) aims to reduce the manual effort required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes into the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities have enabled LLMs to refine patches over multiple rounds guided by feedback. However, the literature often focuses on many iterations while disregarding how many outputs are generated per round. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs (DeepSeekCoder-Instruct, CodeLlama-Instruct, Llama3.1-Instruct) to the APR task. We further fine-tune each model on an APR dataset at three sizes (1K, 30K, 65K) and with two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains from Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing returns, likely due to overfitting. Moreover, we show that base models greatly benefit from creating patches iteratively rather than generating them all at once, and this benefit of iterative strategies becomes more pronounced on complex benchmarks. Even fine-tuned models, while benefiting less from iterations, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement.
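To make the Full Fine-Tuning vs. LoRA comparison concrete, a back-of-the-envelope calculation shows why LoRA touches only a small fraction of a layer's parameters: instead of updating the full weight matrix, it learns a low-rank factorized update B·A. The dimensions below (a 4096-wide layer, rank 16) are illustrative assumptions, not figures from the paper.

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning trains the entire d_out x d_in weight update."""
    return d_out * d_in

def lora_update_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA replaces the dense update dW with B @ A, where
    B is d_out x r and A is r x d_in, so only these two thin
    matrices are trained."""
    return d_out * r + r * d_in

d = 4096                                   # hidden size of a typical 7B-class layer
full = full_update_params(d, d)            # 16,777,216 trainable values
lora = lora_update_params(d, d, r=16)      # 131,072 trainable values
ratio = lora / full                        # < 1% of the full update
```

This parameter gap is what makes LoRA attractive under small fine-tuning datasets: with fewer trainable parameters, the 1K-sample setting studied in the paper is less prone to the overfitting observed at larger adaptation scales.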
Problem

Research questions and friction points this paper is trying to address.

Balancing multiple patch outputs and iterative refinement in APR
Evaluating instruction-tuned LLMs for efficient program repair
Assessing fine-tuning impact on patch generation effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balances multi-output generation and iterative refinement
Uses instruction-tuned LLMs for program repair
Fine-tunes models with limited dataset effectively