Evaluating the Generalizability of LLMs in Automated Program Repair

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing APR benchmarks (e.g., Defects4J) overestimate the cross-dataset generalization capability of large language models (LLMs), leading to inflated performance estimates. Method: We systematically evaluate 11 state-of-the-art LLMs (including Codex, CodeLlama, and DeepSeek-Coder) on APR, introducing DEFECTS4J-TRANS, a new benchmark built via semantics-preserving data transformations to rigorously assess generalization. Contribution/Results: On DEFECTS4J-TRANS, the models' correct/plausible patch counts drop by 49.48%/42.90% on average, exposing severe generalization deficits. Prompt engineering that incorporates error context, test cases, or AST information yields substantial gains (up to +136.67%/+121.82% more correct/plausible patches) but still falls far short of the models' original performance. This work provides the first empirical evidence of a fundamental generalization bottleneck of LLMs in APR and contributes both a robust new benchmark and methodological guidance for future APR research.
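The transformation tooling itself is not shown on this page, but a minimal sketch of the kind of semantics-preserving rewrite that could yield a DEFECTS4J-TRANS variant is given below. The methods and identifiers here are hypothetical illustrations, not taken from the paper's transformation rules.

```java
// Hypothetical illustration of semantics-preserving transformations of the kind
// used to derive DEFECTS4J-TRANS from Defects4J: identifier renaming plus
// loop-structure rewriting. Behavior is identical; only the surface form changes.
public class TransformExample {

    // Original (Defects4J-style) method.
    static int sumPositive(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > 0) total += values[i];
        }
        return total;
    }

    // Transformed (DEFECTS4J-TRANS-style) method: variables renamed and the
    // for loop rewritten as an equivalent while loop.
    static int aggregateAboveZero(int[] arr) {
        int acc = 0;
        int idx = 0;
        while (idx < arr.length) {
            if (arr[idx] > 0) acc += arr[idx];
            idx++;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] sample = {3, -1, 4, -1, 5};
        // Both versions must agree on every input for the transformation
        // to count as semantics-preserving.
        System.out.println(sumPositive(sample) + " == " + aggregateAboveZero(sample));
    }
}
```

The point of such transformations is that repair knowledge tied to the original surface form (names, loop shape) no longer applies, which separates genuine generalization from memorization of the benchmark.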

📝 Abstract
LLM-based automated program repair methods have attracted significant attention for their state-of-the-art performance. However, they have primarily been evaluated on a few well-known datasets like Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived from transforming Defects4J while maintaining the original semantics. Results from experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks, with the average number of correct and plausible patches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS. Further investigation into incorporating additional repair-relevant information in repair prompts reveals that, although this information significantly enhances the LLMs' capabilities (increasing the number of correct and plausible patches by up to 136.67% and 121.82%, respectively), performance still falls short of their original results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs' repair capabilities. Based on our study, we also offer several recommendations for future research.
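As background on the two metrics above: by standard APR convention, a patch is called plausible if it passes the project's full test suite, and correct if it is additionally judged semantically equivalent to the developer's fix, usually by manual inspection. A minimal sketch of that classification logic, with hypothetical helper names standing in for real test execution and semantic review:

```java
// Hypothetical sketch of the plausible-vs-correct patch classification used in
// APR evaluations. runTestSuite() and isSemanticallyEquivalent() are stub
// placeholders, not the paper's implementation.
public class PatchClassifier {

    enum Verdict { INCORRECT, PLAUSIBLE, CORRECT }

    // Assumption: returns true iff the patched program passes every test.
    static boolean runTestSuite(String patchedProgram) { /* ... */ return true; }

    // Assumption: returns true iff the patch matches the developer fix in
    // behavior; in practice this step is typically a manual check.
    static boolean isSemanticallyEquivalent(String patch, String developerFix) { /* ... */ return true; }

    static Verdict classify(String patch, String developerFix) {
        if (!runTestSuite(patch)) return Verdict.INCORRECT;
        // Plausible: survives all tests; correct additionally matches the
        // intended semantics of the developer-written fix.
        return isSemanticallyEquivalent(patch, developerFix)
                ? Verdict.CORRECT : Verdict.PLAUSIBLE;
    }

    public static void main(String[] args) {
        System.out.println(classify("patched code", "developer fix"));
    }
}
```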
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM generalizability in automated program repair tasks.
Evaluating LLM performance on new datasets like DEFECTS4J-TRANS.
Investigating prompt engineering's impact on LLM repair capabilities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 11 top-performing LLMs on the new DEFECTS4J-TRANS dataset
Incorporated repair-relevant information (error context, test cases, AST) into prompts; see the sketch after this list
Found prompt engineering alone insufficient to restore original repair performance
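The paper's exact prompt templates are not reproduced on this page; the following is a minimal sketch, under the assumption that the repair-relevant information is simply concatenated into the prompt, of how error context, a failing test, and AST information could be attached to a buggy function:

```java
// Hypothetical sketch of building an APR prompt that bundles the buggy code
// with repair-relevant context (error message, failing test, AST summary),
// the three kinds of information the study adds to repair prompts.
public class RepairPromptBuilder {

    static String buildPrompt(String buggyFunction, String errorContext,
                              String failingTest, String astSummary) {
        return String.join("\n",
                "You are an automated program repair assistant.",
                "Fix the bug in the following Java function.",
                "",
                "### Buggy function",
                buggyFunction,
                "",
                "### Error context",   // e.g., compiler or runtime error message
                errorContext,
                "",
                "### Failing test",    // a test the correct patch must pass
                failingTest,
                "",
                "### AST summary",     // structural hints about the buggy region
                astSummary,
                "",
                "Return only the fixed function.");
    }

    public static void main(String[] args) {
        String prompt = buildPrompt(
                "int div(int a, int b) { return a + b; }",          // buggy body
                "AssertionError: expected 2 but was 12 in divTest",
                "assertEquals(2, div(8, 4));",
                "MethodDeclaration > ReturnStmt > BinaryExpr(+)");
        System.out.println(prompt);
    }
}
```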
Fengjie Li
Tianjin University
Software Engineering · Program Repair
Jiajun Jiang
College of Intelligence and Computing, Tianjin University, Tianjin, China
Jiajun Sun
College of Intelligence and Computing, Tianjin University, Tianjin, China
Hongyu Zhang
Chongqing University
Software Engineering · Mining Software Repositories · Data-driven Software Engineering · Software Analytics