Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) still suffer from test overfitting in automated program repair—i.e., generating patches that pass visible (training) tests but fail on unseen, hidden tests. We conduct the first systematic evaluation of LLMs’ generalization capability at the repository level using the SWE-bench benchmark, introducing a controlled visible/hidden test split framework to quantitatively measure overfitting severity. Our experiments reveal that, despite substantial improvements in overall repair accuracy, test overfitting remains pervasive: for several models and tasks, pass rates on hidden tests drop by over 40% relative to visible tests. These findings expose a critical generalization deficiency in current LLM-based repair approaches. Moreover, we establish a reproducible, empirically grounded evaluation paradigm—comprising standardized test partitioning, rigorous metrics, and benchmarked baselines—to guide future efforts toward more robust and trustworthy automated repair systems.

📝 Abstract
Automated program repair has been shown to be susceptible to generating repaired code that passes on seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, has been identified and studied before the rise of large language models. We experimentally study how much test overfitting is still a problem today, using repository-level SWE-bench tasks.
Problem

Research questions and friction points this paper is trying to address.

Investigating test overfitting in automated program repair by LLMs
Analyzing repaired code passing seen tests but failing hidden tests
Evaluating test overfitting severity using SWE-bench repository tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic, repository-level evaluation of LLM test overfitting on SWE-bench
Controlled visible/hidden test split framework to quantify overfitting severity
Reproducible evaluation paradigm with standardized test partitioning, metrics, and baselines
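The core measurement is simple: compare a patch set's pass rate on the visible tests against its pass rate on the held-out hidden tests. The sketch below illustrates that metric only; the function and variable names are assumptions for illustration, not the paper's actual implementation.

```python
def pass_rate(results):
    """Fraction of candidate patches whose tests all pass.

    `results` is a list of booleans, one per patch (True = all tests passed).
    """
    return sum(results) / len(results) if results else 0.0

def overfitting_drop(visible_pass, hidden_pass):
    """Relative drop in pass rate when moving from visible to hidden tests."""
    if visible_pass == 0:
        return 0.0
    return (visible_pass - hidden_pass) / visible_pass

# Hypothetical example: 10 candidate patches.
visible = [True] * 8 + [False] * 2   # 80% pass the visible tests
hidden  = [True] * 4 + [False] * 6   # only 40% also pass the hidden tests

v, h = pass_rate(visible), pass_rate(hidden)
print(f"visible: {v:.0%}, hidden: {h:.0%}, "
      f"relative drop: {overfitting_drop(v, h):.0%}")
```

Under the paper's framing, a relative drop above 40% (as in this toy example) would count as severe test overfitting: the patches satisfy the tests they were generated against but do not generalize.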