🤖 AI Summary
This work investigates whether large language models (LLMs) still suffer from test overfitting in automated program repair, i.e., generating patches that pass the visible tests used to guide repair but fail on unseen, hidden tests. We conduct the first systematic repository-level evaluation of LLMs' generalization capability in repair, using the SWE-bench benchmark and introducing a controlled visible/hidden test split to quantitatively measure overfitting severity. Our experiments reveal that, despite substantial improvements in overall repair accuracy, test overfitting remains pervasive: for several models and tasks, pass rates on hidden tests drop by over 40% relative to visible tests. These findings expose a critical generalization deficiency in current LLM-based repair approaches. Moreover, we establish a reproducible, empirically grounded evaluation paradigm (standardized test partitioning, rigorous metrics, and benchmarked baselines) to guide future efforts toward more robust and trustworthy automated repair systems.
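The overfitting measurement described above can be sketched in a few lines: a patch is *plausible* if it passes all visible tests, and it *overfits* if it is plausible yet fails at least one hidden test. The data layout and function name below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a visible/hidden test-split evaluation.
# The patch-result structure is hypothetical, not from the paper.

def overfitting_rate(patches):
    """Fraction of plausible patches (pass all visible tests)
    that fail at least one hidden test."""
    plausible = [p for p in patches if all(p["visible"].values())]
    if not plausible:
        return 0.0
    overfit = [p for p in plausible if not all(p["hidden"].values())]
    return len(overfit) / len(plausible)

patches = [
    {"visible": {"t1": True, "t2": True}, "hidden": {"h1": True}},   # generalizes
    {"visible": {"t1": True, "t2": True}, "hidden": {"h1": False}},  # overfits
    {"visible": {"t1": False, "t2": True}, "hidden": {"h1": True}},  # not plausible
]
print(overfitting_rate(patches))  # → 0.5
```

Reporting the rate over plausible patches only (rather than all generated patches) isolates generalization failure from plain repair failure, which is the distinction the visible/hidden split is meant to capture.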
📝 Abstract
Automated program repair has been shown to be susceptible to generating repaired code that passes the visible tests but fails on a held-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study how much of a problem test overfitting remains today, using repository-level SWE-bench tasks.