🤖 AI Summary
This paper questions the validity of mainstream benchmarks such as SWE-Bench Verified for evaluating large language models' (LLMs) software engineering capabilities, arguing that they can overestimate genuine reasoning ability because of data contamination and memorization. Method: the authors introduce "file-path identification" as a diagnostic task, predicting the buggy file path from the issue description alone, to decouple memorization from reasoning. They combine zero-shot evaluation with a cross-repository generalization analysis and dataset ablation studies. Results: state-of-the-art models reach up to 76% accuracy on repositories included in SWE-Bench but only up to 53% on unseen repositories, indicating heavy reliance on training-data memorization rather than generalizable reasoning. These findings expose a flaw in current evaluation practice and offer methodological guidance for designing memorization-robust, generalization-aware software engineering benchmarks.
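To make the diagnostic concrete, the sketch below shows one way such a zero-shot file-path probe could be run, assuming an OpenAI-style chat API. The prompt wording, model name, and path normalization are illustrative placeholders, not the authors' exact protocol.

```python
# Minimal sketch of a zero-shot file-path probe: the model sees only the issue
# text, never the repository structure. Prompt and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "You are given a GitHub issue from an open-source Python project.\n"
    "Without seeing the repository, predict the single file most likely to "
    "contain the bug. Answer with one repository-relative path only.\n\n"
    "Issue:\n{issue_text}"
)

def predict_buggy_path(issue_text: str, model: str = "gpt-4o") -> str:
    """Query the model with the issue description alone -- no repo access."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(issue_text=issue_text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def is_correct(predicted: str, gold_path: str) -> bool:
    """Exact match after crude normalization (whitespace, backticks, leading './')."""
    def norm(p: str) -> str:
        return p.strip().strip("`").removeprefix("./")
    return norm(predicted) == norm(gold_path)
```

The gold path would come from the file(s) touched by the reference patch of each benchmark instance; how ties among multi-file patches are scored is left open here.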
📝 Abstract
As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts such as memorized patterns. In this work, we introduce a diagnostic task, file-path identification from issue descriptions alone, to probe models' underlying knowledge. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. On tasks from repositories not included in SWE-Bench, accuracy drops to at most 53%, pointing to possible data contamination or memorization. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
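The seen-versus-unseen comparison can be tallied with a simple split by repository, as in the sketch below. The repository sets and the result format are assumptions for illustration, not the paper's evaluation harness.

```python
# Minimal sketch of the seen-vs-unseen accuracy gap. "Seen" here means the
# repository appears in SWE-Bench Verified; the repo list is a small subset.
from collections import defaultdict

SWE_BENCH_REPOS = {"django/django", "sympy/sympy", "scikit-learn/scikit-learn"}

def accuracy_by_split(results):
    """results: iterable of (repo_name, correct: bool) pairs per instance."""
    hits, totals = defaultdict(int), defaultdict(int)
    for repo, correct in results:
        split = "seen" if repo in SWE_BENCH_REPOS else "unseen"
        totals[split] += 1
        hits[split] += int(correct)
    return {split: hits[split] / totals[split] for split in totals}

# Toy example; a large gap (e.g., 76% vs. 53% in the paper) suggests memorization.
demo = [("django/django", True), ("django/django", True), ("django/django", False),
        ("pandas-dev/pandas", True), ("pandas-dev/pandas", False), ("pandas-dev/pandas", False)]
print(accuracy_by_split(demo))  # {'seen': 0.667, 'unseen': 0.333} (approximately)
```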