The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

📅 2025-06-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper challenges the validity of mainstream benchmarks such as SWE-Bench Verified for evaluating large language models' (LLMs) software engineering capabilities, arguing that they overestimate genuine reasoning ability because of data contamination and memorization. Method: the authors introduce file-path identification as a diagnostic task, predicting the buggy file path from the issue description alone, to decouple memorization from reasoning; they run zero-shot evaluations, cross-repository generalization analysis, and dataset ablation studies. Results: state-of-the-art models reach up to 76% accuracy on repositories included in SWE-Bench but at most 53% on held-out repositories, indicating heavy reliance on training-data memorization rather than generalizable reasoning. These findings expose a fundamental flaw in current evaluation paradigms and provide both theoretical grounding and methodological guidance for designing memorization-robust, generalization-aware software engineering benchmarks.
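The seen-vs-unseen comparison described above can be sketched as a small scoring harness. This is a hypothetical illustration, not the paper's actual code: `predict_path` stands in for an LLM call that maps an issue description to a file path, and the example field names (`issue`, `gold_path`, `repo`) are assumptions.

```python
def normalize(path: str) -> str:
    """Compare paths case-insensitively, ignoring a leading './'."""
    return path.strip().lstrip("./").lower()

def path_accuracy(examples, predict_path):
    """Fraction of issues whose predicted buggy-file path matches the gold path."""
    if not examples:
        return 0.0
    hits = sum(
        normalize(predict_path(ex["issue"])) == normalize(ex["gold_path"])
        for ex in examples
    )
    return hits / len(examples)

def seen_vs_unseen(examples, predict_path, seen_repos):
    """Split accuracy by whether the repository appears in the benchmark.

    A large gap between the two numbers is the paper's signal for
    memorization rather than generalizable reasoning.
    """
    seen = [ex for ex in examples if ex["repo"] in seen_repos]
    unseen = [ex for ex in examples if ex["repo"] not in seen_repos]
    return path_accuracy(seen, predict_path), path_accuracy(unseen, predict_path)
```

Because the model never sees the repository structure, any accuracy well above chance on seen repositories is evidence the mapping was learned during training.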

📝 Abstract
As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce a diagnostic task, file path identification from issue descriptions alone, to probe models' underlying knowledge. We present empirical evidence that performance gains on SWE-Bench Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. On tasks from repositories not included in SWE-Bench, accuracy drops to at most 53%, pointing to possible data contamination or memorization. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs' coding abilities.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' true problem-solving vs memorization in coding tasks
Evaluating data contamination risks in SWE-Bench software engineering benchmarks
Developing robust benchmarks to measure genuine LLM coding capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnostic task for file path identification
Empirical evidence of memorization effects
Case for contamination-resistant benchmark design