🤖 AI Summary
This work addresses the overestimation of code-repair capability in existing test-based benchmarks such as SWE-Bench Verified, where weak test suites erroneously classify semantically incorrect patches as correct. To mitigate this, the authors propose SWE-ABS, a two-stage framework that strengthens test suites: first, coverage-driven test augmentation guided by program slicing; then, mutation testing that uses adversarially generated incorrect patches to expose semantic blind spots (sketched below). Applied to SWE-Bench Verified, SWE-ABS strengthens the test suites of 50.2% of instances and rejects 19.71% of previously accepted erroneous patches, revealing that nearly one in five “resolved” patches is actually incorrect and reducing the top agent’s score from 78.80% to 62.20%. This substantially improves benchmark reliability and triggers a significant reshuffling of the leaderboard.
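The two-stage pipeline can be made concrete with a minimal sketch. All helper names below (`slice_uncovered_code`, `generate_tests_for`, `synthesize_incorrect_patch`, `run_tests`) are hypothetical stand-ins for the paper's components, not SWE-ABS's actual interface; the stubs only illustrate the control flow under those assumptions.

```python
from dataclasses import dataclass, field

# --- Hypothetical stand-ins for SWE-ABS components (not the paper's API). ---

def slice_uncovered_code(repo, tests):
    """Program slicing over coverage data; returns code regions the suite never executes."""
    return []  # stub

def generate_tests_for(target):
    """Generate tests aimed at a code region or at a surviving mutant patch."""
    return []  # stub

def synthesize_incorrect_patch(instance):
    """Produce a plausible but semantically incorrect patch (a 'mutant')."""
    return ""  # stub diff

def run_tests(repo, patch, tests):
    """Apply `patch` to `repo` and run `tests`; True iff every test passes."""
    return False  # stub

@dataclass
class Instance:
    """One benchmark task: a repo snapshot and its current test suite."""
    repo: str
    tests: list = field(default_factory=list)

def strengthen(instance: Instance, n_mutants: int = 10) -> Instance:
    # Stage 1: coverage-driven augmentation. Slicing locates regions the
    # current suite never executes; new tests are generated to cover them.
    for region in slice_uncovered_code(instance.repo, instance.tests):
        instance.tests.extend(generate_tests_for(region))

    # Stage 2: mutation-driven adversarial testing. An incorrect patch that
    # still passes exposes a semantic blind spot; add a test that kills it.
    for _ in range(n_mutants):
        mutant = synthesize_incorrect_patch(instance)
        if run_tests(instance.repo, mutant, instance.tests):  # mutant survives
            instance.tests.extend(generate_tests_for(mutant))
    return instance
```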
📝 Abstract
The SWE-Bench Verified leaderboard is approaching saturation, with the top system achieving 78.80%. However, we show that this performance is inflated. Our re-evaluation reveals that roughly one in five "solved" patches from the top-30 agents is semantically incorrect; such patches pass only because weak test suites fail to expose their errors. We present SWE-ABS, an adversarial framework that strengthens test suites through a two-stage pipeline: (1) coverage-driven augmentation that uses program slicing to target untested code regions, and (2) mutation-driven adversarial testing that synthesizes plausible but incorrect patches to expose semantic blind spots. On SWE-Bench Verified (500 instances), SWE-ABS strengthens the test suites of 50.2% of instances, a 25.1x improvement over prior work, and rejects 19.71% of previously passing patches. As a result, the top agent's score drops from 78.80% to 62.20%, significantly reshuffling the leaderboard and moving the previously top-ranked agent to fifth place.
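As a rough illustration of the re-evaluation step, the sketch below re-scores an agent against the strengthened suites, counting a patch as resolved only if it passes the hardened tests; `run_tests` and the data shapes are illustrative assumptions, not the paper's artifacts.

```python
def rescore(agent_patches: dict, hardened_suites: dict, run_tests) -> float:
    """Percentage of instances resolved under the strengthened test suites.

    agent_patches:   instance_id -> the agent's candidate patch
    hardened_suites: instance_id -> SWE-ABS-strengthened test suite
    run_tests:       callable(instance_id, patch, suite) -> bool (illustrative)
    """
    resolved = sum(
        1
        for inst_id, suite in hardened_suites.items()
        if inst_id in agent_patches
        and run_tests(inst_id, agent_patches[inst_id], suite)
    )
    return 100.0 * resolved / len(hardened_suites)

# Under this scheme, a score of 78.80% on the original suites can fall to
# 62.20% once previously passing but semantically incorrect patches are rejected.
```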