🤖 AI Summary
This work addresses the overestimation of code-repair capability in existing test-based benchmarks such as SWE-Bench Verified, where weak test suites erroneously classify semantically incorrect patches as correct. To mitigate this, the authors propose SWE-ABS, a two-stage framework that strengthens test suites: first, coverage-driven test augmentation guided by program slicing; then, mutation testing that uses adversarially generated incorrect patches to expose semantic blind spots (sketched below). Applied to SWE-Bench Verified, SWE-ABS strengthens the test suites of 50.2% of instances and rejects 19.71% of previously accepted erroneous patches, revealing that nearly one in five “resolved” patches is actually incorrect and reducing the top agent’s score from 78.80% to 62.20%. This substantially improves benchmark reliability and triggers a significant reshuffling of the leaderboard.
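The two-stage pipeline can be made concrete with a minimal sketch. All helper names below (`slice_uncovered_code`, `generate_tests_for`, `synthesize_incorrect_patch`, `run_tests`) are hypothetical stand-ins for the paper's components, not SWE-ABS's actual interface; the stubs only illustrate the control flow under those assumptions.

```python
from dataclasses import dataclass, field

# --- Hypothetical stand-ins for SWE-ABS components (not the paper's API). ---

def slice_uncovered_code(repo, tests):
    """Program slicing over coverage data; returns code regions the suite never executes."""
    return []  # stub

def generate_tests_for(target):
    """Generate tests aimed at a code region or at a surviving mutant patch."""
    return []  # stub

def synthesize_incorrect_patch(instance):
    """Produce a plausible but semantically incorrect patch (a 'mutant')."""
    return ""  # stub diff

def run_tests(repo, patch, tests):
    """Apply `patch` to `repo` and run `tests`; True iff every test passes."""
    return False  # stub

@dataclass
class Instance:
    """One benchmark task: a repo snapshot and its current test suite."""
    repo: str
    tests: list = field(default_factory=list)

def strengthen(instance: Instance, n_mutants: int = 10) -> Instance:
    # Stage 1: coverage-driven augmentation. Slicing locates regions the
    # current suite never executes; new tests are generated to cover them.
    for region in slice_uncovered_code(instance.repo, instance.tests):
        instance.tests.extend(generate_tests_for(region))

    # Stage 2: mutation-driven adversarial testing. An incorrect patch that
    # still passes exposes a semantic blind spot; add a test that kills it.
    for _ in range(n_mutants):
        mutant = synthesize_incorrect_patch(instance)
        if run_tests(instance.repo, mutant, instance.tests):  # mutant survives
            instance.tests.extend(generate_tests_for(mutant))
    return instance
```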
📝 Abstract
The SWE-Bench Verified leaderboard is approaching saturation, with the top system achieving 78.80%. However, we show that this performance is inflated. Our re-evaluation reveals that roughly one in five "solved" patches from the top-30 agents is semantically incorrect; such patches pass only because weak test suites fail to expose their errors. We present SWE-ABS, an adversarial framework that strengthens test suites through a two-stage pipeline: (1) coverage-driven augmentation that uses program slicing to target untested code regions, and (2) mutation-driven adversarial testing that synthesizes plausible but incorrect patches to expose semantic blind spots. On SWE-Bench Verified (500 instances), SWE-ABS strengthens the test suites of 50.2% of instances, a 25.1x improvement over prior work, and rejects 19.71% of previously passing patches. As a result, the top agent's score drops from 78.80% to 62.20%, significantly reshuffling the leaderboard and moving the previously top-ranked agent to fifth place.
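As a rough illustration of the re-evaluation step, the sketch below re-scores an agent against the strengthened suites, counting a patch as resolved only if it passes the hardened tests; `run_tests` and the data shapes are illustrative assumptions, not the paper's artifacts.

```python
def rescore(agent_patches: dict, hardened_suites: dict, run_tests) -> float:
    """Percentage of instances resolved under the strengthened test suites.

    agent_patches:   instance_id -> the agent's candidate patch
    hardened_suites: instance_id -> SWE-ABS-strengthened test suite
    run_tests:       callable(instance_id, patch, suite) -> bool (illustrative)
    """
    resolved = sum(
        1
        for inst_id, suite in hardened_suites.items()
        if inst_id in agent_patches
        and run_tests(inst_id, agent_patches[inst_id], suite)
    )
    return 100.0 * resolved / len(hardened_suites)

# Under this scheme, a score of 78.80% on the original suites can fall to
# 62.20% once previously passing but semantically incorrect patches are rejected.
```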