What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of a systematic understanding of the SWE-Bench benchmark evaluation ecosystem. It presents the first comprehensive analysis of 212 submissions on the SWE-Bench Lite and Verified leaderboards, employing data mining and statistical methods to systematically examine submitting entities, model choices, technical approaches, and code availability. The findings reveal that the ecosystem is predominantly driven by industry, with closed-source large language models—particularly Claude 4 Sonnet—leading current performance rankings, although open-source solutions from academia remain competitive. Beyond mapping the current technical landscape of SWE-Bench, this work underscores the critical importance of enhancing transparency and methodological diversity in the evaluation of large language models for software engineering tasks.

📝 Abstract
The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to the Lite leaderboard and 133 submitted to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly from small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, also remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.
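The paper's analysis amounts to tallying the 212 leaderboard entries along a few dimensions (submitting entity, LLM used, openness). A minimal sketch of that kind of aggregation is shown below; the field names and sample rows are illustrative assumptions, not the authors' data or code.

```python
# Illustrative sketch of tallying leaderboard entries by submitter type,
# LLM family, and open-source status. Records are hypothetical examples
# mirroring the dimensions the paper reports on.
from collections import Counter

entries = [
    {"board": "Lite", "entity": "industry", "llm": "Claude 4 Sonnet", "open": False},
    {"board": "Verified", "entity": "academia", "llm": "GPT-4o", "open": True},
    {"board": "Verified", "entity": "industry", "llm": "Claude 4 Sonnet", "open": False},
]

by_entity = Counter(e["entity"] for e in entries)   # who submits
by_llm = Counter(e["llm"] for e in entries)         # which models dominate
open_share = sum(e["open"] for e in entries) / len(entries)

print(by_entity)             # e.g. Counter({'industry': 2, 'academia': 1})
print(by_llm.most_common(1)) # most frequently used LLM
print(f"open-source share: {open_share:.0%}")
```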
Problem

Research questions and friction points this paper is trying to address.

Automated Program Repair
SWE-Bench
benchmark
large language models
openness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Program Repair
SWE-Bench
Large Language Models
Benchmark Analysis
Claude