SWE-bench Goes Live!

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing static benchmarks like SWE-bench suffer from outdated repositories, narrow repository coverage, and heavy reliance on manual curation, which leads to poor scalability, overfitting, and data contamination. To address these limitations, we introduce SWE-bench-Live, the first open-source, live-updating benchmark for issue resolution. Its initial release provides 1,319 executable tasks derived from real GitHub issues posted since 2024, spanning 93 actively maintained repositories. An end-to-end automation pipeline integrates dynamic issue crawling, automated Docker environment provisioning, LLM-driven instance validation, multi-dimensional difficulty modeling, and a controllable evaluation protocol. This design mitigates data contamination while overcoming the temporal and scalability bottlenecks inherent in static benchmarks. Experiments across state-of-the-art LLMs and autonomous agents reveal a substantial optimistic bias in static evaluations. SWE-bench-Live establishes a contamination-resistant, dynamically evolving evaluation standard aligned with real-world software evolution.
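As a rough illustration of the issue-crawling stage described above (not the paper's actual pipeline), the sketch below queries the public GitHub search API for closed issues created since 2024 in a single repository. The repository name, date cutoff, and query filters are placeholder assumptions.

```python
# Hypothetical sketch of a "dynamic issue crawling" stage (not the paper's code).
# Lists closed issues (not PRs) created after a cutoff date via the GitHub search API.
import os
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"
REPO = "psf/requests"      # placeholder repository
CUTOFF = "2024-01-01"      # only issues created on or after this date

def crawl_recent_issues(repo: str, cutoff: str, per_page: int = 50) -> list[dict]:
    """Return closed issues created after `cutoff` in `repo`, newest first."""
    query = f"repo:{repo} is:issue is:closed created:>={cutoff}"
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # optional token to raise rate limits
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        GITHUB_SEARCH,
        params={"q": query, "per_page": per_page, "sort": "created", "order": "desc"},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

if __name__ == "__main__":
    for issue in crawl_recent_issues(REPO, CUTOFF):
        print(issue["number"], issue["title"])
```

A real pipeline would additionally link each issue to the pull request that closed it and to the tests that PR changed, which is beyond this sketch.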

📝 Abstract
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
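To make the per-task Docker image concrete, here is a minimal, hypothetical sketch of evaluating a candidate patch inside such a container. The image tag, the /workspace working directory, and the pytest command are assumptions for illustration, not the benchmark's actual harness.

```python
# Hypothetical sketch of running a candidate patch inside a per-task Docker image.
# The image tag, workdir, and test command below are illustrative assumptions.
import subprocess
from pathlib import Path

def evaluate_patch(image: str, patch_file: Path, test_cmd: str = "python -m pytest -x") -> bool:
    """Apply `patch_file` inside `image` and return True if the test command passes."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_file.resolve()}:/tmp/fix.patch:ro",  # mount the candidate patch read-only
        image,
        "bash", "-lc",
        f"cd /workspace && git apply /tmp/fix.patch && {test_cmd}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    ok = evaluate_patch("swebench-live/example-task:latest", Path("model_patch.diff"))
    print("resolved" if ok else "unresolved")
```

A real harness would typically run only the task's designated fail-to-pass tests rather than the full suite, and would capture test logs for later inspection.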
Problem

Research questions and friction points this paper addresses.

Evaluating LLMs' ability to fix real-world software bugs
Overcoming limitations of outdated and narrow benchmarks
Enabling scalable, automated, and contamination-resistant evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live-updatable benchmark for real-world bugs
Automated curation pipeline for scalability (see the sketch after this list)
Docker images ensure reproducible execution
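The curation pipeline's LLM-driven validation step can be pictured as a simple accept/reject check on candidate issue-and-patch pairs. The sketch below is a loose approximation under assumed choices: the OpenAI chat-completions client and the gpt-4o model name are stand-ins, and the prompt criteria are illustrative rather than the paper's.

```python
# Hypothetical sketch of LLM-driven instance validation; the model, SDK, and prompt
# are illustrative assumptions, not the paper's actual validation logic.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@dataclass
class CandidateInstance:
    issue_title: str
    issue_body: str
    fix_patch: str
    test_patch: str

def is_valid_instance(cand: CandidateInstance) -> bool:
    """Ask an LLM whether the issue is a concrete bug that the linked patch and tests resolve."""
    prompt = (
        "You are screening candidate tasks for a bug-fixing benchmark.\n"
        f"Issue title: {cand.issue_title}\n"
        f"Issue body:\n{cand.issue_body}\n\n"
        f"Fix patch:\n{cand.fix_patch}\n\n"
        f"Test changes:\n{cand.test_patch}\n\n"
        "Is the issue a concrete, reproducible bug report, and do the patch and tests "
        "plausibly resolve and verify it? Answer only 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    return answer.startswith("yes")
```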
Authors
Linghao Zhang (Microsoft)
Shilin He (Microsoft Research)
Chaoyun Zhang (Microsoft)
Yu Kang (Microsoft)
Bowen Li (Shanghai Artificial Intelligence Laboratory)
Chengxing Xie (Shanghai Artificial Intelligence Laboratory)
J. Wang (Microsoft)
Maoquan Wang (Microsoft)
Yufan Huang (Microsoft)
Shengyu Fu (Microsoft)
Elsie Nallipogu (Microsoft)
Qingwei Lin (Microsoft)
Yingnong Dang (Microsoft)
S. Rajmohan (Microsoft)
Dongmei Zhang (Microsoft Research)