SWE-bench Goes Live!

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing static benchmarks like SWE-bench suffer from outdated repositories, narrow repository coverage, and heavy reliance on manual curation, which leads to poor scalability, overfitting, and data contamination. To address these limitations, we introduce SWE-bench-Live, the first open-source, live-updating benchmark for issue resolution. Its initial release provides 1,319 executable tasks derived from real GitHub issues posted since 2024, spanning 93 actively maintained repositories. An end-to-end automation pipeline integrates dynamic issue crawling, automated Docker environment provisioning, LLM-driven instance validation, multi-dimensional difficulty modeling, and a controllable evaluation protocol. This design mitigates data contamination while overcoming the temporal and scalability bottlenecks inherent in static benchmarks. Experiments across state-of-the-art LLMs and autonomous agents reveal a substantial optimistic bias in static evaluations. SWE-bench-Live establishes a contamination-resistant, dynamically evolving evaluation standard aligned with real-world software evolution.
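As a rough illustration of the issue-crawling stage described above (not the paper's actual pipeline), the sketch below queries the public GitHub search API for closed issues created since 2024 in a single repository. The repository name, date cutoff, and query filters are placeholder assumptions.

```python
# Hypothetical sketch of a "dynamic issue crawling" stage (not the paper's code).
# Lists closed issues (not PRs) created after a cutoff date via the GitHub search API.
import os
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"
REPO = "psf/requests"      # placeholder repository
CUTOFF = "2024-01-01"      # only issues created on or after this date

def crawl_recent_issues(repo: str, cutoff: str, per_page: int = 50) -> list[dict]:
    """Return closed issues created after `cutoff` in `repo`, newest first."""
    query = f"repo:{repo} is:issue is:closed created:>={cutoff}"
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # optional token to raise rate limits
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        GITHUB_SEARCH,
        params={"q": query, "per_page": per_page, "sort": "created", "order": "desc"},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

if __name__ == "__main__":
    for issue in crawl_recent_issues(REPO, CUTOFF):
        print(issue["number"], issue["title"])
```

A real pipeline would additionally link each issue to the pull request that closed it and to the tests that PR changed, which is beyond this sketch.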

📝 Abstract
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
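To make the per-task Docker image concrete, here is a minimal, hypothetical sketch of evaluating a candidate patch inside such a container. The image tag, the /workspace working directory, and the pytest command are assumptions for illustration, not the benchmark's actual harness.

```python
# Hypothetical sketch of running a candidate patch inside a per-task Docker image.
# The image tag, workdir, and test command below are illustrative assumptions.
import subprocess
from pathlib import Path

def evaluate_patch(image: str, patch_file: Path, test_cmd: str = "python -m pytest -x") -> bool:
    """Apply `patch_file` inside `image` and return True if the test command passes."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{patch_file.resolve()}:/tmp/fix.patch:ro",  # mount the candidate patch read-only
        image,
        "bash", "-lc",
        f"cd /workspace && git apply /tmp/fix.patch && {test_cmd}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    ok = evaluate_patch("swebench-live/example-task:latest", Path("model_patch.diff"))
    print("resolved" if ok else "unresolved")
```

A real harness would typically run only the task's designated fail-to-pass tests rather than the full suite, and would capture test logs for later inspection.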
Problem

Research questions and friction points this paper addresses.

Evaluating LLMs' ability to fix real-world software bugs
Overcoming limitations of outdated and narrow benchmarks
Enabling scalable, automated, and contamination-resistant evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live-updatable benchmark for real-world bugs
Automated curation pipeline for scalability (see the sketch after this list)
Docker images ensure reproducible execution
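The curation pipeline's LLM-driven validation step can be pictured as a simple accept/reject check on candidate issue-and-patch pairs. The sketch below is a loose approximation under assumed choices: the OpenAI chat-completions client and the gpt-4o model name are stand-ins, and the prompt criteria are illustrative rather than the paper's.

```python
# Hypothetical sketch of LLM-driven instance validation; the model, SDK, and prompt
# are illustrative assumptions, not the paper's actual validation logic.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@dataclass
class CandidateInstance:
    issue_title: str
    issue_body: str
    fix_patch: str
    test_patch: str

def is_valid_instance(cand: CandidateInstance) -> bool:
    """Ask an LLM whether the issue is a concrete bug that the linked patch and tests resolve."""
    prompt = (
        "You are screening candidate tasks for a bug-fixing benchmark.\n"
        f"Issue title: {cand.issue_title}\n"
        f"Issue body:\n{cand.issue_body}\n\n"
        f"Fix patch:\n{cand.fix_patch}\n\n"
        f"Test changes:\n{cand.test_patch}\n\n"
        "Is the issue a concrete, reproducible bug report, and do the patch and tests "
        "plausibly resolve and verify it? Answer only 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    return answer.startswith("yes")
```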
Authors
Linghao Zhang (Microsoft)
Shilin He (Microsoft Research)
Chaoyun Zhang (Microsoft)
Yu Kang (Microsoft)
Bowen Li (Shanghai Artificial Intelligence Laboratory)
Chengxing Xie (Shanghai Artificial Intelligence Laboratory)
J. Wang (Microsoft)
Maoquan Wang (Microsoft)
Yufan Huang (Microsoft)
Shengyu Fu (Microsoft)
Elsie Nallipogu (Microsoft)
Qingwei Lin (Microsoft)
Yingnong Dang (Microsoft)
S. Rajmohan (Microsoft)
Dongmei Zhang (Microsoft Research)