SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Constructing high-quality GitHub issue-resolution datasets is critical for evaluating LLMs' software engineering (SE) capabilities, yet conventional manual construction suffers from high costs and low efficiency in environment setup, result grading, and task validation. This paper proposes SWE-Factory, an end-to-end automated pipeline that (1) introduces SWE-Builder, a multi-agent collaborative framework for constructing reproducible, language-agnostic evaluation environments; (2) designs a universal, exit-code-based automatic grading mechanism that achieves 100% accuracy against manual inspection; and (3) automates fail2pass validation, a high-precision task-verification step with a recall of 1.00 and a precision of 0.92. Evaluated on 671 real-world GitHub issues across four programming languages, the pipeline constructs valid task instances for as little as $0.024 each (with Gemini-2.5-Flash). This work substantially lowers the barrier to building high-fidelity SE benchmarks and establishes a scalable, reproducible paradigm for training and evaluating LLMs' software engineering proficiency.

📝 Abstract
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges through three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents working in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit-code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-Flash it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
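The exit-code-based grading described in the abstract can be illustrated with a minimal sketch. This is not the repository's actual implementation; the function name `grade_with_exit_code` is hypothetical. The idea is that a uniform convention (exit code 0 means all tests pass, nonzero means failure) replaces per-repository log parsers.

```python
import subprocess
import sys


def grade_with_exit_code(test_cmd, cwd=".", timeout=600):
    """Grade a test run purely by its process exit code.

    Hypothetical sketch: relies only on the POSIX convention that a test
    runner exits 0 on success and nonzero on failure, so no per-project
    output parser is needed.
    """
    try:
        result = subprocess.run(test_cmd, cwd=cwd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "FAIL"  # treat a hung test run as a failure
    return "PASS" if result.returncode == 0 else "FAIL"


# Any test runner that follows the exit-code convention works unchanged:
print(grade_with_exit_code([sys.executable, "-c", "pass"]))  # PASS
```

Because pytest, Go's `go test`, Maven, and most other runners already follow this convention, the same grader applies across languages.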
Problem

Research questions and friction points this paper is trying to address.

Automates creation of GitHub issue resolution datasets for LLMs
Replaces labor-intensive benchmark setup and validation processes
Provides cost-effective solution for training data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated multi-agent environment construction
Standardized exit-code-based grading method
Automated fail2pass validation process
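The fail2pass criterion listed above can be sketched as follows. This is a simplified illustration under assumed interfaces (`apply_patch`/`revert_patch` callables are hypothetical stand-ins for the pipeline's patch handling), not the released implementation: a task instance is valid only if its tests fail before the gold patch and pass after it.

```python
import subprocess
import sys


def run_tests(test_cmd, cwd, timeout=600):
    """Return True if the test command exits 0 (pass), False otherwise."""
    try:
        return subprocess.run(test_cmd, cwd=cwd, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False


def is_fail2pass(test_cmd, repo_dir, apply_patch, revert_patch):
    """Hypothetical fail2pass check: tests must FAIL before the gold
    patch is applied and PASS after it, using exit codes as the signal."""
    failed_before = not run_tests(test_cmd, repo_dir)
    apply_patch()
    passed_after = run_tests(test_cmd, repo_dir)
    revert_patch()  # leave the repository in its pre-patch state
    return failed_before and passed_after
```

Instances that already pass before the patch (no bug reproduced) or still fail after it (patch ineffective) are filtered out, which is what yields a high-precision validated dataset.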