SWE-smith: Scaling Data for Software Engineering Agents

📅 2025-04-30
📈 Citations: 0
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
In software engineering, training language models faces three key bottlenecks: data scarcity, high construction costs (hundreds of human hours plus terabyte-scale infrastructure), and limited scale (at most 11 repositories and thousands of samples). This paper introduces the first end-to-end automated framework for repository-level training data generation targeting arbitrary Python codebases. It integrates static program analysis, test-driven task synthesis, and sandboxed execution environment construction to fully automate the pipeline from source code to annotated repair tasks. Applied to 128 GitHub repositories, it generates 50k high-quality instances, over an order of magnitude more than prior work. Fine-tuning SWE-agent-LM-32B on this dataset achieves a 40.2% Pass@1 resolve rate on SWE-bench Verified, a new state of the art among open-source models. The complete toolchain and dataset are publicly released, substantially lowering the barrier to entry for AI-for-software-engineering research.
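The "sandboxed execution" step above means each candidate task is validated by running the repository's tests in an isolated environment: the original code must pass, the bug-injected variant must fail. A minimal sketch of that check, using a temporary directory and a fresh interpreter as a stand-in for the paper's container-based environments (the `run_tests_in_sandbox` helper and the toy module are illustrative, not from the paper):

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_tests_in_sandbox(module_src: str, test_src: str) -> bool:
    """Write a module and its tests to a throwaway directory and run them
    in a fresh interpreter -- a stand-in for containerized isolation."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "mylib.py").write_text(module_src)
        Path(tmp, "test_mylib.py").write_text(test_src)
        proc = subprocess.run(
            [sys.executable, "-m", "unittest", "test_mylib"],
            cwd=tmp, capture_output=True,
        )
        return proc.returncode == 0

GOOD = "def add(a, b):\n    return a + b\n"
BUGGY = "def add(a, b):\n    return a - b\n"  # injected fault

TESTS = textwrap.dedent("""
    import unittest
    from mylib import add

    class TestAdd(unittest.TestCase):
        def test_add(self):
            self.assertEqual(add(2, 3), 5)
""")

# A usable repair task: clean code passes, the variant breaks the tests.
print(run_tests_in_sandbox(GOOD, TESTS))   # True
print(run_tests_in_sandbox(BUGGY, TESTS))  # False
```

Only variants that flip at least one test from pass to fail become training instances; the rest are discarded as no-ops.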

📝 Abstract
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
Problem

Research questions and friction points this paper is trying to address.

Limited training data for software engineering LMs
Complex curation process with high human labor
Small existing datasets with scalability issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically synthesizes task instances from codebases
Constructs execution environments for Python codebases
Generates large-scale software engineering training data