SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

📅 2025-06-09

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Software engineering (SWE) agents are bottlenecked by a scarcity of high-quality training data and a lack of reliable test cases. Method: This paper introduces an open-source large language model agent framework tailored to realistic SWE tasks. Its core innovations are (1) a robust pipeline for synthesizing *verification-aware* test cases, intended to ensure functional correctness and behavioral fidelity, and (2) a scalable data construction methodology that integrates trajectory distillation, tool-augmented reasoning, and synthetic-test-driven reinforcement learning to generate high-quality agent trajectories. Contribution/Results: On the SWE-bench-Verified benchmark, the released models, SWE-Dev 7B and SWE-Dev 32B, achieve success rates of 23.4% and 36.6%, respectively, which the paper reports as state of the art among open-source SWE agents. All code, model weights, and training data are publicly released to support reproducibility.

📝 Abstract

Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
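
To put the reported success rates in absolute terms, the short sketch below converts them to instance counts. It assumes the benchmark contains 500 task instances, which is the published size of SWE-bench Verified but is not stated in the abstract. The function name `resolved_count` is purely illustrative.

```python
# Assumption: SWE-bench Verified contains 500 task instances
# (this figure is not stated in the abstract).
BENCHMARK_SIZE = 500


def resolved_count(success_rate_percent: float, total: int = BENCHMARK_SIZE) -> int:
    """Convert a success rate in percent to a whole number of resolved instances."""
    return round(success_rate_percent * total / 100)


# Success rates as reported in the abstract.
reported = {"SWE-Dev 7B": 23.4, "SWE-Dev 32B": 36.6}
counts = {model: resolved_count(rate) for model, rate in reported.items()}
```

Under that assumption, the two rates correspond to 117 and 183 resolved instances, respectively.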

Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality training data for SWE agents

Insufficient effective test cases for patch evaluation

Challenges in scaling agent trajectories for training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes test cases for patch evaluation

Scales agent trajectories for training data

Uses open-source LLMs for software engineering
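
One common way to use synthesized tests for patch evaluation is a fail-to-pass check: a test is kept only if it fails on the unpatched repository and passes once a reference fix is applied, and a candidate patch counts as successful when every kept test passes. The sketch below illustrates that idea and how it could be used to filter agent trajectories into training data. It is an assumption made for illustration, not the paper's confirmed procedure; `SynthesizedTest`, `select_fail_to_pass`, `patch_resolves`, and `filter_trajectories` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesizedTest:
    """Outcome of one synthesized test on the original and the fixed repository."""
    name: str
    passes_before_fix: bool
    passes_after_fix: bool


def select_fail_to_pass(tests):
    """Keep tests that fail before the reference fix and pass after it."""
    return [t.name for t in tests if not t.passes_before_fix and t.passes_after_fix]


def patch_resolves(kept_tests, candidate_outcomes):
    """A candidate patch resolves the issue if every kept test passes.

    With no kept tests there is nothing to verify against, so the patch
    is treated as unresolved (a conservative, hypothetical choice).
    """
    return bool(kept_tests) and all(
        candidate_outcomes.get(name, False) for name in kept_tests
    )


def filter_trajectories(trajectories, kept_tests):
    """Retain only trajectories whose final patch passes all kept tests."""
    return [t for t in trajectories if patch_resolves(kept_tests, t["test_outcomes"])]


# Illustrative data only; none of these names come from the paper.
synthesized = [
    SynthesizedTest("test_issue_repro", False, True),   # kept
    SynthesizedTest("test_already_green", True, True),  # dropped: passes before the fix
    SynthesizedTest("test_broken", False, False),       # dropped: fails after the fix
]
kept = select_fail_to_pass(synthesized)

trajectories = [
    {"id": "traj-1", "test_outcomes": {"test_issue_repro": True}},
    {"id": "traj-2", "test_outcomes": {"test_issue_repro": False}},
]
accepted = [t["id"] for t in filter_trajectories(trajectories, kept)]
```

Here only `test_issue_repro` survives the fail-to-pass check, so only `traj-1` would be retained. Treating an empty test set as "unresolved" keeps unverifiable trajectories out of the training data, at the cost of discarding some that may have been correct.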

🔎 Similar Papers

Large Language Model-Based Agents for Software Engineering: A Survey