SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the prevalent yet systematically underexplored problem of feature-driven development (FDD) in real-world software engineering. We introduce SWE-Dev, the first large-scale benchmark for FDD—comprising 14K training and 500 test instances—each accompanied by a reproducible sandbox environment and developer-written, executable unit tests. We propose a verifiable evaluation framework tailored to functional development, enabling, for the first time, automated reward signal generation directly from unit test outcomes to support both supervised fine-tuning (SFT) and test-feedback-driven reinforcement learning (RL). Experimental results show that Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the challenging test set; in contrast, a 7B model fine-tuned on SWE-Dev matches GPT-4o’s performance, empirically validating both the high quality of our data and the efficacy of our methodology.

📝 Abstract
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enables a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/justLittleWhite/SWE-Dev.
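The Pass@3 figures quoted in the abstract follow the standard pass@k protocol for code generation. A minimal sketch of the widely used unbiased pass@k estimator (the exact sampling protocol SWE-Dev uses may differ): given n generated solutions per task, of which c pass the unit tests, it computes the probability that at least one of k randomly drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes the unit tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 5, 3)` estimates pass@3 from 10 generations with 5 passing solutions.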
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous coding systems for feature-driven software development
Addressing underexplored real-world feature development in large codebases
Providing a high-quality dataset for training and testing AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for feature-driven development tasks
Runnable environments with executable unit tests
Enables Supervised Fine-Tuning and Reinforcement Learning
Yaxin Du
Shanghai Jiao Tong University
federated learning, LLM agents
Yuzhu Cai
Beijing University of Aeronautics and Astronautics
Yifan Zhou
Soochow University
Cheng Wang
Shanghai Jiao Tong University
Yu Qian
Shanghai Jiao Tong University
Xianghe Pang
Shanghai Jiao Tong University
LLM agents
Qian Liu
TikTok
Yue Hu
University of Michigan
Siheng Chen
Shanghai Jiao Tong University
collective intelligence, LLM agents, graph signal processing, collaborative perception