🤖 AI Summary
Software engineering (SWE) datasets suffer from costly manual curation and limited scale. Method: This paper proposes an incremental, automated data-cleaning and generation pipeline that leverages real-world GitHub tasks to construct a high-quality SWE dataset of 10,169 Python tasks, supporting long-context and multi-turn interactive code generation. The pipeline pairs natural-language task descriptions with dedicated execution-environment images to enable fully automated unit-test validation. Contribution/Results: Empirical analysis reveals a sustained data-scaling trend in SWE capability, with no saturation as data volume increases, the first such characterization in the field. The Skywork-SWE-32B model trained on this dataset achieves 38.0% pass@1 on SWE-bench Verified, improving to 47.0% under test-time scaling, outperforming all open-source models at or below 32B parameters and establishing a new state of the art within the Qwen2.5-Coder-32B series.
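The automated unit-test validation step can be sketched as a fail-to-pass filter: a candidate task is kept only if some unit test fails before the gold patch is applied and all tests pass afterward. This is a minimal illustrative sketch under that assumption; the function name and the representation of test results are hypothetical, not taken from the paper.

```python
def is_valid_instance(pre_patch: dict[str, bool], post_patch: dict[str, bool]) -> bool:
    """Decide whether a candidate SWE task instance passes fail-to-pass
    validation. Each dict maps a unit-test id to True (passed) / False
    (failed), as observed inside the task's runtime-environment image."""
    # At least one test must fail before the gold patch; otherwise there
    # is nothing for the patch to fix and the instance is trivial.
    fail_to_pass = [test for test, passed in pre_patch.items() if not passed]
    if not fail_to_pass:
        return False
    # After applying the gold patch, every test must pass: the patch
    # repairs the failing tests without breaking previously passing ones.
    return all(post_patch.values())
```

In a full pipeline, the two result dicts would come from executing the repository's test suite inside the dedicated container image before and after applying the patch.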
📝 Abstract
Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To address this, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data-scaling phenomenon: the model's software engineering capability continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
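The pass@1 figures above follow the standard pass@k evaluation protocol introduced with Codex (Chen et al., 2021): sample n solutions per task, count the c that pass the unit tests, and compute an unbiased estimate of the probability that at least one of k samples succeeds. A minimal sketch of that standard estimator (a well-known formula, not code released with this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n attempts of which c are
    correct, passes the unit tests.  pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single rollout per task (n = k = 1), as in the 38.0% result, the estimator reduces to the plain fraction of tasks solved; test-time scaling corresponds to raising n and selecting among rollouts.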