🤖 AI Summary
Software engineering (SWE) datasets suffer from costly manual curation and limited scale. Method: This paper proposes an incremental, automated data-cleaning and generation pipeline that leverages real-world GitHub tasks to construct a high-quality SWE dataset of 10,169 Python tasks, supporting long-context and multi-turn interactive code generation. The pipeline pairs natural-language task descriptions with dedicated execution-environment images to enable fully automated unit-test validation. Contribution/Results: Empirical analysis reveals a sustained data-scaling trend in SWE capability, with no saturation as data volume increases, the first such characterization in the field. The Skywork-SWE-32B model trained on this dataset achieves 38.0% pass@1 on SWE-bench Verified, improving to 47.0% under test-time scaling, outperforming all open-source models at or below 32B parameters and establishing a new state of the art within the Qwen2.5-Coder-32B series.
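The automated unit-test validation step can be sketched as a fail-to-pass filter: a candidate task is kept only if some unit test fails before the gold patch is applied and all tests pass afterward. This is a minimal illustrative sketch under that assumption; the function name and the representation of test results are hypothetical, not taken from the paper.

```python
def is_valid_instance(pre_patch: dict[str, bool], post_patch: dict[str, bool]) -> bool:
    """Decide whether a candidate SWE task instance passes fail-to-pass
    validation. Each dict maps a unit-test id to True (passed) / False
    (failed), as observed inside the task's runtime-environment image."""
    # At least one test must fail before the gold patch; otherwise there
    # is nothing for the patch to fix and the instance is trivial.
    fail_to_pass = [test for test, passed in pre_patch.items() if not passed]
    if not fail_to_pass:
        return False
    # After applying the gold patch, every test must pass: the patch
    # repairs the failing tests without breaking previously passing ones.
    return all(post_patch.values())
```

In a full pipeline, the two result dicts would come from executing the repository's test suite inside the dedicated container image before and after applying the patch.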
📝 Abstract
Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To address this, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data-scaling phenomenon: the model's software engineering capability continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
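The pass@1 figures above follow the standard pass@k evaluation protocol introduced with Codex (Chen et al., 2021): sample n solutions per task, count the c that pass the unit tests, and compute an unbiased estimate of the probability that at least one of k samples succeeds. A minimal sketch of that standard estimator (a well-known formula, not code released with this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n attempts of which c are
    correct, passes the unit tests.  pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single rollout per task (n = k = 1), as in the 38.0% result, the estimator reduces to the plain fraction of tasks solved; test-time scaling corresponds to raising n and selecting among rollouts.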