SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 5
Influential: 2
🤖 AI Summary
This work explores an efficient, lightweight paradigm for addressing software engineering tasks using only supervised fine-tuning (SFT), without relying on reinforcement learning or complex alignment techniques. To this end, we construct a high-quality hybrid dataset combining real-world and synthetically generated samples, and introduce several novel components: an error-masking mechanism, a software engineering–oriented curriculum learning strategy based on task difficulty, and a test-time scaling (TTS) approach integrated with trajectory validation. Our method achieves state-of-the-art performance among open-source models on SWE-bench Verified: SWE-Lego-Qwen3-8B and SWE-Lego-Qwen3-32B attain pass rates of 42.2% and 52.6%, respectively, which further improve to 49.6% and 58.8% under TTS@16.
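Two of these components are concrete enough to sketch. Below is a minimal, hypothetical illustration of what error masking and a difficulty-based curriculum could look like in an SFT pipeline; the per-step error flags, the difficulty field, and all function names are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss_with_error_masking(logits: torch.Tensor,
                                labels: torch.Tensor,
                                error_mask: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy that skips tokens from erroneous steps.

    logits:     (batch, seq, vocab) model outputs
    labels:     (batch, seq) next-token targets; -100 marks padding
    error_mask: (batch, seq) 1 for tokens inside an agent step flagged
                as erroneous (hypothetical annotation), 0 elsewhere
    """
    # Masked tokens stay in the context the model conditions on,
    # but contribute no gradient to the loss.
    masked_labels = labels.masked_fill(error_mask.bool(), -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        masked_labels.reshape(-1),
        ignore_index=-100,
    )

def curriculum_order(task_instances: list[dict]) -> list[dict]:
    """Difficulty-based curriculum: schedule easier issues first.

    Assumes each instance carries a numeric 'difficulty' score
    (hypothetical), e.g. an estimated resolve rate or patch size.
    """
    return sorted(task_instances, key=lambda t: t["difficulty"])
```

The appeal of the masking trick is that imperfect trajectories still teach the model what repository states and tool outputs look like, without teaching it to imitate the flawed actions themselves.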

📝 Abstract
We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-of-the-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k high-quality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building blocks alone, SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. Based on a well-trained verifier, SWE-Lego models can be significantly boosted: for example, from 42.2% to 49.6% and from 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
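The TTS@16 figures suggest a best-of-n selection scheme: sample 16 candidate trajectories per issue and let the trained verifier decide which patch to submit. A minimal sketch of that loop, assuming hypothetical policy.rollout and verifier.score interfaces (the verifier validates whole trajectories, which is why scoring takes the full rollout rather than the patch alone):

```python
def tts_best_of_n(issue, policy, verifier, n: int = 16):
    """Best-of-n test-time scaling for one issue.

    Samples n full trajectories from the policy, scores each with the
    verifier, and returns the patch from the highest-scoring rollout.
    `policy.rollout` and `verifier.score` are placeholder interfaces,
    not APIs from the paper.
    """
    candidates = [policy.rollout(issue) for _ in range(n)]
    best = max(candidates, key=lambda traj: verifier.score(issue, traj))
    return best.patch
```

Since the n rollouts are independent, they can be sampled in parallel; the verifier pass is the only sequential step.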
Problem

Research questions and friction points this paper is trying to address.

supervised fine-tuning
software issue resolving
SWE-bench
lightweight training
code repair
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Fine-tuning
Software Issue Resolving
Error Masking
Curriculum Learning
Test-Time Scaling
👥 Authors
Chaofan Tao, Huawei Technologies
Jierun Chen, HKUST (Multi-modal Models, Large Language Models, Efficient AI)
Yuxin Jiang, Huawei Technologies
Kaiqi Kou, Huawei Technologies
Shaowei Wang, Huawei Technologies
Ruoyu Wang, NTU
Xiaohui Li, Huawei Technologies
Sidi Yang, The University of Hong Kong (Computer Vision)
Yiming Du, CUHK
Jianbo Dai, Huawei Technologies
Zhiming Mao, CUHK
Xinyu Wang, Huawei Technologies
Lifeng Shang, Huawei Noah's Ark Lab (Machine Learning, Computer Vision, Pattern Recognition, Natural Language Processing)
Haoli Bai, Huawei Technologies (Natural Language Processing, Model Compression)