From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of open-source software engineering agents on complex codebase-understanding and repair tasks, where they often deliver subpar performance at high computational cost. To overcome these challenges, the authors propose a two-stage supervised fine-tuning paradigm: a model is first trained on large-scale execution-free trajectories to capture repository-level code semantics, then efficiently fine-tuned on a small set of execution-feedback-augmented trajectories. Built on the Qwen2.5-Coder architecture and distilled from Qwen3-Coder-480B trajectories, the resulting model, SWE-HERO-32B, achieves a state-of-the-art 62.2% resolution rate on SWE-bench Verified among open-source models of comparable size. Notably, despite being trained exclusively on Python data, it reaches 44.1% zero-shot on SWE-bench Multilingual, setting a new benchmark for cross-language generalization among open-source code LLMs.
📝 Abstract
We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.
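The two-stage recipe described above can be sketched in a few lines. This is a hypothetical illustration, not the released training code: `sft` is a toy stand-in for a supervised fine-tuning step, and filtering stage-2 data to trajectories whose patches passed execution is an assumption about how the execution feedback is used.

```python
# Hypothetical sketch of the SWE-ZERO -> SWE-HERO two-stage SFT recipe.
# `sft`, the trajectory dicts, and the pass/fail filtering are illustrative
# assumptions, not the paper's released implementation.

def sft(model, trajectories):
    """Toy stand-in for one supervised fine-tuning stage: just records
    how many trajectories the stage trained on."""
    model["stages"].append(len(trajectories))
    return model

def two_stage_sft(model, zero_trajs, hero_trajs):
    # Stage 1 (SWE-ZERO): large-scale, execution-free trajectories teach
    # repository-level code semantics and reasoning.
    model = sft(model, zero_trajs)
    # Stage 2 (SWE-HERO): a much smaller set of execution-backed
    # trajectories; here we assume only those with passing execution
    # feedback are kept for refinement.
    passing = [t for t in hero_trajs if t["execution_feedback"] == "pass"]
    model = sft(model, passing)
    return model
```

The key design point the sketch captures is the scale asymmetry: a large execution-free stage (300k trajectories in the paper) builds semantic competence cheaply, and a small execution-backed stage (13k trajectories) grounds it in verified engineering workflows.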
Problem

Research questions and friction points this paper is trying to address.

Software Engineering Agents
Code Reasoning
Execution-based Fine-tuning
Cross-language Generalization
SWE-bench
Innovation

Methods, ideas, or system contributions that make the work stand out.

Execution-free Fine-tuning
Execution-based Refinement
Software Engineering Agents
Trajectory Distillation
Multilingual Generalization