🤖 AI Summary
To address the challenge of scaling and rigorously validating training data for real-world code problems, this paper introduces SWE-Mirror: a framework for semantically migrating authentic GitHub issues into repositories that already have configured Gym environments. Through semantic issue extraction, cross-repository mirroring, and task regeneration, SWE-Mirror automatically constructs executable, verifiable tasks across 40 open-source repositories in 4 programming languages. This sidesteps two key bottlenecks: the low success rate and high overhead of configuring Gym environments for real-world issues, and the lack of real-world grounding in purely synthetic tasks. The resulting dataset comprises 60,671 execution-verified tasks. Empirically, fine-tuning Qwen2.5-Coder-Instruct (7B and 32B) on this data yields substantial gains on SWE-Bench-Verified, improving the resolve rate by +21.8% and +46.0% respectively and setting a new state of the art (SOTA) among Qwen2.5-Coder-Instruct-based models on the OpenHands agent framework.
📝 Abstract
Creating large-scale verifiable training datasets for issue-resolving tasks is a critical yet notoriously difficult challenge. Existing methods for automating the Gym environment setup process for real-world issues suffer from low success rates and high overhead. Meanwhile, synthesizing new tasks within existing Gym environments leaves the vast pool of authentic, human-reported problems untapped. To make full use of existing Gym environments as well as the rich issue-resolving history on GitHub, we introduce SWE-Mirror, a pipeline that distills a real-world issue's semantic essence, mirrors it into another repository with a configured Gym environment, and re-animates it there as a verifiable issue-resolving task. By reusing existing Gym environments together with the vast pool of issue-resolving history hosted on GitHub, SWE-Mirror constructs a large-scale dataset of mirrored, authentic, and verifiable tasks. Applying SWE-Mirror to 40 repositories across 4 languages, we curated a dataset of 60,671 issue-resolving tasks and demonstrated its value by training and evaluating coding agents at various scales. Post-training experiments show that models trained on the dataset exhibit improved issue-resolving capabilities. Furthermore, by scaling the training set to over 12,000 high-quality trajectories, we establish a new state-of-the-art (SOTA) among Qwen2.5-Coder-Instruct-based LLMs on the OpenHands agent framework, raising the resolve rate on SWE-Bench-Verified by +21.8% for the 7B model and +46.0% for the 32B model and validating the effectiveness of our approach.
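The abstract describes a three-stage pipeline: distill a real issue's essence, mirror it into a repository with a working Gym environment, and re-animate it as a verifiable task. The Python sketch below shows one plausible way those stages could compose; it is a minimal illustration under stated assumptions, not the paper's implementation. All names (`IssueEssence`, `mirror`, `run_in_gym`, etc.) are hypothetical, and the LLM-driven extraction and the in-container test execution are stubbed out.

```python
from dataclasses import dataclass


@dataclass
class IssueEssence:
    """Repo-agnostic description distilled from a real GitHub issue (hypothetical)."""
    summary: str       # expected vs. observed behavior, per the reporter
    reproduction: str  # minimal steps or snippet that triggers the problem
    acceptance: str    # condition any correct fix must satisfy


@dataclass
class MirroredTask:
    repo: str        # target repository that already has a configured Gym environment
    issue_text: str  # the issue re-animated in terms of the target repo
    test_patch: str  # fail-to-pass tests that make the task verifiable


def distill(issue_body: str) -> IssueEssence:
    """Stage 1: extract the issue's semantic essence (LLM-driven in practice; stubbed here)."""
    return IssueEssence(
        summary=issue_body.splitlines()[0],
        reproduction="(extracted reproduction steps)",
        acceptance="(extracted expected behavior)",
    )


def mirror(essence: IssueEssence, target_repo: str) -> MirroredTask:
    """Stage 2: transplant the issue's intent into a repo we can already execute."""
    issue_text = f"[mirrored] {essence.summary}\n\nRepro: {essence.reproduction}"
    test_patch = f"# fail-to-pass test asserting: {essence.acceptance}"
    return MirroredTask(repo=target_repo, issue_text=issue_text, test_patch=test_patch)


def run_in_gym(repo: str, test_patch: str, with_fix: bool) -> bool:
    """Stand-in for running the test patch inside the repo's Gym container (stubbed)."""
    return with_fix  # stub: tests fail without a fix applied, pass with it


def verify(task: MirroredTask) -> bool:
    """Stage 3: keep only tasks whose tests fail before the fix and pass after it."""
    return (not run_in_gym(task.repo, task.test_patch, with_fix=False)
            and run_in_gym(task.repo, task.test_patch, with_fix=True))


if __name__ == "__main__":
    task = mirror(distill("Parser crashes on empty input\n..."), "example/target-repo")
    print(task.issue_text, "| verified:", verify(task))
```

The fail-then-pass check in `verify` is what makes each mirrored task execution-verifiable: a task is only kept if its tests demonstrably exercise the mirrored defect.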