🤖 AI Summary
Open-source large language models face two critical bottlenecks in solving real-world GitHub issues: the difficulty of constructing execution environments at scale, and the challenge of scaling test-time compute effectively.
Method: This paper introduces AgentGym, the largest open-source training and evaluation framework for real-world SWE agents, comprising over 8,700 executable SWE-bench-style tasks. It proposes SYNGEN, a synthetic data paradigm that automatically constructs executable environments directly from code commits, and a hybrid test-time scaling mechanism that integrates execution-based and execution-free verifiers to overcome the performance ceiling of either verification strategy alone.
Contribution/Results: Starting from a 32B open-source model, the authors combine synthetic environment generation, automated test-case synthesis, back-translation, and multi-strategy verifier integration, followed by fine-tuning and inference-time optimization. The approach reaches 34.4% pass@1 on SWE-Bench Verified from the fine-tuned model alone and 51% with hybrid test-time scaling, the highest reported result for open-source SWE agents and competitive with closed-source tool-augmented models (e.g., o1, Sonnet-3.5-v2). All environments, models, and trajectory data are fully open-sourced.
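To make the SYNGEN idea concrete, the sketch below shows the shape of a commit-to-task pipeline: instead of requiring a human-written GitHub issue and test suite, each code commit is turned into an executable task by synthesizing regression tests and back-translating the commit into an issue-style problem statement. All names (`Commit`, `Task`, `synthesize_tests`, `back_translate`) are illustrative stand-ins, not the paper's actual API; in the real system an LLM performs the test synthesis and back-translation steps that are stubbed out here.

```python
from dataclasses import dataclass

# Hypothetical sketch of a SYNGEN-style pipeline; names and logic are
# illustrative, not taken from the paper's implementation.

@dataclass
class Commit:
    repo: str
    sha: str
    message: str
    diff: str

@dataclass
class Task:
    issue: str       # natural-language problem statement (back-translated)
    tests: list      # synthesized tests the gold patch must pass
    env_sha: str     # commit identifying the executable environment

def synthesize_tests(commit: Commit) -> list:
    # In the real system an LLM proposes tests that fail before the commit
    # and pass after it; here we emit a placeholder snippet.
    return [f"# regression test derived from {commit.sha[:7]}"]

def back_translate(commit: Commit) -> str:
    # Back-translation: recover an issue-style description from the commit
    # itself, removing the dependence on a human-written issue.
    return f"Bug report (reconstructed): {commit.message}"

def commit_to_task(commit: Commit) -> Task:
    return Task(
        issue=back_translate(commit),
        tests=synthesize_tests(commit),
        # In the real pipeline the parent (pre-fix) commit would define
        # the buggy environment; this sketch just records the fix commit.
        env_sha=commit.sha,
    )

task = commit_to_task(
    Commit("demo/repo", "abc1234def", "Fix off-by-one in pager", "...")
)
print(task.issue)  # → Bug report (reconstructed): Fix off-by-one in pager
```

The key design point this sketch highlights is that every field of a training task can be derived procedurally from version-control history, which is what makes the 8.7K-task scale feasible.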
📝 Abstract
Improving open-source models on real-world SWE tasks (solving GitHub issues) faces two key challenges: 1) scalable curation of execution environments to train these models, and 2) optimal scaling of test-time compute. We introduce AgentGym, the largest procedurally curated executable gym environment for training real-world SWE agents, consisting of more than 8.7K tasks. AgentGym is powered by two main contributions: 1) SYNGEN: a synthetic data curation recipe that enables scalable curation of executable environments using test generation and back-translation directly from commits, thereby reducing reliance on human-written issues or unit tests. We show that this enables more scalable training, leading to pass@1 performance of 34.4% on the SWE-Bench Verified benchmark with our 32B model. 2) Hybrid Test-time Scaling: we provide an in-depth analysis of two test-time scaling axes, execution-based and execution-free verifiers, demonstrating that they exhibit complementary strengths and limitations. Execution-based verifiers suffer from low distinguishability, while execution-free verifiers are biased and often rely on stylistic features. Surprisingly, we find that while each approach individually saturates around 42-43%, significantly higher gains can be obtained by leveraging their complementary strengths. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, setting a new state-of-the-art for open-weight SWE agents and, for the first time, showing performance competitive with proprietary models such as o1, o1-preview, and Sonnet-3.5-v2 (with tools). We will open-source our environments, models, and agent trajectories.
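The hybrid test-time scaling idea can be sketched as a reranking rule over candidate patches: an execution-based signal (fraction of synthesized tests a patch passes) is blended with an execution-free verifier score, so each signal breaks the other's ties. This is a minimal illustration under assumed conventions; the function names, candidate tuples, and the 0.5/0.5 weighting are hypothetical, not the paper's exact recipe.

```python
# Hypothetical sketch of hybrid test-time scaling: rerank candidate
# patches by blending an execution-based signal with an execution-free
# verifier score. Weights and names are illustrative assumptions.

def hybrid_score(test_pass_rate: float, verifier_score: float,
                 w_exec: float = 0.5) -> float:
    # Execution results alone have low distinguishability (many candidates
    # tie at 100% pass rate), while learned verifiers are biased toward
    # stylistic features; a weighted blend combines their strengths.
    return w_exec * test_pass_rate + (1.0 - w_exec) * verifier_score

def select_patch(candidates):
    # candidates: list of (patch_id, test_pass_rate, verifier_score)
    return max(candidates, key=lambda c: hybrid_score(c[1], c[2]))[0]

candidates = [
    ("patch_a", 1.0, 0.20),   # passes all tests, but verifier dislikes it
    ("patch_b", 1.0, 0.90),   # passes all tests and verifier approves
    ("patch_c", 0.5, 0.95),   # stylistically clean, yet fails half the tests
]
print(select_patch(candidates))  # → patch_b
```

Note how `patch_a` and `patch_b` tie on execution alone and `patch_c` wins on the verifier alone; only the blended score selects `patch_b`, mirroring the abstract's observation that the combined approach exceeds the 42-43% ceiling of either signal in isolation.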