🤖 AI Summary
To address data scarcity, inefficient evaluation, and high annotation/storage costs in training large language models (LLMs) for software engineering (SWE), this paper proposes an end-to-end automated agent training framework. Methodologically: (1) it introduces fully automated code generation coupled with SPICE-based difficulty annotation, reducing annotation cost by 19,000×; (2) it designs a bubble-free reinforcement learning (RL) framework enabling efficient alignment of small models; and (3) it integrates a lightweight sandbox, Ray-based distributed evaluation, intelligent dependency management, and optimized supervised fine-tuning (SFT)/RL pipelines. The core contribution is the open-sourced RepoForge-8B-Agent, which achieves 17.4% accuracy on SWE-Bench-Verified—the highest among non-reasoning LLMs ≤8B parameters. Additionally, it attains 14× storage compression per evaluation environment and a >70% evaluation speedup.
📝 Abstract
Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. Our key contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on SWE-Bench-Verified [swebench_verified2024], establishing a new state of the art for ≤8B non-thinking LLMs; (2) 7,304 executable environments auto-generated from real GitHub commits with zero manual intervention; (3) 14× storage reduction (1.4GB → 102MB per instance) via intelligent dependency management and image pruning; (4) >70% faster evaluation using a Ray-powered [ray2018] distributed RepoForge harness; (5) 19,000× cheaper labeling through our automated SPICE [spice2024] difficulty assessment technique. By unifying storage-efficient sandboxing, a Ray-powered evaluation harness, automated data generation, SPICE-based labeling, and a bubble-free RL scaffold, we demonstrate that even ≤8B models can reach new state-of-the-art performance on demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical bottlenecks in SWE agent training: high storage costs of container-based evaluation, inefficient sequential reward pipelines, limited availability of high-quality training data, expensive manual labeling, and multi-turn RL pipeline stalls.
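The "inefficient sequential reward pipelines" the abstract targets are addressed by fanning reward evaluations out across sandboxes in parallel and gathering results as they complete. A minimal sketch of that fan-out/fan-in pattern, using the standard-library `concurrent.futures` in place of Ray and a hypothetical `evaluate_instance` placeholder (the real harness would apply the agent's patch and run the repository's test suite inside a sandbox), might look like:

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_instance(instance_id: str) -> dict:
    # Placeholder for one sandboxed evaluation: apply the agent's patch,
    # run the repo's tests, and report whether the issue was resolved.
    # Here we fake the outcome deterministically from the instance id.
    issue_number = int(instance_id.rsplit("-", 1)[-1])
    return {"instance_id": instance_id, "resolved": issue_number % 2 == 0}


def evaluate_batch(instance_ids: list[str], max_workers: int = 8) -> list[dict]:
    # Fan evaluations out to a worker pool instead of running the reward
    # pipeline one instance at a time; results are collected in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_instance, instance_ids))


results = evaluate_batch([f"repo__issue-{i}" for i in range(16)])
resolved_rate = sum(r["resolved"] for r in results) / len(results)
```

With a real evaluation function, wall-clock time is bounded by the slowest batch of sandboxes rather than the sum of all runs, which is where the reported >70% speedup would come from; Ray generalizes the same pattern across machines rather than threads.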