🤖 AI Summary
To address data scarcity, inefficient evaluation, and high annotation/storage costs in training large language models (LLMs) for software engineering (SWE), this paper proposes an end-to-end automated agent training framework. Methodologically: (1) it introduces fully automated code generation coupled with SPICE-based difficulty annotation, reducing annotation cost by 19,000×; (2) it designs a bubble-free reinforcement learning (RL) framework enabling efficient alignment of small models; and (3) it integrates a lightweight sandbox, Ray-based distributed evaluation, intelligent dependency management, and optimized supervised fine-tuning (SFT)/RL pipelines. The core contribution is the open-sourced RepoForge-8B-Agent, which achieves 17.4% accuracy on SWE-Bench-Verified—the highest among non-reasoning LLMs ≤8B parameters. Additionally, it attains 14× storage compression per evaluation environment and a >70% evaluation speedup.
📝 Abstract
Training software engineering (SWE) LLMs is bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control. We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale. Our key contributions include: (1) RepoForge-8B-Agent, achieving 17.4% on SWE-Bench-Verified [swebench_verified2024], establishing a new state of the art for ≤8B non-thinking LLMs; (2) 7,304 executable environments auto-generated from real GitHub commits with zero manual intervention; (3) 14× storage reduction (1.4GB → 102MB per instance) via intelligent dependency management and image pruning; (4) >70% faster evaluation using a Ray-powered [ray2018] distributed RepoForge harness; (5) 19,000× cheaper labeling through our automated SPICE [spice2024] difficulty assessment technique. By unifying storage-efficient sandboxing, a Ray-powered evaluation harness, automated data generation, SPICE-based labeling, and a bubble-free RL scaffold, we demonstrate that even ≤8B models can reach new state-of-the-art performance on demanding benchmarks like SWE-Bench-Verified. Our approach addresses critical bottlenecks in SWE agent training: high storage costs of container-based evaluation, inefficient sequential reward pipelines, limited availability of high-quality training data, expensive manual labeling, and multi-turn RL pipeline stalls.
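The "inefficient sequential reward pipelines" the abstract targets are addressed by fanning reward evaluations out across sandboxes in parallel and gathering results as they complete. A minimal sketch of that fan-out/fan-in pattern, using the standard-library `concurrent.futures` in place of Ray and a hypothetical `evaluate_instance` placeholder (the real harness would apply the agent's patch and run the repository's test suite inside a sandbox), might look like:

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_instance(instance_id: str) -> dict:
    # Placeholder for one sandboxed evaluation: apply the agent's patch,
    # run the repo's tests, and report whether the issue was resolved.
    # Here we fake the outcome deterministically from the instance id.
    issue_number = int(instance_id.rsplit("-", 1)[-1])
    return {"instance_id": instance_id, "resolved": issue_number % 2 == 0}


def evaluate_batch(instance_ids: list[str], max_workers: int = 8) -> list[dict]:
    # Fan evaluations out to a worker pool instead of running the reward
    # pipeline one instance at a time; results are collected in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_instance, instance_ids))


results = evaluate_batch([f"repo__issue-{i}" for i in range(16)])
resolved_rate = sum(r["resolved"] for r in results) / len(results)
```

With a real evaluation function, wall-clock time is bounded by the slowest batch of sandboxes rather than the sum of all runs, which is where the reported >70% speedup would come from; Ray generalizes the same pattern across machines rather than threads.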