🤖 AI Summary
Training and evaluating code repair agents on real-world codebases remains challenging because complex build processes and dynamic dependencies make evaluation unstable. Method: the paper proposes a dual-pipeline training framework: (1) a full-validation pipeline that freezes dependencies and enforces reproducible post-fix build validation for reliability, and (2) a large-scale reinforcement learning (RL) pipeline in a simplified environment, where Qwen3-32B is distilled from GPT-4.1 trajectories via supervised fine-tuning (SFT) and then further optimized with RL. Results: the SFT model matches GPT-4.1's performance at 1/56th the parameter count. RL yields a 7-20% absolute improvement in repair rate within the matched environment but degrades substantially under distribution shift, providing empirical evidence that train-test environmental consistency critically determines generalization. This highlights "environmental alignment" as a fundamental prerequisite for deploying robust code repair agents.
📝 Abstract
We tackle the challenge of training reliable code-fixing agents on real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline, with success defined as post-fix build validation, and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we applied supervised fine-tuning (SFT) to Qwen3-32B in the full pipeline and then RL on top of the SFT model in the simplified environment. The SFT model, distilled from GPT-4.1 trajectories, performs on par with GPT-4.1 while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" performed on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments when building reliable real-world code-fixing agents.
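The abstract's reproducibility step (pinning dependencies so repeated validation runs resolve identical environments) can be sketched with a minimal check that a requirements.txt-style spec list is fully pinned. This is our own illustration, assuming pip-style `pkg==version` syntax; the helper name `unpinned` and the regex are not from the paper.

```python
import re

# A line is considered pinned only if it uses an exact "==" version spec,
# e.g. "requests==2.31.0". Range specs like ">=" allow dependency drift
# between validation runs, which is what the pipeline tries to rule out.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.\-]+$")

def unpinned(requirements: list[str]) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    return [
        line.strip()
        for line in requirements
        if line.strip()
        and not line.strip().startswith("#")  # skip comments
        and not PINNED.match(line.strip())
    ]

reqs = ["requests==2.31.0", "numpy>=1.24", "# test deps", "pytest==8.0.0"]
print(unpinned(reqs))  # → ['numpy>=1.24'], a drift risk to fix before validation
```

A check like this would gate the full-validation pipeline: any unpinned spec is flagged before an issue's build is used as a success signal.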