🤖 AI Summary
Training and evaluating code repair agents on real-world codebases remains challenging because complex build processes and dynamic dependencies make evaluation unstable. Method: the paper proposes a dual-pipeline training framework: (1) a full-validation pipeline that freezes dependencies and enforces reproducible post-fix build validation for reliability, and (2) a large-scale reinforcement learning (RL) pipeline in a simplified environment, where Qwen3-32B is distilled from GPT-4.1 trajectories via supervised fine-tuning (SFT) and then further optimized with RL. Results: the SFT model matches GPT-4.1's performance at 1/56th the parameter count. RL yields a 7-20% absolute improvement in repair rate within the matched environment but degrades substantially under distribution shift, providing empirical evidence that train-test environmental consistency critically determines generalization. This highlights "environmental alignment" as a fundamental prerequisite for deploying robust code repair agents.
📝 Abstract
We tackle the challenge of training reliable code-fixing agents on real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline, with success defined as post-fix build validation, and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we applied supervised fine-tuning (SFT) to Qwen3-32B in the full pipeline and then RL on top of the SFT model in the simplified environment. The SFT model, distilled from GPT-4.1 trajectories, performs on par with GPT-4.1 while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" performed on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments when building reliable real-world code-fixing agents.
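The abstract's reproducibility step (pinning dependencies so repeated validation runs resolve identical environments) can be sketched with a minimal check that a requirements.txt-style spec list is fully pinned. This is our own illustration, assuming pip-style `pkg==version` syntax; the helper name `unpinned` and the regex are not from the paper.

```python
import re

# A line is considered pinned only if it uses an exact "==" version spec,
# e.g. "requests==2.31.0". Range specs like ">=" allow dependency drift
# between validation runs, which is what the pipeline tries to rule out.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[A-Za-z0-9_.\-]+$")

def unpinned(requirements: list[str]) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    return [
        line.strip()
        for line in requirements
        if line.strip()
        and not line.strip().startswith("#")  # skip comments
        and not PINNED.match(line.strip())
    ]

reqs = ["requests==2.31.0", "numpy>=1.24", "# test deps", "pytest==8.0.0"]
print(unpinned(reqs))  # → ['numpy>=1.24'], a drift risk to fix before validation
```

A check like this would gate the full-validation pipeline: any unpinned spec is flagged before an issue's build is used as a success signal.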