Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards

📅 2025-06-13
🤖 AI Summary
Problem: Reinforcement Learning from Verifiable Rewards (RLVR) works poorly for software engineering agents, because multi-step tasks produce high failure rates and therefore very sparse rewards. Method: This paper proposes Agent-RLVR, an RLVR framework that pairs unit-test verification with “agent guidance”, a mechanism inspired by human pedagogy that steers the agent with cues ranging from high-level strategic plans to dynamic feedback on its errors and environment interactions. The agent first attempts each task; the resulting trajectory is validated by unit tests and supplemented with guidance; the agent then reattempts with that guidance, and the policy is updated from the rewards of the guided trajectories. Contribution/Results: On SWE-Bench Verified, Agent-RLVR raises the pass@1 of Qwen-2.5-72B-Instruct from 9.4% to 22.4%. Using the guidance-augmented RLVR data to also train a test-time reward model lifts pass@1 further to 27.8%, which is 18.4 percentage points above the 9.4% baseline.

📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has been widely adopted as the de facto method for enhancing the reasoning capabilities of large language models and has demonstrated notable success in verifiable domains like math and competitive programming tasks. However, the efficacy of RLVR diminishes significantly when applied to agentic environments. These settings, characterized by multi-step, complex problem solving, lead to high failure rates even for frontier LLMs, as the reward landscape is too sparse for effective model training via conventional RLVR. In this work, we introduce Agent-RLVR, a framework that makes RLVR effective in challenging agentic settings, with an initial focus on software engineering tasks. Inspired by human pedagogy, Agent-RLVR introduces agent guidance, a mechanism that actively steers the agent towards successful trajectories by leveraging diverse informational cues. These cues, ranging from high-level strategic plans to dynamic feedback on the agent's errors and environmental interactions, emulate a teacher's guidance, enabling the agent to navigate difficult solution spaces and promoting active self-improvement via additional environment exploration. In the Agent-RLVR training loop, agents first attempt to solve tasks to produce initial trajectories, which are then validated by unit tests and supplemented with agent guidance. Agents then reattempt with guidance, and the agent policy is updated with RLVR based on the rewards of these guided trajectories. Agent-RLVR elevates the pass@1 performance of Qwen-2.5-72B-Instruct from 9.4% to 22.4% on SWE-Bench Verified. We find that our guidance-augmented RLVR data is additionally useful for test-time reward model training, as shown by a further boost in pass@1 to 27.8%. Agent-RLVR lays the groundwork for training agents with RLVR in complex, real-world environments where conventional RL methods struggle.
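
The attempt, validate, guide, and reattempt steps of the training loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `collect_trajectories`, `attempt`, `run_unit_tests`, and `make_guidance` are all hypothetical, the reward is assumed to be binary (1.0 when the unit tests pass), and only failed first attempts are assumed to be retried with guidance. The RLVR policy update itself is left out, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    task: str
    patch: str
    guided: bool   # True if this came from the guided reattempt
    reward: float  # assumed binary: 1.0 if the unit tests passed, else 0.0


def collect_trajectories(
    tasks: List[str],
    attempt: Callable[[str, Optional[str]], str],
    run_unit_tests: Callable[[str, str], bool],
    make_guidance: Callable[[str, str], str],
) -> List[Trajectory]:
    """Collect one batch of trajectories that a policy update could consume."""
    batch = []
    for task in tasks:
        # Step 1: initial attempt without any guidance.
        patch = attempt(task, None)
        passed = run_unit_tests(task, patch)
        guided = False
        if not passed:
            # Step 2: supplement the failed attempt with guidance and retry
            # (retrying only failures is an assumption of this sketch).
            guidance = make_guidance(task, patch)
            patch = attempt(task, guidance)
            passed = run_unit_tests(task, patch)
            guided = True
        # Step 3: the unit-test outcome becomes the verifiable reward.
        batch.append(Trajectory(task, patch, guided, 1.0 if passed else 0.0))
    return batch


# Toy stand-ins for the agent, the test harness, and the guidance source.
def toy_attempt(task: str, guidance: Optional[str]) -> str:
    return "fix" if task == "easy" or guidance is not None else "bug"


def toy_tests(task: str, patch: str) -> bool:
    return patch == "fix"


def toy_guidance(task: str, patch: str) -> str:
    return f"tests failed for {task}; revisit the patch"


batch = collect_trajectories(["easy", "hard"], toy_attempt, toy_tests, toy_guidance)
```

In this toy run the "hard" task only succeeds on the guided reattempt, so a sparse-reward task still yields a rewarded trajectory for the update step.
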

Problem

Research questions and friction points this paper is trying to address.

Enhancing RLVR for agentic environments with sparse rewards
Improving software engineering agent performance via guided trajectories
Combining human-like guidance and RLVR for complex problem-solving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent guidance steers agents via diverse cues
Combines unit tests with RLVR for training
Enhances performance via guided trajectory rewards
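
As a rough illustration of how diverse cues might be combined into one guidance message, the sketch below assembles a strategic plan, failing-test feedback, and environment observations into a single text block. The function `build_guidance`, its parameters, and the section labels are hypothetical; the paper summary does not describe the actual guidance format.

```python
from typing import List, Optional


def build_guidance(
    plan: Optional[str],
    failed_tests: List[str],
    env_notes: List[str],
) -> str:
    """Join whichever cues are available into one guidance message."""
    sections = []
    if plan:
        # High-level strategic plan.
        sections.append("Plan:\n" + plan)
    if failed_tests:
        # Feedback on the agent's errors, here as failing unit-test names.
        sections.append("Failing tests:\n" + "\n".join(f"- {t}" for t in failed_tests))
    if env_notes:
        # Observations from the agent's environment interactions.
        sections.append("Environment feedback:\n" + "\n".join(f"- {n}" for n in env_notes))
    return "\n\n".join(sections)


message = build_guidance("Check the parser", ["test_a"], [])
```

Cues that are absent are simply skipped, so the same helper can cover plan-only, feedback-only, or combined guidance.
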