Scalable Supervising Software Agents with Patch Reasoner

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing test-based patch verification methods suffer from scalability bottlenecks: fragile test-sandbox construction, scarcity of high-coverage test cases, and vulnerability to "test hacking." This paper proposes R4P, a reasoning-driven patch verification model that reformulates verification as a group-level reasoning task over multiple patches, eliminating reliance on hand-crafted test cases and instead producing dense, deception-resistant reinforcement learning reward signals. Built on large language models, R4P supports stable optimization during training and test-time scaling during inference. On SWE-bench-verified, R4P reaches 72.2% verification accuracy while verifying patches roughly 50× faster than traditional testing, and boosts the Mini-SE agent's Pass@1 to 26.2%, a 10-point improvement over the base model; with test-time scaling, performance rises further to 32.8%.
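The test-time-scaling gain (26.2% → 32.8%) comes from sampling several candidate patches and letting the verifier select the most plausible one. A minimal best-of-n selection sketch, assuming a hypothetical `score_patch` scoring interface (a stand-in, not R4P's actual API):

```python
def best_of_n(patches, score_patch):
    """Return the candidate patch the verifier scores highest.

    `score_patch` stands in for a reasoning verifier such as R4P;
    any callable returning a plausibility score works here.
    """
    return max(patches, key=score_patch)

# Toy usage: rank three candidate diffs with patch length as a
# placeholder scorer (a real verifier would reason over the diffs).
candidates = ["fix: off-by-one", "fix: off-by-one + regression guard", "noop"]
print(best_of_n(candidates, score_patch=len))
```

The same selection loop works with any number of samples; the paper's point is that a sub-second verifier makes this loop cheap enough to run at inference time.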

📝 Abstract
While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision limits the potential improvement from data scaling. The reason is twofold: (1) building and running test sandboxes is heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model that provides scalable rewards for training and testing SWE agents via reasoning. We consider patch verification fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modifications and gain a dense reward for stable training. R4P achieves 72.2% accuracy in verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy, along with high efficiency, reflect R4P's practicality.
Problem

Research questions and friction points this paper is trying to address.

Addressing unscalable test-based supervision for software agents
Providing scalable rewards via patch verification reasoning
Overcoming heavy test sandboxes and rare high-coverage data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch verifier model provides scalable rewards via reasoning
Group-wise RL objective verifies multiple patches against each other
R4P verifies patches faster than traditional testing methods
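The group-wise objective can be illustrated with a toy reward computation: each candidate patch in a group is scored relative to its peers, so every sample gets a dense, non-degenerate training signal. A minimal sketch, assuming hypothetical verifier confidences in [0, 1] (placeholder values, not R4P's actual outputs or its real objective):

```python
from statistics import mean

def group_relative_rewards(scores):
    """Toy group-wise reward: judge each patch relative to the other
    candidates in its group rather than against a single pass/fail test.

    `scores` are hypothetical verifier confidences that each candidate
    patch resolves the issue. The group mean acts as a baseline, so a
    patch earns positive reward only if it looks more plausible than
    its peers; this keeps the signal dense even without a perfect
    reference patch.
    """
    baseline = mean(scores)
    return [round(s - baseline, 3) for s in scores]

print(group_relative_rewards([0.9, 0.4, 0.5]))  # → [0.3, -0.2, -0.1]
```

Centering on the group mean is one common way to get such a relative signal; it also makes the reward robust to a verifier that is uniformly over- or under-confident, since a constant offset cancels out.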