SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

📅 2026-04-16

🤖 AI Summary
This work addresses key challenges faced by autonomous software engineering agents in real-world settings, namely difficulties in long-horizon reasoning, inefficient demonstration data, sparse rewards, and high computational overhead, which often lead to token bloat, reward hacking, and policy degradation. To overcome these issues, the authors propose SWE-TRACE, a unified framework that constructs a high-quality, trajectory-based supervised fine-tuning (SFT) dataset via multi-task cascaded distillation. They introduce a rubric-based process reward model (PRM), that is, a PRM grounded in explicit scoring criteria, which is used both during reinforcement learning and for test-time inference optimization, so that training and inference reinforce each other. The framework also incorporates trajectory condensation and heuristic action pruning to dynamically reduce the search space. Evaluated on standard software engineering benchmarks, SWE-TRACE is reported to significantly improve issue-resolution rates while substantially lowering token consumption and inference latency.
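As a rough illustration of how a rubric-grounded process reward could densify a sparse outcome signal, the sketch below scores each intermediate step against weighted rubric criteria and adds a terminal bonus only when the issue is resolved. The criterion names, weights, and the `step_score` and `shaped_rewards` helpers are hypothetical; the summary does not specify the actual rubric or reward formula.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


# Hypothetical rubric: names and weights are illustrative assumptions only.
RUBRIC = (
    Criterion("localizes_relevant_code", 0.5),
    Criterion("edit_is_minimal", 0.3),
    Criterion("avoids_redundant_commands", 0.2),
)


def step_score(judgments: Dict[str, float]) -> float:
    """Weighted average of per-criterion judgments in [0, 1].

    Criteria missing from `judgments` count as 0.
    """
    total_weight = sum(c.weight for c in RUBRIC)
    weighted = sum(c.weight * judgments.get(c.name, 0.0) for c in RUBRIC)
    return weighted / total_weight


def shaped_rewards(
    step_judgments: List[Dict[str, float]],
    resolved: bool,
    outcome_bonus: float = 1.0,
) -> List[float]:
    """Dense per-step rewards, plus a sparse bonus on the last step if resolved."""
    rewards = [step_score(j) for j in step_judgments]
    if rewards and resolved:
        rewards[-1] += outcome_bonus
    return rewards
```

In this toy form, every step receives some feedback even when the final tests fail, which is the intuition behind replacing a purely sparse outcome reward with a process-level one.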

📝 Abstract
Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally prohibitive inference scaling, which collectively exacerbate token bloat, reward hacking, and policy degradation. We present SWE-TRACE (Trajectory Reduction and Agentic Criteria Evaluation), a unified framework optimizing the SWE agent lifecycle across data curation, reinforcement learning (RL), and test-time inference. First, we introduce an LLM multi-task cascading method, utilizing stepwise oracle verification to distill a 60K-instance Supervised Fine-Tuning (SFT) corpus strictly biased toward token-efficient, shortest-path trajectories. Second, to overcome the instability of sparse outcome rewards, we design a Memory-Augmented Agentic RL pipeline featuring a Rubric-Based Process Reward Model (PRM). An auxiliary Rubric-Agent provides dense, fine-grained heuristic feedback on intermediate steps, guiding the model through long-horizon tasks. Finally, we bridge training and inference by repurposing the PRM for heuristic-guided Test-Time Scaling (TTS). By dynamically evaluating and pruning action candidates at each step, SWE-TRACE achieves superior search efficiency without the latency overhead of standard parallel sampling. Extensive experiments on standard SWE benchmarks demonstrate that SWE-TRACE significantly advances the state-of-the-art, maximizing resolution rates while drastically reducing both token consumption and inference latency.
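The step-level pruning idea can be pictured with a minimal sketch: at each step the policy proposes several candidate actions, a scorer standing in for the PRM rates them, and only the best one is executed, so a single trajectory is extended instead of many full rollouts being sampled in parallel. The callables `propose`, `score`, `apply_action`, and `is_done` are hypothetical, and the greedy keep-top-1 rule is an assumption; the abstract does not give the actual search procedure.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[str, ...]  # toy state: the history of actions taken so far


def guided_rollout(
    initial: State,
    propose: Callable[[State], Sequence[str]],
    score: Callable[[State, str], float],
    apply_action: Callable[[State, str], State],
    is_done: Callable[[State], bool],
    max_steps: int = 10,
) -> Tuple[State, int]:
    """Greedy step-level search: keep the top-scored candidate, prune the rest.

    Returns the final state and the number of candidates that were scored.
    """
    state = initial
    scored = 0
    for _ in range(max_steps):
        if is_done(state):
            break
        candidates = list(propose(state))
        if not candidates:
            break
        scored += len(candidates)
        # Only the highest-scoring action is executed; the others are dropped.
        best = max(candidates, key=lambda action: score(state, action))
        state = apply_action(state, best)
    return state, scored


# Toy stand-ins for the policy and the PRM (purely illustrative).
ORDER = ("grep", "edit", "run_tests")


def toy_propose(state: State) -> Sequence[str]:
    return list(ORDER)


def toy_score(state: State, action: str) -> float:
    # Prefer the action that comes next in the fixed ORDER.
    return 1.0 if len(state) < len(ORDER) and action == ORDER[len(state)] else 0.0


final_state, num_scored = guided_rollout(
    initial=(),
    propose=toy_propose,
    score=toy_score,
    apply_action=lambda s, a: s + (a,),
    is_done=lambda s: bool(s) and s[-1] == "run_tests",
)
```

The design trade-off in this sketch is that scoring cost grows with the number of candidates per step, while environment execution happens only once per step.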
Problem

Research questions and friction points this paper is trying to address.

long-horizon reasoning
software engineering agents
sparse rewards
token bloat
policy degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Model
Test-Time Scaling
Long-Horizon Reasoning
Trajectory Distillation
Memory-Augmented RL