🤖 AI Summary
This work proposes SWE-Replay, a novel and efficient test-time scaling approach tailored for modern software engineering agents capable of generating custom bash scripts. Unlike conventional methods that incur high computational overhead and rely on error-prone external value estimates, SWE-Replay eliminates the need for an additional value model by reusing historical trajectories. It dynamically decides at critical intermediate steps whether to explore from scratch or branch from prior experience, guided by the repository's exploration potential and the reasoning significance of each step. Evaluated on SWE-Bench Verified, the method reduces computational cost by up to 17.4% while improving performance by up to 3.8%. Furthermore, it demonstrates strong generalization across the SWE-Bench Pro and Multilingual benchmarks, highlighting its robustness and adaptability.
📄 Abstract
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.