SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes SWE-Replay, a novel and efficient test-time scaling approach tailored for modern software engineering agents capable of generating custom bash scripts. Unlike conventional methods that incur high computational overhead and rely on error-prone external value estimates, SWE-Replay eliminates the need for an additional value model by reusing historical trajectories. It dynamically decides at critical intermediate steps whether to explore from scratch or branch from prior experience, guided by the repository's exploration potential and the reasoning significance of each step. Evaluated on SWE-Bench Verified, the method reduces computational cost by up to 17.4% while maintaining or even improving performance by up to 3.8%. It further demonstrates strong generalization to SWE-Bench Pro and SWE-Bench Multilingual, highlighting its robustness and adaptability.

πŸ“ Abstract
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.
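To make the replay decision described in the summary and abstract more concrete, here is a minimal Python sketch of the idea: at each new trial, either start a fresh trajectory or branch from an intermediate step of an archived one, scored by how much of the repository that step still leaves unexplored and how significant its reasoning appears. The scoring heuristics (`exploration_potential`, `reasoning_significance`), the branching threshold, and all data structures are hypothetical placeholders for illustration, not the paper's actual criteria or implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # e.g. a synthesized bash command
    files_touched: set   # repository files explored up to this step
    reasoning: str       # the agent's reasoning text at this step

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def exploration_potential(step, repo_files):
    # Hypothetical heuristic: fraction of the repository still unexplored
    # after this step (more unexplored code leaves more room to branch usefully).
    return 1.0 - len(step.files_touched) / max(len(repo_files), 1)

def reasoning_significance(step):
    # Hypothetical proxy: treat steps with longer reasoning as more pivotal.
    return min(len(step.reasoning) / 500.0, 1.0)

def choose_start(archive, repo_files, branch_threshold=0.6):
    """Decide whether the next trial starts from scratch or branches from an
    intermediate step of an archived trajectory (a sketch, not the paper's
    exact decision rule)."""
    best_score, best_prefix = 0.0, None
    for traj in archive:
        for i, step in enumerate(traj.steps):
            score = exploration_potential(step, repo_files) * reasoning_significance(step)
            if score > best_score:
                best_score, best_prefix = score, traj.steps[: i + 1]
    if best_prefix is not None and best_score >= branch_threshold:
        return best_prefix   # exploit: replay this prefix, then continue from it
    return []                # explore: start a fresh trajectory

def scale_test_time(run_agent, repo_files, n_trials=4):
    # Illustrative scaling loop: each trial recycles the archive of prior trials.
    archive = []
    for _ in range(n_trials):
        prefix = choose_start(archive, repo_files)
        archive.append(run_agent(prefix))  # agent continues from the chosen prefix
    return archive
```

The key point the sketch tries to convey is that the explore-or-branch choice is driven by properties of the trajectories and the repository themselves, with no separate LLM-based value model in the loop.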
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
software engineering agents
trajectory recycling
computational efficiency
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
trajectory replay
software engineering agents
cost efficiency
generalizable reasoning
πŸ”Ž Similar Papers
No similar papers found.
Yifeng Ding
University of Illinois at Urbana-Champaign
Software engineering · Generative model
Lingming Zhang
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, USA