Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

📅 2025-03-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
How can open-weight small language models (SLMs) be deployed cost-effectively in private environments while matching the code reasoning capabilities of proprietary large language models? This paper proposes an intra- and inter-stage test-time compute scaling framework. Internally, it introduces context-aware trajectory synthesis and rejection sampling; externally, it integrates development-process-informed reward modeling, execution-feedback-driven search, and dynamic token allocation. Crucially, the method scales inference-time computation without increasing model parameters. On the SWE-bench Verified benchmark, a 32B open-weight model achieves a 46% issue resolution rate, surpassing both DeepSeek-R1 (671B) and OpenAI o1. Empirical analysis demonstrates that the model autonomously scales inference length according to problem difficulty, providing the first empirical validation of effective test-time compute scaling for code reasoning.
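The summary above couples rejection sampling over candidate trajectories with reward-model selection and execution feedback. As a rough sketch of such an external-TTC loop (all function names and interfaces below are assumptions for illustration, not the paper's released API), a best-of-N sampler might look like:

```python
import random

# Hypothetical stand-ins for the framework's components; names are
# illustrative only, not taken from the released code.

def generate_patch(issue: str, temperature: float) -> str:
    """Placeholder: sample one candidate patch trajectory from the SLM."""
    return f"patch for {issue} (t={temperature}, seed={random.random():.3f})"

def reward_score(issue: str, patch: str) -> float:
    """Placeholder: development-process-informed reward model score."""
    return random.random()

def passes_regression_tests(patch: str) -> bool:
    """Placeholder: execution feedback, e.g. running the repo's test suite."""
    return random.random() > 0.5

def best_of_n(issue: str, n: int = 8) -> str | None:
    """Sample n candidate patches, reject those that fail execution
    verification, and return the highest-reward survivor."""
    candidates = [generate_patch(issue, temperature=0.8) for _ in range(n)]
    verified = [p for p in candidates if passes_regression_tests(p)]
    if not verified:
        return None  # caller can retry with a larger sampling budget
    return max(verified, key=lambda p: reward_score(issue, p))

if __name__ == "__main__":
    print(best_of_n("issue #123: off-by-one in pagination"))
```

The design point is that execution feedback acts as a hard filter (the rejection step) while the reward model ranks the survivors; spending more samples, i.e. more test-time compute, raises the chance that a verified, high-reward patch exists.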

📝 Abstract
Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: How can personally deployable open-source LLMs achieve comparable code reasoning performance? To this end, we propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a development-contextualized trajectory synthesis method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel development-process-based search strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical validation of the test-time scaling phenomenon within SWE agents, revealing that models dynamically allocate more tokens to increasingly challenging problems, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research. https://github.com/yingweima2022/SWE-Reasoner
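The abstract's development-process-based search scores intermediate decision points (e.g. fault localization) rather than only the final patch. A minimal stage-wise beam search conveys the idea; the stage names, samplers, and reward interfaces here are placeholders, not the framework's actual components:

```python
from dataclasses import dataclass, field

# Illustrative stage-wise search over the development process, under
# assumed interfaces; the real framework's stages and APIs may differ.

STAGES = ["repo_understanding", "fault_localization", "patch_generation"]

@dataclass
class Trajectory:
    steps: list[str] = field(default_factory=list)
    score: float = 0.0

def sample_step(traj: Trajectory, stage: str, k: int) -> list[str]:
    """Placeholder: k candidate continuations for this stage."""
    return [f"{stage}-candidate-{i}" for i in range(k)]

def stage_reward(traj: Trajectory, step: str, stage: str) -> float:
    """Placeholder: process reward model scoring an intermediate step."""
    return len(step) % 7 / 7.0  # deterministic dummy score

def beam_search(beam_width: int = 2, k: int = 4) -> Trajectory:
    beam = [Trajectory()]
    for stage in STAGES:
        expanded = []
        for traj in beam:
            for step in sample_step(traj, stage, k):
                expanded.append(
                    Trajectory(traj.steps + [step],
                               traj.score + stage_reward(traj, step, stage))
                )
        # Keep the top-scoring partial trajectories at each decision point,
        # rather than verifying only the final patch ("end-point only").
        beam = sorted(expanded, key=lambda t: t.score, reverse=True)[:beam_width]
    return beam[0]

if __name__ == "__main__":
    best = beam_search()
    print(best.steps, round(best.score, 2))
```

Compared with end-point-only verification, scoring each stage lets the search prune unpromising localization steps before any patch is generated, concentrating compute on the critical decision points.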
Problem

Research questions and friction points this paper is trying to address.

Enhancing open-source LLMs for code reasoning performance
Scaling test-time compute instead of using larger models
Improving software engineering agents' multi-stage reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Compute scaling framework
Development-contextualized trajectory synthesis (a rough sketch follows this list)
Development-process-based search strategy
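As a loose illustration of the second item, mining a real issue and its merged fix could be decomposed into staged training signals roughly as follows; the record schema and stage labels are invented for this sketch, not the paper's actual data format:

```python
from dataclasses import dataclass

# Hypothetical sketch of development-contextualized trajectory synthesis
# from one mined issue/merged-fix pair; all field names are assumptions.

@dataclass
class IssueRecord:
    issue_text: str           # the reported problem, in natural language
    changed_files: list[str]  # files touched by the merged fix
    gold_patch: str           # the merged diff

def synthesize_trajectory(rec: IssueRecord) -> list[dict]:
    """Decompose one real issue/fix pair into staged supervision targets,
    mirroring the development process: locate the fault, then patch it."""
    return [
        {"stage": "fault_localization",
         "input": rec.issue_text,
         "target": rec.changed_files},
        {"stage": "patch_generation",
         "input": {"issue": rec.issue_text, "files": rec.changed_files},
         "target": rec.gold_patch},
    ]

if __name__ == "__main__":
    rec = IssueRecord("pagination returns one item too few",
                      ["app/paginator.py"],
                      "--- a/app/paginator.py\n+++ b/app/paginator.py\n...")
    for step in synthesize_trajectory(rec):
        print(step["stage"], "->", step["target"])
```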
Authors

Yingwei Ma (Moonshot AI)
Binhua Li (Tongyi Lab, Alibaba Group)
Yihong Dong (Peking University)
Xue Jiang (Peking University)
Rongyu Cao (Chinese Academy of Sciences)
Jue Chen (Tongyi Lab, Alibaba Group)
Fei Huang (Tongyi Lab, Alibaba Group)
Yongbin Li (Tongyi Lab, Alibaba Group)