CodeMonkeys: Scaling Test-Time Compute for Software Engineering

📅 2025-01-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited performance of large language models (LLMs) on real-world GitHub issue resolution (SWE-bench). The authors propose a test-time compute scaling approach that combines serial iterative code editing with parallel sampling of many independent trajectories, forming a loop of code editing, execution feedback, and test validation. Candidate edits are selected by voting with model-generated tests followed by a dedicated multi-turn selection trajectory, and relevant codebase context is identified by simply letting the model read every file in the repository. On the SWE-bench Verified benchmark, the standalone system resolves 57.4% of issues under a budget of roughly 2,300 USD; selecting over an ensemble that includes edits from existing top submissions raises this to 66.2%, outperforming the best ensemble member on its own. The results empirically demonstrate the value of combining serial and parallel test-time compute scaling for LLM software engineering.
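The serial/parallel structure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `draft_edit_and_test` and `run_test` are hypothetical stand-ins for the LLM call and sandboxed script execution that the real system performs.

```python
import random

random.seed(0)

# Hypothetical stand-ins for the model call and sandboxed execution;
# the real system prompts an LLM and runs generated scripts on the repo.
def draft_edit_and_test(issue, feedback):
    # Jointly produce a (candidate_edit, test_script) pair; dummy here.
    return f"edit-for-{issue}", f"test-for-{issue}"

def run_test(edit, test_script):
    # Simulate execution feedback: True means the generated test passed.
    return random.random() > 0.5

def sample_trajectory(issue, max_iterations=4):
    """Serial scaling: iterate edit -> run -> feedback within one trajectory."""
    feedback = None
    edit = None
    for _ in range(max_iterations):
        edit, test = draft_edit_and_test(issue, feedback)
        if run_test(edit, test):
            return edit           # stop once the draft's own test passes
        feedback = "test failed"  # execution feedback guides the next draft
    return edit                   # keep the last draft as a candidate anyway

def generate_candidates(issue, num_trajectories=8):
    """Parallel scaling: many independent trajectories per issue."""
    return [sample_trajectory(issue) for _ in range(num_trajectories)]

candidates = generate_candidates("issue-42")
print(len(candidates))  # one candidate edit per trajectory
```

Increasing `max_iterations` scales serial compute, while increasing `num_trajectories` scales parallel compute; the latter also lets up-front costs (such as reading the whole codebase for context) be amortized across all downstream samples.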

📝 Abstract
Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
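The selection step the abstract describes (voting with model-generated tests before a final selection trajectory) can be sketched as follows. The candidate names, test identifiers, and the `passes` table are illustrative assumptions; the real system executes generated test scripts against each candidate edit.

```python
from collections import Counter

# Hypothetical candidate edits and model-generated tests. The `passes`
# set stands in for actually running each test against each candidate.
candidates = ["edit_a", "edit_b", "edit_c"]
generated_tests = ["t1", "t2", "t3", "t4", "t5"]

passes = {
    ("edit_a", "t1"), ("edit_a", "t2"), ("edit_a", "t4"),
    ("edit_b", "t1"), ("edit_b", "t2"), ("edit_b", "t3"), ("edit_b", "t4"),
    ("edit_c", "t2"),
}

def run_on(edit, test):
    # Simulated execution: did this candidate edit pass this generated test?
    return (edit, test) in passes

# Vote: each candidate scores the number of generated tests it passes;
# the top-scoring candidates are forwarded to a final selection step.
scores = Counter({e: sum(run_on(e, t) for t in generated_tests)
                  for e in candidates})
top_candidates = [e for e, _ in scores.most_common(2)]
print(top_candidates)  # → ['edit_b', 'edit_a']
```

In the paper's pipeline, the survivors of this vote are then compared by a dedicated multi-turn selection trajectory rather than accepted outright; the same mechanism is what allows selecting over an ensemble of edits from other SWE-bench submissions.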
Problem

Research questions and friction points this paper is trying to address.

Software Engineering
Large Language Models
GitHub Issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

CodeMonkeys system
Iterative code modification
Parallel processing