Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

📅 2024-07-31
🏛️ arXiv.org
📈 Citations: 58
Influential: 3
🤖 AI Summary
This work studies repeated sampling as an additional axis for scaling inference-time compute in language models. Many candidate solutions are sampled from the model; in domains such as coding and formal proofs, correct ones are identified with automatic verifiers (code execution or proof checking), while in domains without verifiers, an answer is selected by majority voting or reward-model ranking. The authors observe that coverage (the fraction of problems solved by any generated sample) grows log-linearly with the number of samples and can be modelled with an exponentiated power law, suggesting inference-time scaling laws. On SWE-bench Lite, DeepSeek-Coder-V2-Instruct solves 56% of issues with 250 samples, up from 15.9% with a single sample, surpassing the single-sample state of the art (43%). This demonstrates both the effectiveness and the scalability of the approach.
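The summary mentions selecting one answer from a collection of samples by majority voting. A minimal sketch of that selection step (the function name and the example answers are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent candidate answer among repeated samples.

    Illustrative helper: the paper evaluates majority voting as one way
    to pick from a sample collection when no automatic verifier exists.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled answers to the same problem; "42" wins the vote.
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

As the abstract notes, this kind of selection plateaus beyond several hundred samples, so it does not fully convert the coverage gains of repeated sampling into final accuracy.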

📝 Abstract
Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
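The abstract describes the coverage-versus-samples curve as log-linear and modelled by an exponentiated power law. A minimal sketch of one plausible parameterization, c(k) = exp(a·k^b); both the functional form as written here and the parameter values are illustrative assumptions, not the paper's fitted model:

```python
import math

def coverage(k, a=-2.0, b=-0.3):
    """Illustrative exponentiated power law c(k) = exp(a * k**b).

    With a < 0 and b < 0, coverage rises monotonically with the sample
    count k and approaches 1; the parameters a and b are made up here,
    whereas the paper fits them per model and task.
    """
    return math.exp(a * k ** b)

# Coverage grows as the sample budget increases, e.g. from k=1 to k=250.
for k in (1, 10, 250):
    print(k, round(coverage(k), 3))
```

Plotting log(coverage) against log(k) for such a curve yields the roughly linear trend the abstract refers to over several orders of magnitude in k.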
Problem

Research questions and friction points this paper is trying to address.

Language Model
Efficiency
Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling Amplification
Log-linear Relationship
Problem-solving Enhancement