🤖 AI Summary
Large language models (LLMs) exhibit weak out-of-box self-verification capabilities, limiting their reliability on reasoning tasks. Method: The paper systematically studies the scaling trends of sampling-based search at inference time and finds an implicit scaling effect: drawing a larger pool of random samples in turn improves verification accuracy. It identifies two principled ways to strengthen self-verification with test-time compute: (1) comparing across candidate responses, which provides useful signals about where errors and hallucinations occur, and (2) adapting output style to the task, since chains of thought aid reasoning but are harder to verify than concise outputs. Crucially, the approach requires no additional training or external tools, relying solely on test-time computation. Contribution/Results: Simply scaling a minimalist implementation (random sampling plus direct self-verification) lifts Gemini v1.5 Pro's reasoning accuracy past that of o1-Preview on popular benchmarks. The paper also introduces V-Bench, a benchmark for measuring LLMs' intrinsic self-verification capability, enabling standardized assessment and future progress in this direction.
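The cross-response comparison principle can be caricatured in a few lines. The sketch below is a deliberate simplification and not the paper's verifier (which inspects responses for errors); it merely shows the underlying intuition that agreement across independently sampled responses is a useful correctness signal. All function names here are hypothetical.

```python
from collections import Counter

def consistency_signal(final_answers):
    """For each distinct final answer, count how many *other* sampled
    responses reached the same answer. Higher agreement is a crude
    proxy for the cross-response error signal described above."""
    counts = Counter(final_answers)
    return {a: counts[a] - 1 for a in counts}

def pick_by_consistency(final_answers):
    """Select the answer that the largest share of the pool agrees on."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, given sampled final answers `["42", "41", "42", "40", "42"]`, the majority answer `"42"` is selected; a real verifier would additionally compare the responses' reasoning to localize where the minority answers went wrong.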
📝 Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
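The minimalist sampling-based search loop described above (sample k candidate responses, score each with a verifier, return the best) can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the paper's implementation: `generate_fn` and `verify_fn` stand in for an LLM sampler and a self-verification call, and the toy functions at the bottom are hypothetical stand-ins so the sketch runs end to end.

```python
import random

def sampling_based_search(generate_fn, verify_fn, k=4, seed=0):
    """Draw k candidate responses via random sampling and return the
    one the verifier scores highest (direct best-of-k selection)."""
    rng = random.Random(seed)
    candidates = [generate_fn(rng) for _ in range(k)]
    scores = [verify_fn(c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]

# Hypothetical stand-ins: the "model" guesses an integer answer and
# the "verifier" rewards closeness to a known target of 42.
def toy_generate(rng):
    return rng.randint(30, 50)

def toy_verify(answer):
    return -abs(answer - 42)
```

In practice both calls would hit the same model (self-verification), and scaling k yields the sustained gains, including the implicit improvement in verification accuracy, that the abstract reports.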