🤖 AI Summary
Large language models (LLMs) exhibit weak out-of-box self-verification capabilities, limiting their reliability on reasoning tasks. Method: The paper systematically studies the scaling trends of sampling-based search at inference time and finds an implicit scaling effect: drawing a larger pool of random samples in turn improves verification accuracy. It identifies two principled ways to strengthen self-verification with test-time compute: (1) comparing across candidate responses, which provides useful signals about where errors and hallucinations occur, and (2) adapting output style to the task, since chains of thought aid reasoning but are harder to verify than concise outputs. Crucially, the approach requires no additional training or external tools, relying solely on test-time computation. Contribution/Results: Simply scaling a minimalist implementation (random sampling plus direct self-verification) lifts Gemini v1.5 Pro's reasoning accuracy past that of o1-Preview on popular benchmarks. The paper also introduces V-Bench, a benchmark for measuring LLMs' intrinsic self-verification capability, enabling standardized assessment and future progress in this direction.
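The cross-response comparison principle can be caricatured in a few lines. The sketch below is a deliberate simplification and not the paper's verifier (which inspects responses for errors); it merely shows the underlying intuition that agreement across independently sampled responses is a useful correctness signal. All function names here are hypothetical.

```python
from collections import Counter

def consistency_signal(final_answers):
    """For each distinct final answer, count how many *other* sampled
    responses reached the same answer. Higher agreement is a crude
    proxy for the cross-response error signal described above."""
    counts = Counter(final_answers)
    return {a: counts[a] - 1 for a in counts}

def pick_by_consistency(final_answers):
    """Select the answer that the largest share of the pool agrees on."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, given sampled final answers `["42", "41", "42", "40", "42"]`, the majority answer `"42"` is selected; a real verifier would additionally compare the responses' reasoning to localize where the minority answers went wrong.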
📝 Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
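The minimalist sampling-based search loop described above (sample k candidate responses, score each with a verifier, return the best) can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the paper's implementation: `generate_fn` and `verify_fn` stand in for an LLM sampler and a self-verification call, and the toy functions at the bottom are hypothetical stand-ins so the sketch runs end to end.

```python
import random

def sampling_based_search(generate_fn, verify_fn, k=4, seed=0):
    """Draw k candidate responses via random sampling and return the
    one the verifier scores highest (direct best-of-k selection)."""
    rng = random.Random(seed)
    candidates = [generate_fn(rng) for _ in range(k)]
    scores = [verify_fn(c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]

# Hypothetical stand-ins: the "model" guesses an integer answer and
# the "verifier" rewards closeness to a known target of 42.
def toy_generate(rng):
    return rng.randint(30, 50)

def toy_verify(answer):
    return -abs(answer - 42)
```

In practice both calls would hit the same model (self-verification), and scaling k yields the sustained gains, including the implicit improvement in verification accuracy, that the abstract reports.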