When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

📅 2025-12-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
The effectiveness of large language models (LLMs) as verifiers of reasoning outputs remains poorly understood, particularly regarding model- and task-specific dependencies. Method: We conduct a large-scale empirical study across 9 benchmarks and 37 models spanning diverse architectures, parameter scales, and post-training stages, introducing the “verifier gain” metric to quantify the performance improvement from verification and analyzing its correlation with task verifiability. Contribution/Results: (1) Cross-family verification consistently outperforms self-verification; (2) mathematical and logical reasoning tasks exhibit the highest verifiability; (3) post-training diminishes self-verification capability but enhances correction of external solutions; (4) rejection sampling combined with cross-family verification robustly improves reasoning accuracy. This work provides the first systematic characterization of how verification efficacy depends on both model architecture and task semantics, establishing empirical foundations and practical guidelines for building reliable LLM-based reasoning systems.

📝 Abstract
Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
Problem

Research questions and friction points this paper is trying to address.

Investigates when LLM verification improves solution quality
Compares self-, same-family, and cross-family verification effectiveness
Analyzes impact of model size and post-training on verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-family verification improves solution selection
Verifier gain metric predicts performance from rejection sampling
Post-training reduces self-verification but aids cross-family verification
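The core test-time procedure the paper evaluates, verifier-based rejection sampling, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `solve` and `verify` are hypothetical placeholders for a solver model and a verifier model, and the `verifier_gain` helper simply compares accuracy with and without the verifier, in the spirit of the paper's metric.

```python
import random

def rejection_sample(solve, verify, problem, n_candidates=8):
    """Best-of-n with a verifier: sample candidates from the solver,
    keep those the verifier accepts, and return one accepted answer
    (falling back to an arbitrary candidate if none pass)."""
    candidates = [solve(problem) for _ in range(n_candidates)]
    accepted = [c for c in candidates if verify(problem, c)]
    return random.choice(accepted) if accepted else random.choice(candidates)

def verifier_gain(solve, verify, problems, answers, n_candidates=8):
    """Accuracy with verifier-based rejection sampling minus
    solver-only accuracy (one sample per problem)."""
    base = sum(solve(p) == a for p, a in zip(problems, answers))
    verified = sum(
        rejection_sample(solve, verify, p, n_candidates) == a
        for p, a in zip(problems, answers)
    )
    n = len(problems)
    return verified / n - base / n
```

With a perfect verifier and a solver that is only sometimes correct, the gain is positive; a verifier with a high false positive rate (a quantity the paper also tracks) drives the gain toward zero.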
Jack Lu
New York University
Machine Learning · Deep Learning · Generative Modeling
R. Teehan
Agentic Learning AI Lab, New York University
Jinran Jin
Agentic Learning AI Lab, New York University
Mengye Ren
NYU
Machine Learning · Computer Vision · Artificial Intelligence