Shrinking the Generation-Verification Gap with Weak Verifiers

📅 2025-06-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the capability gap between language model generation and verification, this paper proposes Weaver: a framework that uses weak supervision to estimate the accuracy of heterogeneous weak verifiers, then fuses them efficiently through a weighted ensemble with output normalization and verifier filtering. To cut computational cost, it further distills the ensemble's combined scores into a 400M-parameter cross-encoder, reducing reliance on labeled data and large foundation models. Evaluated on mathematical and complex reasoning tasks, Weaver substantially improves selection among multiple candidate responses, reaching 87.7% average accuracy with Llama 3.3 70B Instruct as the generator, on par with o3-mini and mirroring the jump from GPT-4o to o3-mini (69.0% vs. 86.7%). The core contribution is a robust, low-cost strong verifier built without strong supervision, offering scalable, annotation-efficient verification for open-ended generation.

📝 Abstract
Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
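The idea of accuracy-weighted combination can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's algorithm: it estimates each binary verifier's accuracy from its agreement with an unweighted majority vote (a crude weak-supervision baseline; Weaver's estimation is more principled), then scores each candidate with log-odds weights so more accurate verifiers count for more. All function names here are hypothetical.

```python
import math

def estimate_accuracies(votes):
    """Crudely estimate each verifier's accuracy on unlabeled data by
    measuring agreement with the unweighted majority vote.
    votes[c][i] is the binary verdict of verifier i on candidate c."""
    n_verifiers = len(votes[0])
    majority = [1 if sum(v) * 2 >= len(v) else 0 for v in votes]
    acc = []
    for i in range(n_verifiers):
        agree = sum(1 for v, m in zip(votes, majority) if v[i] == m)
        # clip away from 0 and 1 so the log-odds weights stay finite
        acc.append(min(max(agree / len(votes), 0.05), 0.95))
    return acc

def weighted_score(candidate_votes, accuracies):
    """Log-odds-weighted vote: weight log(a/(1-a)) per verifier."""
    return sum(math.log(a / (1 - a)) * (1 if v else -1)
               for v, a in zip(candidate_votes, accuracies))

votes = [
    [1, 1, 0],  # candidate 0
    [1, 0, 0],  # candidate 1
    [1, 1, 1],  # candidate 2
]
acc = estimate_accuracies(votes)
best = max(range(len(votes)), key=lambda c: weighted_score(votes[c], acc))
```

With these toy votes, verifier 1 agrees with the majority everywhere and so dominates the weighting, and candidate 2 (approved by all verifiers) is selected.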
Problem

Research questions and friction points this paper is trying to address.

Bridging performance gap between weak and oracle verifiers
Combining multiple imperfect verifiers without labeled data
Improving response selection accuracy in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines multiple weak verifiers into strong ensembles
Uses weak supervision to estimate verifier accuracy
Normalizes outputs and filters low-quality verifiers
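The normalization and filtering steps above can be illustrated with a small sketch (hypothetical helper names, assuming real-valued verifier scores): z-score each verifier's scores across candidates so different output scales become comparable, drop near-constant verifiers that carry no signal, and average the rest.

```python
import statistics

def normalize(scores):
    """Z-score one verifier's scores across candidates so verifiers
    with different output scales can be averaged."""
    mu = statistics.fmean(scores)
    sd = statistics.pstdev(scores) or 1.0
    return [(s - mu) / sd for s in scores]

def filter_and_combine(verifier_scores, min_spread=1e-6):
    """Drop near-constant (uninformative) verifiers, then average
    the normalized scores of the remaining ones per candidate."""
    kept = [normalize(s) for s in verifier_scores
            if statistics.pstdev(s) > min_spread]
    return [sum(col) / len(kept) for col in zip(*kept)]

scores = [
    [0.9, 0.2, 0.8],    # verifier A: probabilities on a 0-1 scale
    [5.0, 5.0, 5.0],    # verifier B: constant, filtered out
    [12.0, 3.0, 10.0],  # verifier C: reward on a different scale
]
combined = filter_and_combine(scores)
best = max(range(len(combined)), key=combined.__getitem__)
```

Here verifier B is discarded as uninformative, and A and C agree after normalization that candidate 0 scores highest, despite their raw scales differing by an order of magnitude.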