Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Contemporary large language models (LLMs) often produce correct final answers to mathematical problems while lacking rigor in their reasoning steps; moreover, reliance on single-benchmark evaluation yields misleading performance assessments. Method: We propose a multidimensional evaluation framework that exposes the limitations of unidimensional benchmarks, and we introduce a generative verification approach integrating GenSelect with LLM-as-a-Judge, scaled to millions of tokens of mathematical reasoning. We further analyze the impact of reinforcement learning on prompt sensitivity and answer accuracy. Contribution/Results: We establish a robust proof-verification framework that jointly assesses reasoning-process validity and answer correctness. Empirical analysis reveals that current LLMs prioritize syntactic correctness over semantic mathematical validity. Our work provides a systematic, scalable, and trustworthy methodology for evaluating mathematical proofs, offering concrete guidelines for rigorous, reproducible assessment of LLM-based mathematical reasoning.

๐Ÿ“ Abstract
Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
Problem

Research questions and friction points this paper is trying to address.

Developing reliable proof verification capabilities for rigorous mathematical reasoning
Scaling generative verification methods for solution verification and selection
Addressing sensitivity to prompts in LLM-based mathematical proof evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling generative verification methods to millions of tokens
Combining GenSelect and LLM-as-a-Judge for solution verification
Using reinforcement learning to reduce prompt sensitivity
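The combined pipeline described above can be sketched as: an LLM-as-a-Judge pass verifies each candidate solution, and a GenSelect-style pass then picks the best verified candidate. This is a minimal, hedged illustration only; `judge_solution` is a toy heuristic stand-in for an actual LLM judge call, and all names are hypothetical rather than the paper's implementation.

```python
# Hedged sketch of verification + selection, assuming a judge that
# returns (verdict, score). A real system would prompt an LLM here.
from typing import Callable, List, Tuple


def judge_solution(solution: str) -> Tuple[bool, float]:
    """Toy stand-in for an LLM-as-a-Judge call.

    Verdict: does the solution contain any explicit justification?
    Score: a crude length-plus-justification heuristic.
    """
    has_justification = "because" in solution or "therefore" in solution
    score = len(solution.split()) / 100.0 + (1.0 if has_justification else 0.0)
    return has_justification, score


def select_solution(candidates: List[str],
                    judge: Callable[[str], Tuple[bool, float]]) -> str:
    """GenSelect-style selection over judged candidates.

    Keep only candidates the judge verifies, then return the
    highest-scoring survivor; fall back to the best overall candidate
    if the judge rejects everything.
    """
    judged = [(sol, *judge(sol)) for sol in candidates]
    verified = [(sol, score) for sol, ok, score in judged if ok]
    pool = verified or [(sol, score) for sol, _, score in judged]
    return max(pool, key=lambda pair: pair[1])[0]


candidates = [
    "x = 2",
    "x = 2 because substituting into x + 2 = 4 gives 4 = 4, therefore x = 2.",
]
best = select_solution(candidates, judge_solution)
print(best)
```

Selecting only among judge-verified candidates is what distinguishes this from plain best-of-n scoring: an unjustified answer is filtered out even if it happens to score well.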