Do We Need Frontier Models to Verify Mathematical Proofs?

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study asks whether verifying natural language mathematical proofs necessarily requires state-of-the-art large language models (LLMs). The authors systematically evaluate the accuracy and self-consistency of open-source and frontier LLMs on competition-level proof verification tasks. Using multi-model comparisons, self-consistency metrics, and an LLM-guided prompt search, they propose a specialized prompt ensembling method that substantially improves smaller models. With optimized prompts, smaller models such as Qwen3.5-35B gain up to 9.1% in accuracy and 15.9% in self-consistency, matching the verification performance of frontier models like Gemini 3.1 Pro. These findings suggest that smaller models possess underappreciated potential for high-level mathematical reasoning tasks.
📝 Abstract
Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
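The abstract defines the two metrics the paper reports: verifier accuracy and self-consistency, the rate of agreement across repeated judgments on the same proof. A minimal sketch of how these could be computed per proof (the function name and the majority-vote aggregation are illustrative assumptions, not the paper's exact protocol):

```python
from collections import Counter

def verifier_metrics(judgments, label):
    """Summarize repeated LLM-judge verdicts on a single proof.

    judgments: list of bool verdicts ("proof is valid") from repeated
               runs of the same verifier on the same proof.
    label:     human-graded ground truth for that proof.
    """
    majority, count = Counter(judgments).most_common(1)[0]
    self_consistency = count / len(judgments)  # agreement rate across runs
    accuracy = majority == label               # majority verdict vs. ground truth
    return majority, self_consistency, accuracy

# Five repeated judgments on a proof that human graders marked invalid:
maj, sc, acc = verifier_metrics([False, False, True, False, False], label=False)
# maj == False, sc == 0.8, acc == True
```

Dataset-level accuracy would then average the per-proof `accuracy` flag over all graded proofs; the paper's prompt ensemble aggregates verdicts from several specialized prompts rather than repeated runs of one prompt.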
Problem

Research questions and friction points this paper is trying to address.

proof verification
large language models
model consistency
mathematical reasoning
verifier accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

proof verification
prompt engineering
self-consistency
mathematical reasoning
LLM evaluation