Reliable Fine-Grained Evaluation of Natural Language Math Proofs

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can generate natural-language mathematical proofs, but the field lacks reliable, fine-grained automatic evaluation methods, which hinders progress in proof generation and verification. Method: The paper introduces ProofBench, the first expert-annotated benchmark of fine-grained proof ratings on a 0–7 scale, and builds ProofGrader, an evaluation pipeline that combines a strong reasoning LLM backbone with rich context from reference solutions and structured marking schemes, plus a simple ensembling strategy to improve scoring consistency. Contribution/Results: ProofGrader achieves a mean absolute error (MAE) of 0.926 against expert human graders. In best-of-16 proof selection, it closes 78% of the gap between a naive binary evaluator and human-oracle selection, substantially outperforming existing baselines. This work establishes a reproducible, high-fidelity evaluation infrastructure for mathematical proof generation and validation.
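For reference, assuming the standard definition, the reported MAE is the average absolute gap between the evaluator's score $s_i$ and the expert score $h_i$ over the $N$ graded proofs:

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |s_i - h_i|$

so an MAE of 0.926 means ProofGrader's grades deviate from expert grades by less than one point, on average, on the 0–7 scale.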

📝 Abstract
Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
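To make the best-of-$n$ protocol concrete, here is a minimal Python sketch, assuming a hypothetical evaluator object whose score method returns a fine-grained 0-7 grade for a candidate proof; the paper's actual pipeline (reference solutions, marking schemes, ensembling) is not reproduced here.

def best_of_n(evaluator, problem, candidate_proofs):
    # Grade every candidate with the fine-grained evaluator and keep the highest-scoring proof.
    # evaluator.score is a stand-in for whatever grading call the real pipeline exposes.
    scored = [(evaluator.score(problem, proof), proof) for proof in candidate_proofs]
    return max(scored, key=lambda pair: pair[0])[1]

The 78% figure follows directly from the numbers in the abstract: $(4.14 - 2.48) / (4.62 - 2.48) \approx 0.78$, i.e. ProofGrader's selections recover most of the gap between the naive binary evaluator and the human oracle.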
Problem

Research questions and friction points this paper is trying to address.

Evaluating natural language math proofs from large language models
Developing fine-grained scoring methodology for proof quality assessment
Creating expert-annotated benchmark dataset for mathematical proof evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed expert-annotated ProofBench dataset for fine-grained proof ratings
Created ProofGrader evaluator combining strong reasoning LM with rich context
Achieved low MAE through a simple ensembling method and systematic evaluator workflow design (sketched below)
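A minimal sketch of the ensembling idea from the last bullet, assuming it amounts to averaging the scores of several independent judge runs; the paper only calls it a "simple ensembling method", so the exact aggregation is an assumption, and the judge object here is hypothetical.

from statistics import mean

def ensemble_score(judge, proof, context, k=5):
    # Query the (hypothetical) LLM judge k times and average its 0-7 scores;
    # repeated sampling smooths out run-to-run variance in the grades.
    scores = [judge.grade(proof, context) for _ in range(k)]
    return mean(scores)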
Authors

Wenjie Ma (UC Berkeley)
Andrei Cojocaru (UC Berkeley)
Neel Kolhe (Student; interests: Theoretical Computer Science, Algorithms, Knot theory, Reinforcement Learning)
Bradley Louie (UC Berkeley)
Robin Said Sharif (UC Berkeley)
Haihan Zhang (Peking University)
Vincent Zhuang (Google DeepMind)
Matei Zaharia (UC Berkeley and Databricks; interests: Distributed Systems, Machine Learning, Databases, Security)
Sewon Min (UC Berkeley EECS & Allen Institute for AI; interests: Natural Language Processing, Machine Learning)