Reliable Fine-Grained Evaluation of Natural Language Math Proofs

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) can generate natural-language mathematical proofs, but the field lacks reliable, fine-grained automatic evaluation methods, which hinders progress in proof generation and verification. Method: The paper introduces ProofBench, the first expert-annotated benchmark of fine-grained proof ratings on a 0–7 scale, and builds ProofGrader, an evaluation pipeline that combines a strong reasoning LLM backbone with rich context from reference solutions and structured marking schemes, plus a simple ensembling strategy to improve scoring consistency. Contribution/Results: ProofGrader achieves a mean absolute error (MAE) of 0.926 against expert human graders. In best-of-16 proof selection, it closes 78% of the gap between a naive binary evaluator and human-oracle selection, substantially outperforming existing baselines. This work establishes a reproducible, high-fidelity evaluation infrastructure for mathematical proof generation and validation.
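For reference, assuming the standard definition, the reported MAE is the average absolute gap between the evaluator's score $s_i$ and the expert score $h_i$ over the $N$ graded proofs:

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |s_i - h_i|$

so an MAE of 0.926 means ProofGrader's grades deviate from expert grades by less than one point, on average, on the 0–7 scale.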

📝 Abstract
Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
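To make the best-of-$n$ protocol concrete, here is a minimal Python sketch, assuming a hypothetical evaluator object whose score method returns a fine-grained 0-7 grade for a candidate proof; the paper's actual pipeline (reference solutions, marking schemes, ensembling) is not reproduced here.

def best_of_n(evaluator, problem, candidate_proofs):
    # Grade every candidate with the fine-grained evaluator and keep the highest-scoring proof.
    # evaluator.score is a stand-in for whatever grading call the real pipeline exposes.
    scored = [(evaluator.score(problem, proof), proof) for proof in candidate_proofs]
    return max(scored, key=lambda pair: pair[0])[1]

The 78% figure follows directly from the numbers in the abstract: $(4.14 - 2.48) / (4.62 - 2.48) \approx 0.78$, i.e. ProofGrader's selections recover most of the gap between the naive binary evaluator and the human oracle.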
Problem

Research questions and friction points this paper is trying to address.

Evaluating natural language math proofs from large language models
Developing fine-grained scoring methodology for proof quality assessment
Creating expert-annotated benchmark dataset for mathematical proof evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed expert-annotated ProofBench dataset for fine-grained proof ratings
Created ProofGrader evaluator combining strong reasoning LM with rich context
Achieved low MAE through a simple ensembling method and systematic evaluator workflow design (sketched below)
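A minimal sketch of the ensembling idea from the last bullet, assuming it amounts to averaging the scores of several independent judge runs; the paper only calls it a "simple ensembling method", so the exact aggregation is an assumption, and the judge object here is hypothetical.

from statistics import mean

def ensemble_score(judge, proof, context, k=5):
    # Query the (hypothetical) LLM judge k times and average its 0-7 scores;
    # repeated sampling smooths out run-to-run variance in the grades.
    scores = [judge.grade(proof, context) for _ in range(k)]
    return mean(scores)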
Authors

Wenjie Ma (UC Berkeley)
Andrei Cojocaru (UC Berkeley)
Neel Kolhe (Student; interests: Theoretical Computer Science, Algorithms, Knot theory, Reinforcement Learning)
Bradley Louie (UC Berkeley)
Robin Said Sharif (UC Berkeley)
Haihan Zhang (Peking University)
Vincent Zhuang (Google DeepMind)
Matei Zaharia (UC Berkeley and Databricks; interests: Distributed Systems, Machine Learning, Databases, Security)
Sewon Min (UC Berkeley EECS & Allen Institute for AI; interests: Natural Language Processing, Machine Learning)