The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

📅 2025-06-23
🤖 AI Summary
Problem: The scarcity of large-scale, high-quality datasets of human-evaluated mathematical proofs hinders progress in proof generation by large language models (LLMs). Method: We construct the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated proofs produced by state-of-the-art LLMs, including a substantial number of correct solutions to problems from prestigious competitions such as the USAMO and IMO. Using the OPC, we analyze the performance gap between natural-language and formal proof generation, the discrepancy between final-answer accuracy and full-proof validity, and the impact of best-of-n selection on proof quality. We also fine-tune an 8B-parameter model on the dataset. Contribution/Results: The fine-tuned model performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

📝 Abstract
In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
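Best-of-n selection, mentioned in point (3) above, can be illustrated with a minimal sketch: generate n candidate proofs, score each with a judge, and keep the highest-scoring one. The names `best_of_n` and `toy_score` are hypothetical and do not come from the paper; in particular, `toy_score` is a deliberately trivial stand-in for a real judge, such as a model trained to evaluate proof correctness.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest judge score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate proof")
    return max(candidates, key=score)


def toy_score(proof: str) -> float:
    # Hypothetical placeholder judge, NOT the paper's method: proofs that do
    # not end in "QED" score 0; otherwise the score is the word count.
    if not proof.rstrip().endswith("QED"):
        return 0.0
    return float(len(proof.split()))


# Three made-up candidate "proofs" for illustration only.
proofs = [
    "Assume n is even. Then n^2 is even. QED",
    "Trivial.",
    "By induction on n, the base case holds and the step follows. QED",
]

selected = best_of_n(proofs, toy_score)
print(selected)
```

In practice the quality of the selected proof depends entirely on the judge passed as `score`; swapping in a stronger evaluator changes which candidate is chosen without changing the selection logic.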
Problem

Research questions and friction points this paper is trying to address.

Lack of a large-scale dataset of human-evaluated, LLM-generated proofs
Performance gap between natural-language and formal proof generation
Discrepancy between answer accuracy and proof validity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset of over 5,000 human-evaluated, LLM-generated proofs
Includes correct LLM-generated solutions to problems from prestigious math competitions (USAMO, IMO)
Fine-tuned 8B-parameter model performs on par with Gemini-2.5-Pro at evaluating proof correctness
Jasper Dekoninck
PhD Student, ETH Zurich
large language models, quantum computing, evaluation
Ivo Petrov
PhD student, INSAIT, Sofia University
Gradient Leakage, LLM Reasoning
Kristian Minchev
PhD student, INSAIT, Sofia University
machine learning
Mislav Balunovic
ETH Zurich
Martin Vechev
Full Professor of Computer Science, ETH Zurich; Scientific Director, INSAIT, Sofia University
Programming Languages, Machine Learning, Security
Miroslav Marinov
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
Maria Drencheva
INSAIT, Sofia University "St. Kliment Ohridski"
Lyuba Konova
Sofia University "St. Kliment Ohridski"
Milen Shumanov
INSAIT, Sofia University "St. Kliment Ohridski"
Kaloyan Tsvetkov
INSAIT, Sofia University "St. Kliment Ohridski"
Nikolay Drenchev
INSAIT, Sofia University "St. Kliment Ohridski"
Lazar Todorov
INSAIT, Sofia University "St. Kliment Ohridski"
Kalina Nikolova
INSAIT, Sofia University "St. Kliment Ohridski"
Nikolay Georgiev
INSAIT, Sofia University "St. Kliment Ohridski"
Vanesa Kalinkova
INSAIT, Sofia University "St. Kliment Ohridski"
Margulan Ismoldayev
Massachusetts Institute of Technology