The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

📅 2025-06-23

🤖 AI Summary

Problem: Progress in LLM proof generation is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. Method: We construct the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated proofs produced by state-of-the-art LLMs, including a substantial number of correct solutions to problems from high-difficulty competitions such as the USAMO and IMO. Using the OPC, we study three questions: the performance gap between natural-language and formal proof generation, the discrepancy between final-answer accuracy and full-proof validity, and the impact of best-of-n selection on proof quality. Contribution/Results: An 8B-parameter model fine-tuned on the OPC performs on par with the best model, Gemini-2.5-Pro, at evaluating proof correctness. The dataset is publicly released and designed for broad downstream use in proof generation research.

📝 Abstract

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
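Question (3) above concerns best-of-n selection: sample several candidate proofs for a problem, score each with a judge, and keep the highest-scoring one. The following is a minimal sketch of that idea, not the paper's implementation. The names `best_of_n` and `toy_score` are hypothetical, and `toy_score` is only a placeholder for a real judge such as an LLM grader or a human rating.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate proof with the highest score.

    `score` is any judge mapping a proof to a number (higher is better).
    On ties, the earliest candidate is returned.
    """
    if not candidates:
        raise ValueError("need at least one candidate proof")
    return max(candidates, key=score)


def toy_score(proof: str) -> float:
    # Hypothetical stand-in for a real judge: counts justification words.
    # It says nothing about whether a proof is actually correct.
    text = proof.lower()
    return float(sum(text.count(word) for word in ("because", "hence", "therefore")))


# Three made-up candidate proofs for the same statement.
candidates = [
    "The claim is obvious.",
    "n is even because n = 2k, hence n^2 = 4k^2 is even.",
    "It follows.",
]
selected = best_of_n(candidates, toy_score)
print(selected)
```

The quality of the selected proof depends entirely on the judge, which is why a reliable proof-correctness evaluator matters for this kind of selection.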

Problem

Research questions and friction points this paper is trying to address.

Lack of a large-scale dataset of human-evaluated, LLM-generated proofs

Performance gap between natural-language and formal proof generation

Discrepancy between final-answer accuracy and full-proof validity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset of over 5,000 human-evaluated, LLM-generated proofs

Includes correct LLM-generated solutions to problems from prestigious math competitions such as the USAMO and IMO

Fine-tuned 8B-parameter model performs on par with Gemini-2.5-Pro at evaluating proof correctness
