🤖 AI Summary
The scarcity of large-scale, high-quality, human-annotated mathematical proof datasets hinders the advancement of large language models (LLMs) in proof generation.
Method: We construct the Open Proof Corpus (OPC), a dataset of over 5,000 human-evaluated proofs generated by state-of-the-art LLMs, and the first to include a substantial number of correct LLM-generated solutions to high-difficulty competition problems (e.g., USAMO/IMO). We fine-tune an 8B-parameter model on the dataset and systematically analyze the performance gap between natural-language and formal proof generation.
Contribution/Results: We empirically reveal a substantial performance gap between natural-language and formal proof generation; demonstrate that answer correctness is poorly correlated with full-proof validity; and establish the critical role of best-of-n sampling in improving proof quality. Our fine-tuned model matches Gemini-2.5-Pro on proof correctness classification. The dataset is publicly released, establishing a new benchmark for mathematical reasoning.
📝 Abstract
In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
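To make questions (2) and (3) concrete, here is a minimal Python sketch of best-of-n selection and of comparing final-answer accuracy with full-proof validity. It is an illustration under stated assumptions, not the procedure used for the OPC: the function names `best_of_n`, `answer_vs_proof_rates`, and `toy_judge` are hypothetical, and the keyword-counting judge merely stands in for whatever proof evaluator (human or model) would be used in practice.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(candidates: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the candidate proof that the judge scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate proof")
    return max(candidates, key=judge)


def answer_vs_proof_rates(labels: Sequence[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Given (final_answer_correct, proof_valid) pairs, return both rates."""
    if not labels:
        raise ValueError("need at least one labeled sample")
    n = len(labels)
    answer_rate = sum(a for a, _ in labels) / n
    proof_rate = sum(p for _, p in labels) / n
    return answer_rate, proof_rate


def toy_judge(proof: str) -> float:
    """Hypothetical stand-in judge: counts justification keywords."""
    return float(sum(proof.count(w) for w in ("because", "therefore", "hence")))


if __name__ == "__main__":
    candidates = [
        "The answer is 4.",
        "x=4 because 2x=8, hence x=4.",
        "x=4, therefore done.",
    ]
    print(best_of_n(candidates, toy_judge))

    # Made-up labels: three samples reach the right answer, one has a valid proof.
    labels = [(True, True), (True, False), (True, False), (False, False)]
    print(answer_vs_proof_rates(labels))
```

In this toy data the final-answer rate exceeds the valid-proof rate, which is the kind of discrepancy question (2) refers to; the numbers are invented for illustration and are not results from the OPC.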