🤖 AI Summary
This work addresses the shortage of high-quality, contamination-free formal benchmarks for research-level mathematics, which has made it difficult to evaluate automated reasoning systems on genuine mathematical discovery. The authors introduce Formal Conjectures, a continuously evolving benchmark of 2,615 problem statements formalized in Lean 4, including 1,029 open conjectures for evaluating proof discovery and 836 solved problems for proof autoformalization. The project combines community contributions, frozen evaluation subsets with a standardized setup, and the use of AI-generated proofs and disproofs to audit the formalizations. The benchmark has already been used to make new mathematical discoveries, including the resolution of open research conjectures, and the reported baseline results indicate where the current frontier of automated reasoning on research-level mathematics lies.
📝 Abstract
As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems that can accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark that currently comprises 2,615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1,029 open research conjectures, which provide a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. The repository also provides a structured interface connecting the mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been used to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project whose contributions come from an active community. In this framework, AI-generated proofs and disproofs serve as an auditing mechanism that iteratively improves the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.