🤖 AI Summary
Current large language models (LLMs) show limited practical efficacy on research-level neural theorem proving and proof autoformalization. Method: The paper introduces RLMEval, an evaluation suite built from authentic research scenarios, comprising 613 challenging mathematical theorems drawn from 6 real-world Lean Blueprint formalization projects, with formal verification and automated execution integrated to quantify model performance via pass rate. Contribution/Results: Experiments reveal that the best state-of-the-art model achieves only a 10.3% pass rate, showing that progress on existing curated benchmarks does not readily transfer to realistic research-level theorem proving and proof autoformalization. RLMEval thus establishes a more realistic, challenging evaluation standard for automated formal mathematics and highlights concrete bottlenecks to guide future model improvement.
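As a rough illustration of how such a pass-rate metric can be computed, the sketch below counts a theorem as solved when its candidate proof file compiles inside the host Lean project without `sorry`. This is not the RLMEval harness: the helper names, the `lake env lean` invocation, and the timeout are assumptions for illustration only.

```python
import subprocess
from pathlib import Path


def proof_checks(project_dir: Path, lean_file: Path, timeout_s: int = 600) -> bool:
    """Return True if the candidate proof file compiles inside the Lean project.

    Runs the file through `lake env lean` so it is elaborated against the
    project's dependencies; a zero exit code with no `sorry` counts as success.
    """
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(lean_file)],
            cwd=project_dir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    output = result.stdout.decode() + result.stderr.decode()
    # Lean accepts `sorry` with only a warning, so reject it explicitly.
    return result.returncode == 0 and "sorry" not in output


def pass_rate(project_dir: Path, candidate_files: list[Path]) -> float:
    """Fraction of theorems whose candidate proof is accepted by Lean."""
    if not candidate_files:
        return 0.0
    solved = sum(proof_checks(project_dir, f) for f in candidate_files)
    return solved / len(candidate_files)
```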
📝 Abstract
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite that targets these tasks on challenging research-level theorems drawn from real-world Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.