RLMEval: Evaluating Research-Level Neural Theorem Proving

📅 2025-10-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) show limited practical efficacy in research-level neural theorem proving and proof autoformalization. Method: The authors introduce RLMEval, the first evaluation benchmark built from authentic research scenarios, comprising 613 challenging mathematical theorems drawn from real-world Lean Blueprint formalization projects, with formal verification and automated execution integrated to measure model performance via pass rate. Contribution/Results: Experiments show that the best state-of-the-art model achieves only a 10.3% pass rate, exposing critical bottlenecks in higher-order mathematical reasoning and in generalizing formalization across theorems. This work establishes a more realistic, challenging evaluation standard for automated formal mathematics and identifies concrete directions for improving models at neural theorem proving and autoformalization.
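The summary's headline metric is a pass rate over verifier-checked proof attempts. As a minimal sketch (not the paper's actual harness), a theorem counts as passed if any sampled attempt is accepted by the Lean verifier; the `pass_rate` function and the toy data below are illustrative assumptions:

```python
# Hypothetical sketch of a pass-rate computation for a benchmark like
# RLMEval: a theorem passes if ANY sampled proof attempt is accepted
# by the verifier. Names and data here are illustrative, not from the paper.

def pass_rate(results: dict[str, list[bool]]) -> float:
    """results maps theorem name -> verifier verdicts, one per attempt."""
    if not results:
        return 0.0
    passed = sum(1 for verdicts in results.values() if any(verdicts))
    return passed / len(results)

# Toy example: three theorems, one proved on its second attempt.
toy = {
    "thm_a": [False, True],
    "thm_b": [False, False],
    "thm_c": [False],
}
print(f"{pass_rate(toy):.1%}")  # prints "33.3%"
```

In this any-attempt form the metric is equivalent to pass@k when k attempts are sampled per theorem.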

📝 Abstract
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
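To make the task concrete: in neural theorem proving, a model is given a formal statement and must produce a proof that the Lean kernel accepts. A toy Lean 4 example (far simpler than the research-level theorems in RLMEval's Blueprint projects) shows the shape of the task:

```lean
-- Toy illustration: given the statement, the model must supply the
-- tactic script after `by`, and the Lean kernel verifies it.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

Proof autoformalization is the companion task: translating an informal (natural-language) proof into such a kernel-checked script.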
Problem

Research questions and friction points this paper is trying to address.

Evaluating neural theorem proving on research-level mathematics
Assessing proof autoformalization using real Lean projects
Bridging the performance gap in automated reasoning benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating theorem proving with real Lean projects
Creating benchmark from research-level mathematics theorems
Assessing autoformalization on challenging formalization tasks