LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing mathematical evaluation benchmarks predominantly rely on static problem sets, which inadequately assess the reasoning and proof capabilities of large language models in authentic mathematical research contexts. This work proposes the first dynamically updatable benchmark, which employs an automated pipeline to extract lemmas from recent arXiv papers, supplement missing definitions, and reformulate them into self-contained problems. The resulting high-quality evaluation suite closely mirrors real-world research scenarios while effectively mitigating train–test data contamination. Experimental results reveal that even state-of-the-art large language models achieve only 10–15% pass@1 accuracy on theorem proving within this benchmark, underscoring a significant gap in their capacity for research-level mathematical reasoning.


📝 Abstract
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. It consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. The result is a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10–15% accuracy in theorem proving (pass@1) depending on the model, showing that a large gap remains before LLMs reach human-level proving capabilities in a research context.
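As a side note on the pass@1 metric reported above: when several proof attempts are sampled per lemma, pass@k is commonly computed with the standard unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of samples and c the number that pass verification. The sketch below is illustrative (the function name and sample counts are not from the paper); for k = 1 it reduces to the simple success fraction c/n.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled attempts, c of them correct.

    Returns 1 - C(n-c, k) / C(n, k): the probability that a uniformly
    random subset of k attempts contains at least one correct proof.
    """
    if n - c < k:
        # Too few failures to fill a size-k subset: every subset succeeds.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one attempt per lemma (k=1), pass@1 is just the success fraction c/n,
# e.g. 1 correct proof out of 10 attempts gives 0.1 (i.e. 10%).
print(pass_at_k(n=10, c=1, k=1))
```

Averaging this per-problem estimate over all lemmas in the benchmark gives the aggregate pass@k figure.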
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
mathematical reasoning
theorem proving
benchmarking
research-level mathematics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LemmaBench
research-level mathematics
automatic lemma extraction
self-contained theorem statements
updatable benchmark
Antoine Peyronnet
ENS Rennes, École des Ponts, IP Paris
Fabian Gloeckle
École des Ponts, IP Paris
Amaury Hayat
Professor, École des Ponts ParisTech, CERMICS
Control Theory, Partial Differential Equations, AI for maths, Traffic flows, Optical forces