🤖 AI Summary
Existing mathematical evaluation benchmarks predominantly rely on static problem sets, which inadequately assess the reasoning and proof capabilities of large language models in authentic mathematical research contexts. This work proposes the first dynamically updatable benchmark, which employs an automated pipeline to extract lemmas from recent arXiv papers, supplement missing definitions, and reformulate them into self-contained problems. The resulting high-quality evaluation suite closely mirrors real-world research scenarios while effectively mitigating train–test data contamination. Experimental results reveal that even state-of-the-art large language models achieve only 10–15% pass@1 accuracy on theorem proving within this benchmark, underscoring a significant gap in their capacity for research-level mathematical reasoning.
📝 Abstract
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark that evaluates models directly on the latest research results in mathematics. The benchmark is built on an automatic pipeline that extracts lemmas from recent arXiv papers and rewrites them into self-contained statements by making all assumptions and required definitions explicit. The pipeline yields a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10–15% accuracy in theorem proving (pass@1) depending on the model, showing that LLMs still have a large margin of progress before reaching human-level proving capabilities in a research context.
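The extraction step of such a pipeline can be sketched in miniature as below. This is a hedged illustration, not the paper's implementation: all names (`Problem`, `extract_lemmas`, `make_self_contained`) are hypothetical, and the real pipeline would rely on an LLM to locate missing definitions and rewrite statements, rather than the simple regex and string concatenation used here.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Problem:
    """A self-contained benchmark item derived from one lemma. (Illustrative schema.)"""
    source_id: str                 # arXiv identifier of the originating paper
    statement: str                 # lemma text with prerequisites made explicit
    definitions: list = field(default_factory=list)  # definitions prepended to the lemma

def extract_lemmas(latex: str) -> list:
    """Pull the bodies of \\begin{lemma}...\\end{lemma} environments from LaTeX source."""
    return re.findall(r"\\begin\{lemma\}(.*?)\\end\{lemma\}", latex, re.DOTALL)

def make_self_contained(lemma: str, definitions: list) -> str:
    """Prepend required definitions so the statement stands on its own."""
    preamble = "\n".join(f"Definition: {d}" for d in definitions)
    return f"{preamble}\nProve: {lemma.strip()}"

def build_problems(arxiv_id: str, latex: str, definitions: list) -> list:
    """Turn one paper's LaTeX source into a list of self-contained problems."""
    return [
        Problem(arxiv_id, make_self_contained(lem, definitions), definitions)
        for lem in extract_lemmas(latex)
    ]

# Toy example with a hypothetical arXiv id and a single hand-supplied definition.
sample = r"\begin{lemma}Every group of prime order is cyclic.\end{lemma}"
problems = build_problems(
    "2401.00000",
    sample,
    ["A group G is cyclic if it is generated by a single element."],
)
```

In the actual system, the definition-gathering and rewriting stages are where most of the difficulty lies; the value of automating them is that the benchmark can be refreshed from new arXiv submissions, keeping evaluation data out of training corpora.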