IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing mathematical reasoning benchmarks primarily target high-school competition problems or final-answer matching, and so fail to assess large language models' (LLMs) genuine capability to generate research-level proofs. Method: We introduce IMProofBench, a research-oriented benchmark for evaluating LLMs on frontier mathematical proof generation, comprising 39 original, peer-reviewed problems developed by expert mathematicians. Each problem requires a full proof and is paired with final-answer subproblems, supporting both expert human evaluation and fine-grained automated grading. Models are evaluated in an agentic framework with access to web search for literature review and mathematical software such as SageMath, simulating a realistic research workflow. Contribution/Results: Experiments reveal nascent but measurable research-level reasoning ability: Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 produces fully correct proofs for 22% of problems. IMProofBench addresses a critical gap in mathematical reasoning evaluation and will continue to evolve with the mathematical community as a benchmark for the next generation of LLMs.

📝 Abstract
As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are limited, as they focus solely on final-answer questions or high-school competition problems. To address this gap, we introduce IMProofBench, a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers, supporting both an evaluation of mathematical reasoning capabilities by human experts and a large-scale quantitative analysis through automated grading. Furthermore, unlike prior benchmarks, the evaluation setup simulates a realistic research environment: models operate in an agentic framework with tools like web search for literature review and mathematical software such as SageMath. Our results show that current LLMs can succeed at the more accessible research-level questions, but still encounter significant difficulties on more challenging problems. Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer subproblems, while GPT-5 obtains the best performance for proof generation, achieving a fully correct solution for 22% of problems. IMProofBench will continue to evolve as a dynamic benchmark in collaboration with the mathematical community, ensuring its relevance for evaluating the next generation of LLMs.
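The paper itself does not publish code on this page, but the evaluation setup described in the abstract (an LLM driven through a tool loop with web search and SageMath) can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the authors' framework: the tool names `run_sagemath` and `web_search`, the injected `llm` callable, and the loop structure are all hypothetical.

```python
import subprocess

# Hypothetical tool interface: the benchmark's actual agent framework is not
# shown in this summary, so names and signatures here are illustrative only.

def run_sagemath(code: str, timeout: int = 60) -> str:
    """Run a snippet in SageMath and return its output (assumes `sage` is on PATH)."""
    result = subprocess.run(
        ["sage", "-c", code], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout or result.stderr

def web_search(query: str) -> str:
    """Placeholder for a literature-search tool; wire up a real backend here."""
    raise NotImplementedError

TOOLS = {"run_sagemath": run_sagemath, "web_search": web_search}

def solve_problem(llm, problem_statement: str, max_steps: int = 20) -> str:
    """Drive an LLM through an exploratory tool loop until it emits a proof attempt."""
    transcript = [{"role": "user", "content": problem_statement}]
    for _ in range(max_steps):
        reply = llm(transcript)        # assumed: returns a dict, optionally requesting a tool
        if reply.get("tool"):          # model asked to consult a tool
            name, arg = reply["tool"], reply["arguments"]
            observation = TOOLS[name](arg)
            transcript.append({"role": "tool", "name": name, "content": observation})
        else:                          # model produced its final proof attempt
            return reply["content"]
    return "no proof produced within the step budget"
```

The point of the sketch is only the workflow: the model may interleave literature search and symbolic computation before committing to a written proof, which is then graded as described above.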
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on research-level mathematical proof generation
Addressing limitations of existing benchmarks with expert-curated problems
Simulating realistic research environments with tool-enhanced agentic frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Private benchmark with expert-developed research problems
Agentic framework with web search and math tools
Combines human evaluation of full proofs with automated grading of final-answer subproblems (see the sketch below)
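To make the last point concrete, automated grading of the final-answer subproblems could look roughly like the following. This is an assumption-laden sketch, not the benchmark's released harness: the `Subproblem` layout and the crude `normalize` rule are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Subproblem:
    prompt: str
    reference_answer: str   # expert-provided final answer

def normalize(answer: str) -> str:
    """Very crude canonicalization; a real grader would parse mathematical expressions."""
    return answer.strip().lower().replace(" ", "")

def grade_subproblems(model_answers: dict[str, str],
                      subproblems: dict[str, Subproblem]) -> float:
    """Return the fraction of final-answer subproblems the model answered correctly."""
    correct = sum(
        normalize(model_answers.get(sid, "")) == normalize(sp.reference_answer)
        for sid, sp in subproblems.items()
    )
    return correct / len(subproblems) if subproblems else 0.0

# Full proofs, by contrast, are scored by expert mathematicians; only the
# final-answer subproblems are amenable to this kind of automated check.
```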