IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

๐Ÿ“… 2025-11-30
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the automatic formalization of natural-language mathematical problems into the theorem-proving language Lean 4, targeting the critical gap between syntactic validity and semantic correctness in order to improve proof success rates. To this end, we introduce IndiMathBench, the first human-verified benchmark of 312 Indian Mathematical Olympiad problems with precise, expert-curated Lean 4 formalizations. We propose a human-AI collaborative formalization framework integrating category-based retrieval, multi-model ensemble generation, compilation-feedback-driven iterative refinement, and expert validation, supported by an automated quality-assessment dashboard. Experiments reveal that state-of-the-art models exhibit substantial deficiencies in semantic correctness and achieve low proof success rates, confirming the benchmark's difficulty. IndiMathBench thus establishes a rigorous, reproducible evaluation platform for advancing research on mathematical reasoning and automated formalization.
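To make the benchmark's (informal, formal) pairing concrete, here is an illustrative toy example of the shape such an entry takes. This statement is *not* from IndiMathBench; the real entries are olympiad problems, and `toy_even` is a hypothetical name used only for this sketch. Following the common convention for formalization benchmarks, the proof is left as `sorry`, since proving the statement is the evaluated task.

```lean
-- Informal statement: "Show that for every natural number n,
-- n * (n + 1) is even."
-- A Lean 4 formalization of that statement might look like:
theorem toy_even (n : Nat) : 2 ∣ n * (n + 1) := by
  sorry  -- benchmark items ship statements; producing the proof is the task
```

Checking that such a file elaborates (modulo `sorry`) verifies syntactic validity; whether the formal statement faithfully captures the informal one is the semantic-correctness question the paper highlights.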

๐Ÿ“ Abstract
We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered, human-assisted pipeline for formalizing natural-language problems in Lean. IndiMathBench comprises 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models shows that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem-proving success rates remain low even with iterative refinement, demonstrating that the benchmark presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
Problem

Research questions and friction points this paper is trying to address.

Autoformalizes natural language math problems into Lean
Evaluates theorem proving via human-verified benchmark
Assesses semantic correctness gaps in autoformalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-human pipeline for formalizing natural language math problems
Iterative compiler feedback and multi-model ensembles for candidate generation
Expert validation via interactive dashboard with automated summaries
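The compilation-feedback loop in the pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_candidate` and `compile_lean` are hypothetical stand-ins for the model call and the Lean 4 checker (a real pipeline would invoke an LLM API and shell out to the Lean toolchain, e.g. via `lake`).

```python
# Hedged sketch of compilation-feedback-driven iterative refinement.
# Both helpers below are stubs invented for illustration; only the
# loop structure mirrors the pipeline described in the paper.

def compile_lean(source: str) -> list[str]:
    """Stub Lean 4 check: return error messages (empty list = compiles).
    A real pipeline would run the Lean compiler on the candidate file."""
    errors = []
    if "sorry" in source:
        errors.append("declaration uses 'sorry'")
    if "theorem" not in source:
        errors.append("no theorem declaration found")
    return errors

def generate_candidate(problem: str, feedback: list[str]) -> str:
    """Stub model call: a real pipeline would prompt an LLM with the
    informal problem plus any compiler errors from the last attempt."""
    if feedback:  # pretend the model repairs the candidate on retry
        return "theorem sum_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b"
    return "theorem sum_comm (a b : Nat) : a + b = b + a := sorry"

def refine(problem: str, max_rounds: int = 3) -> tuple[str, bool]:
    """Alternate generation and compilation, feeding errors back to the
    generator until the candidate compiles or the budget is exhausted."""
    feedback: list[str] = []
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate_candidate(problem, feedback)
        feedback = compile_lean(candidate)
        if not feedback:
            return candidate, True
    return candidate, False

candidate, ok = refine("Show that addition of naturals is commutative.")
print(ok)  # the stub repairs its own 'sorry' on round 2, so this prints True
```

Note that this loop only enforces syntactic validity; the expert dashboard in the pipeline exists precisely because a compiling candidate can still misstate the informal problem.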
๐Ÿ”Ž Similar Papers
No similar papers found.