🤖 AI Summary
Existing formal mathematical reasoning benchmarks suffer from limited scale, narrow domain coverage, and prohibitively high formalization costs, hindering rigorous evaluation of AI systems. Method: We introduce FormalMATH, the first large-scale, human-verified benchmark of mathematics formalized in Lean 4, comprising 5,560 problems spanning algebra, number theory, calculus, and combinatorics, ranging from International Mathematical Olympiad (IMO)-level challenges to undergraduate-level theorems. We design a human-in-the-loop automated formalization pipeline integrating domain-specialized LLM-based proposition formalization, multi-model semantic consistency verification, counterexample-driven filtering, and chain-of-thought sampling; notably, we find that natural-language solution guidance degrades proof success rates. Contribution/Results: Our pipeline achieves a 72.09% proposition retention rate. State-of-the-art models attain only 16.46% average solving accuracy on FormalMATH, exposing substantial domain bias and overreliance on shallow heuristic tactics, and highlighting critical gaps in current formal reasoning capabilities.
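The negation-based (counterexample-driven) filtering step can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: `try_prove` stands in for an off-the-shelf LLM-based prover, and the string-level `negate` is a placeholder for real Lean 4 statement negation.

```python
# Sketch of negation-based disproof filtering (illustrative names only).
# Idea: if a prover can prove the NEGATION of a formalized statement,
# the statement is false (likely mis-formalized) and is filtered out.

def negate(statement: str) -> str:
    # Placeholder: real Lean 4 negation requires manipulating the theorem
    # syntax, not just wrapping a string.
    return f"not({statement})"

def filter_statements(statements, try_prove):
    """Keep only statements whose negation the prover fails to prove.

    `try_prove(prop)` is a stand-in for an automated prover: it returns
    True when it finds a proof of the proposition `prop`.
    """
    kept = []
    for s in statements:
        if not try_prove(negate(s)):  # negation unprovable -> statement survives
            kept.append(s)
    return kept

# Toy demonstration with a stub prover that "disproves" one false statement.
good, bad = "2 + 2 = 4", "2 + 2 = 5"
stub_prover = lambda prop: prop == "not(2 + 2 = 5)"
print(filter_statements([good, bad], stub_prover))  # -> ['2 + 2 = 4']
```

A statement that survives this filter is not guaranteed correct (the prover may simply fail on a true negation), which is why the pipeline still ends with manual expert verification.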
📝 Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by the limited scope and scale of existing benchmarks. To address this, we present FormalMATH, a large-scale Lean 4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
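To make concrete what a benchmark entry of this kind looks like, here is a hypothetical FormalMATH-style item (not an actual problem from the dataset): a natural-language statement ("show that for every real x, x² + 2x + 1 ≥ 0") paired with its Lean 4 formalization over Mathlib, followed by one possible proof.

```lean
import Mathlib

-- Illustrative example only; theorem name and problem are invented.
-- Natural language: for every real x, x^2 + 2x + 1 ≥ 0.
theorem sq_add_two_mul_add_one_nonneg (x : ℝ) :
    0 ≤ x ^ 2 + 2 * x + 1 := by
  have h : x ^ 2 + 2 * x + 1 = (x + 1) ^ 2 := by ring
  rw [h]
  exact sq_nonneg (x + 1)
```

In the benchmark setting, only the statement is given; a prover must produce the `by ...` proof term itself, and a semantically faithful statement is exactly what the autoformalization and verification pipeline is designed to guarantee.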