MathAtlas: A Benchmark for Autoformalization in the Wild

📅 2026-05-13
🤖 AI Summary
Existing autoformalization benchmarks are largely confined to olympiad or undergraduate-level mathematics and offer little coverage of graduate-level material and beyond. This work proposes MathAtlas, the first large-scale benchmark for graduate-level autoformalization. It comprises approximately 52,000 mathematical entities extracted from 103 textbooks and introduces, for the first time in an autoformalization benchmark, a mathematical dependency graph with approximately 178,000 relations to enable dependency-aware evaluation. Constructed through textbook text extraction, entity recognition, and relation inference, MathAtlas is both high-quality and highly challenging: even the strongest baseline reaches only 9.8% correctness on theorem statements and 16.7% on definitions, dropping further to 2.6% on MA-Hard, a subset of 700 entities with the deepest dependency trees.
📝 Abstract
Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating the evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% correctness. We release MathAtlas to the community as a benchmark for large-scale autoformalization of graduate-level mathematics in the wild.
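To make "dependency depth" concrete, here is a minimal Python sketch of how one might rank entities by the depth of their dependency trees and pick the deepest ones, in the spirit of MA-Hard. This is not the authors' construction: the toy graph, the entity names, and the helpers `dependency_depth` and `deepest_subset` are all hypothetical, and the sketch assumes the dependency graph is acyclic.

```python
from functools import lru_cache

# Hypothetical toy graph: each entity maps to the entities it depends on.
# The real MathAtlas schema is not described in the abstract.
deps = {
    "group": [],
    "subgroup": ["group"],
    "normal_subgroup": ["subgroup"],
    "quotient_group": ["normal_subgroup", "group"],
    "first_isomorphism_theorem": ["quotient_group"],
    "ring": ["group"],
}


def dependency_depth(graph):
    """Return, for each entity, the length of its longest dependency chain.

    Assumes the graph is acyclic; a cycle would cause unbounded recursion.
    """

    @lru_cache(maxsize=None)
    def depth(node):
        children = graph.get(node, [])
        if not children:
            return 0  # no dependencies: depth zero
        return 1 + max(depth(child) for child in children)

    return {node: depth(node) for node in graph}


def deepest_subset(graph, k):
    """Return the k entities with the deepest dependency trees.

    Ties are broken alphabetically so the result is deterministic.
    """
    depths = dependency_depth(graph)
    return sorted(depths, key=lambda n: (-depths[n], n))[:k]


depths = dependency_depth(deps)
hard = deepest_subset(deps, 2)
print(depths)
print(hard)
```

With a depth score per entity, selecting a hard subset reduces to sorting and truncating; on a graph with ~178k relations an iterative topological pass would be a safer choice than recursion.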
Problem

Research questions and friction points this paper is trying to address.

autoformalization
graduate-level mathematics
benchmark
mathematical dependency
in the wild
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoformalization
graduate-level mathematics
mathematical dependency graph
benchmark
in the wild
Nilay Patel
University of California, Santa Cruz
Noah Arias
University of California, Santa Cruz
Davit Babayan
University of California, Santa Cruz
Victoria Cochran
University of California, Santa Cruz
Timothy Libman
University of California, Santa Cruz
Hafsah Mahmood
University of California, Santa Cruz
Liam McCarty
University of California, Santa Cruz
Soli Munoz
University of California, Santa Cruz
Laurel Willey
University of California, Santa Cruz
Jeffrey Flanigan
Assistant Professor, University of California, Santa Cruz
Natural Language Processing, Semantic Parsing, Generation, Machine Learning