🤖 AI Summary
Most LLM-based automated theorem proving (ATP) systems require expert-written formal statements as input, which limits their applicability to mathematical problems posed in natural language.
Method: We propose Mathesis, the first end-to-end theorem proving pipeline that processes informal problem statements, comprising: (i) Mathesis-Autoformalizer, an autoformalizer trained with reinforcement learning; (ii) LeanScorer, a framework for fine-grained assessment of formalization quality; and (iii) Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate real-world applicability, we also introduce Gaokao-Formal, a benchmark of 488 problems from China's national college entrance exam.
Contribution/Results: On Gaokao-Formal, Mathesis-Autoformalizer outperforms the best baseline by 22% in formalization pass rate. In end-to-end proof generation, the full system achieves 64% pass@32 on MiniF2F and a state-of-the-art 18% pass@32 on Gaokao-Formal.
📝 Abstract
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline that processes informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer to use reinforcement learning to improve the formalization of natural-language problems, aided by our novel LeanScorer framework for nuanced assessment of formalization quality. It also includes Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. We study each component of the pipeline in detail. Experiments demonstrate Mathesis's effectiveness: the autoformalizer outperforms the best baseline by 22% in pass rate on Gaokao-Formal, and the full system surpasses other model combinations, achieving 64% pass@32 accuracy on MiniF2F and a state-of-the-art 18% on Gaokao-Formal.
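The pass@32 figures above can be read with the commonly used pass@k definition: a problem counts as solved if at least one of k sampled attempts is verified as correct. The sketch below is a minimal illustration of the standard unbiased pass@k estimator, not the paper's evaluation code; the abstract does not say how the metric was computed, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of attempts sampled for the problem
    c: number of those attempts that were verified correct
    k: attempt budget being scored (k <= n)

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is an illustrative definition, not taken from the source.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark.

    results: one (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly k attempts are sampled per problem (n = k = 32), the estimator reduces to checking whether any attempt succeeded, so a benchmark score is simply the fraction of problems with at least one verified proof.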