🤖 AI Summary
Natural language mathematical proofs, especially in advanced mathematics, lack the modularity and unambiguous structure that make formal proofs easy to verify automatically. This work proposes Pseudo-Formalization (PF), a representation that decomposes a proof into self-contained modules, each explicitly stating its premises, conclusion, and proof in natural language. It also introduces Block Verification (BV), an algorithm in which a large language model validates each module independently. The approach combines the modularity and precision of formal proofs with the expressive flexibility of natural language. Evaluated on two benchmarks spanning olympiad and research-level mathematics, PF+BV Pareto-dominates LLM-as-judge baselines on error-finding precision and recall. The authors also release ArxivMathGradingBench, an open-source benchmark for evaluating proof verification systems on research-level mathematics.
📝 Abstract
Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it into Pseudo-Formal form and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it Pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark, ArxivMathGradingBench.
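The module structure and the per-module check described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Module` fields, the `block_verify` function, and the `toy_check` stand-in are hypothetical names chosen for illustration, and the checker here is a trivial rule rather than an LLM call. The sketch only shows the shape of the idea, namely that each module carries its own premises, conclusion, and proof, and is judged without reference to the others.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Module:
    """One self-contained block of a Pseudo-Formal proof (field names are illustrative)."""
    name: str
    premises: List[str]
    conclusion: str
    proof: str


# A checker receives a single module and returns True if it accepts the proof.
# In the paper this role is played by an LLM; here it is any callable.
Checker = Callable[[Module], bool]


def block_verify(modules: List[Module], check: Checker) -> List[str]:
    """Check every module on its own and return the names of rejected modules."""
    return [m.name for m in modules if not check(m)]


def toy_check(module: Module) -> bool:
    """Hypothetical stand-in for an LLM judgment: reject a module with an empty proof."""
    return bool(module.proof.strip())


modules = [
    Module("lemma_1", ["n is even"], "n^2 is even",
           "Write n = 2k; then n^2 = 2(2k^2)."),
    Module("lemma_2", ["n^2 is even"], "n^2 + 1 is odd", ""),
]

# Names of the modules the toy checker rejects.
flagged = block_verify(modules, toy_check)
```

Because each call to `check` sees only one module, a rejected name localizes the suspected error to a specific block, which is the kind of error-finding signal the precision and recall evaluation measures.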