AI Summary
Existing formal theorem proving (FTP) datasets are small-scale, expensive to construct, and lack rigorously aligned formal-informal problem pairs, hindering large language model evaluation and training on complex mathematical reasoning. Method: We propose the first scalable, theory-of-computation-based synthetic framework that automatically generates theorem-proof pairs with verifiably semantically consistent formal and informal specifications, leveraging the Busy Beaver function and hybrid Boolean-arithmetic problems. The framework outputs dual-format artifacts (Lean 4 for formalization and Markdown for informal exposition) and performs end-to-end formal verification in Lean 4. Contribution/Results: Experiments reveal severe limitations in state-of-the-art models: peak solving rates of only 57.5% on Busy Beaver tasks and a sharp drop to 12% on hybrid Boolean-arithmetic tasks, exposing critical bottlenecks in long-range logical reasoning. Our work establishes a novel paradigm for FTP dataset construction.
Abstract
Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean 4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeek-Prover-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
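To make "computationally easy to verify" concrete, the sketch below checks one small instance from each domain by direct computation. It is illustrative only and is not the paper's generation pipeline or problem format: the transition table is the well-known 2-state, 2-symbol busy beaver champion, the identity `x + y = (x XOR y) + 2 * (x AND y)` is a standard mixed Boolean-arithmetic identity chosen here as an assumption, and the names `BB2`, `run`, and `mba_identity_holds` are hypothetical.

```python
from itertools import product

# Transition table of the well-known 2-state, 2-symbol busy beaver champion.
# (state, symbol read) -> (symbol to write, head move, next state); "H" halts.
BB2 = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "H"),
}


def run(table, max_steps=1000):
    """Simulate a machine from a blank tape, starting in state "A".

    Returns (steps taken, number of 1s on the tape) if the machine halts
    within max_steps, and None if the step budget runs out first.
    """
    tape, pos, state, steps = {}, 0, "A", 0
    while state != "H":
        if steps >= max_steps:
            return None
        write, move, state = table[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        steps += 1
    return steps, sum(tape.values())


def mba_identity_holds(bits=8):
    """Exhaustively check x + y == (x ^ y) + 2 * (x & y) modulo 2**bits."""
    mask = (1 << bits) - 1
    return all(
        ((x + y) & mask) == (((x ^ y) + 2 * (x & y)) & mask)
        for x, y in product(range(1 << bits), repeat=2)
    )
```

Here `run(BB2)` returns `(6, 4)`: the machine halts after 6 steps with four 1s on the tape. Checking this takes microseconds, whereas a formal proof has to justify every simulation step inside the proof assistant, which is the gap between verification and long-form proof generation that the abstract points to.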