Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

📅 2025-08-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing formal theorem proving (FTP) datasets are small, expensive to construct, and lack rigorously aligned formal-informal problem pairs, which hinders the evaluation and training of large language models on complex mathematical reasoning. Method: We propose a scalable synthesis framework grounded in theoretical computer science that automatically generates theorem-proof pairs whose formal and informal specifications are produced in parallel, using Busy Beaver problems and Mixed Boolean Arithmetic problems as source domains. The framework emits each problem in two formats, Lean 4 for the formal statement and Markdown for the informal exposition, and proofs are verified in Lean 4. Contribution/Results: Experiments reveal substantial limitations in frontier models: DeepSeekProver-V2-671B reaches only 57.5% success on Busy Beaver problems and drops to 12% on Mixed Boolean Arithmetic problems, which points to long-form proof generation as a bottleneck even when the underlying claims are computationally easy to verify. The work offers TCS domains as a scalable source for FTP dataset construction.

📝 Abstract

Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean 4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
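To make the "computationally easy to verify" point concrete, here is a minimal Python sketch of the kind of claim a Busy Beaver problem rests on: that a given Turing machine halts within a stated number of steps. The transition table is the well-known 2-state, 2-symbol busy beaver champion; the function name `run`, the table name `BB2`, and the tuple encoding are illustrative assumptions, not the paper's own format or problem instances.

```python
from collections import defaultdict

# Transition table for the 2-state, 2-symbol busy beaver champion.
# Key: (state, symbol read) -> (symbol to write, head move, next state).
# The encoding is an assumption made for this sketch.
BB2 = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "H"),  # "H" is the halting state
}


def run(machine, max_steps):
    """Simulate from a blank tape; return (halted, steps, ones_on_tape)."""
    tape = defaultdict(int)  # unvisited cells read as 0
    state, head, steps = "A", 0, 0
    while state != "H" and steps < max_steps:
        write, move, next_state = machine[(state, tape[head])]
        tape[head] = write
        head += move
        state = next_state
        steps += 1
    return state == "H", steps, sum(tape.values())


halted, steps, ones = run(BB2, max_steps=100)
print(halted, steps, ones)  # True 6 4
```

Checking the bound by simulation takes a handful of steps, whereas a formal proof has to justify every transition inside the proof assistant, which is one plausible reason the proofs grow long even for small machines.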

Problem

Research questions and friction points this paper is trying to address.

Automated generation of theorem-proof pairs from theoretical computer science domains

Scalable synthesis of verified formal-informal proof challenge pairs

Addressing limited datasets for evaluating LLM reasoning capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated generation of theorem-proof pairs

Leveraging theoretical computer science domains

Parallel formal-informal specification synthesis
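The idea of parallel formal-informal synthesis can be sketched in Python as follows, using a standard Mixed Boolean Arithmetic identity, x + y = (x XOR y) + 2·(x AND y). This is only an illustration under stated assumptions: the helper names `mba_identity_holds` and `render_pair`, the theorem name `mba_add`, and the Lean-style statement string are hypothetical, the string has not been checked against Lean 4, and none of it is taken from the paper's generator.

```python
def mba_identity_holds(bits=8):
    """Exhaustively check x + y == (x ^ y) + 2*(x & y) modulo 2**bits."""
    mask = (1 << bits) - 1
    return all(
        ((x + y) & mask) == (((x ^ y) + 2 * (x & y)) & mask)
        for x in range(1 << bits)
        for y in range(1 << bits)
    )


def render_pair(bits=8):
    """Render one identity as a Markdown statement and a Lean-style statement.

    Both strings come from the same parameters, so they describe the same
    claim by construction. The Lean text is a hypothetical template.
    """
    informal = (
        f"**Problem.** Show that for all {bits}-bit words $x$ and $y$, "
        "$x + y = (x \\oplus y) + 2 \\cdot (x \\wedge y)$."
    )
    formal = (
        f"theorem mba_add (x y : BitVec {bits}) :\n"
        "    x + y = (x ^^^ y) + 2 * (x &&& y) := by\n"
        "  sorry"
    )
    return informal, formal


informal, formal = render_pair(bits=8)
print(mba_identity_holds(8))
print(informal)
print(formal)
```

Generating both renderings from one set of parameters is what keeps the formal and informal versions aligned; the brute-force check only confirms that the claim is true before it is posed as a proof challenge.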

🔎 Similar Papers

A Semantic Search Engine for Mathlib4