Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

📅 2025-08-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Existing formal theorem proving (FTP) datasets are small, expensive to construct, and lack rigorously aligned formal-informal problem pairs, which hinders both the evaluation and the training of large language models on complex mathematical reasoning. Method: We propose a scalable synthesis framework grounded in theoretical computer science that automatically generates theorem-proof pairs with semantically consistent formal and informal specifications, drawing on Busy Beaver problems and Mixed Boolean Arithmetic problems. Each problem is emitted in two formats, Lean 4 for the formal statement and Markdown for the informal exposition, and proofs are verified end to end in Lean 4. Contribution/Results: Experiments reveal severe limitations in frontier models: the peak solving rate, achieved by DeepSeek-Prover-V2-671B, is only 57.5% on Busy Beaver problems and drops to 12% on Mixed Boolean Arithmetic problems, exposing a bottleneck in long-form proof generation even for statements that are computationally easy to verify. Our work offers a new approach to FTP dataset construction.

📝 Abstract
Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean 4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeek-Prover-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
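
To make "proving bounds on Turing machine halting behavior" concrete, here is a minimal Python sketch of the kind of claim a Busy Beaver problem asks about: that a given machine halts within a stated number of steps. The simulator, the names `BB2` and `run`, and the choice of machine (the well-known 2-state, 2-symbol Busy Beaver champion) are illustrative assumptions, not the paper's generator or its Lean 4 encoding.

```python
from typing import Dict, Optional, Tuple

# (symbol to write, head movement, next state); names are hypothetical.
Transition = Tuple[int, int, str]
Machine = Dict[Tuple[str, int], Transition]

# The classic 2-state, 2-symbol Busy Beaver champion.
BB2: Machine = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "HALT"),
}


def run(machine: Machine, max_steps: int) -> Optional[Tuple[int, int]]:
    """Simulate from a blank tape in state "A".

    Returns (steps taken, number of 1s on the tape) if the machine halts
    within max_steps, otherwise None.
    """
    tape: Dict[int, int] = {}  # sparse tape; unwritten cells read as 0
    head, state, steps = 0, "A", 0
    while state != "HALT":
        if steps >= max_steps:
            return None  # step budget exhausted without halting
        write, move, state = machine[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        steps += 1
    return steps, sum(tape.values())


if __name__ == "__main__":
    print(run(BB2, 100))  # halts after 6 steps with 4 ones: (6, 4)
    print(run(BB2, 5))    # budget too small to observe halting: None
```

Checking such a claim by simulation is cheap, which matches the abstract's point that these problems are computationally easy to verify. The hard part for a prover is producing a formal proof of the same fact, step by step.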
Problem

Research questions and friction points this paper is trying to address.

Automated generation of theorem-proof pairs from theoretical computer science domains
Scalable synthesis of verified formal-informal proof challenge pairs
Addressing limited datasets for evaluating LLM reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated generation of theorem-proof pairs
Leveraging theoretical computer science domains
Parallel formal-informal specification synthesis
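
As a rough illustration of parallel formal-informal synthesis, the following Python sketch takes one Mixed Boolean Arithmetic identity, checks it exhaustively on fixed-width integers, and renders it both as a Lean 4 theorem statement and as a Markdown problem statement. The identity, the function names, the use of `BitVec`, and both templates are assumptions made for this example; the paper's actual templates and problem distribution may differ.

```python
from typing import Tuple


def mba_identity_holds(bits: int = 8) -> bool:
    """Exhaustively check x + y == (x XOR y) + 2 * (x AND y) modulo 2**bits."""
    mask = (1 << bits) - 1
    return all(
        ((x + y) & mask) == (((x ^ y) + 2 * (x & y)) & mask)
        for x in range(1 << bits)
        for y in range(1 << bits)
    )


def render_pair(bits: int = 8) -> Tuple[str, str]:
    """Render the same identity as a Lean 4 statement and a Markdown statement.

    Both templates are hypothetical. The Lean proof is left as `sorry`,
    since supplying it is the challenge posed to the prover.
    """
    lean = (
        f"theorem mba_add (x y : BitVec {bits}) :\n"
        f"    x + y = (x ^^^ y) + 2 * (x &&& y) := by\n"
        f"  sorry"
    )
    markdown = (
        rf"**Problem.** Let $x$ and $y$ be {bits}-bit unsigned integers. "
        rf"Prove that $x + y = (x \oplus y) + 2\,(x \wedge y)$, "
        rf"where all arithmetic is taken modulo $2^{{{bits}}}$."
    )
    return lean, markdown


if __name__ == "__main__":
    assert mba_identity_holds(8)  # sanity check before emitting the pair
    lean_stmt, md_stmt = render_pair(8)
    print(lean_stmt)
    print()
    print(md_stmt)
```

Generating both renderings from one underlying expression is what keeps the formal and informal statements aligned by construction, rather than relying on a separate translation step.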
Terry Jingchen Zhang
ETH Zurich
(Multimodal) Reasoning, AI Safety, Actionable Interpretability, AI4Science, Astrophysics
Wenyuan Jiang
D-INFK, ETH Zurich, Zurich, Switzerland
Rongchuan Liu
D-INFK, ETH Zurich, Zurich, Switzerland
Yisong Wang
D-INFK, ETH Zurich, Zurich, Switzerland
Junran Yang
Independent Researcher
Ning Wang
D-INFK, ETH Zurich, Zurich, Switzerland
Nicole Ni
University of Pennsylvania, PA, USA
Yinya Huang
Postdoc Fellow at ETH AI Center, ETH Zürich; Prev. CityU Hong Kong, SYSU
AI for Math, AI for Science, Reliable Machine Learning, LLMs, NLP
Mrinmaya Sachan
Assistant Professor, ETH Zürich
Natural Language Processing, Reasoning, AI for Education