LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses weak generalization and inefficient search in LLM-based automated theorem proving, both stemming from scarce high-quality training data. We propose a proof-state-space sampling method for synthetic data generation that systematically covers diverse intermediate proof states and tactic combinations, and introduce an adaptive beam-width control mechanism that dynamically balances exploration and exploitation during tree search. The generated data enables one-shot fine-tuning of policy models, eliminating the need for iterative refinement or reinforcement learning. Our approach achieves 60.74% Pass@1 on MiniF2F and 21.18% on ProofNet, substantially outperforming state-of-the-art baselines. Key contributions are: (1) a scalable, semantically rich paradigm for constructing synthetic theorem-proving data; and (2) a data-driven framework that jointly optimizes search and training.
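The data-synthesis idea described above can be illustrated with a minimal sketch. Assuming each complete proof script is replayed in a proof assistant to record (intermediate state, next tactic) steps, training pairs for the policy model could be built along these lines; the function name and record format are hypothetical, not the paper's actual pipeline:

```python
def make_training_pairs(proofs):
    """Toy illustration of turning replayed proof scripts into
    (proof_state, next_tactic) fine-tuning examples.

    `proofs` is a list of proofs, each a list of (state, tactic)
    steps. Duplicate pairs are dropped to favor diverse coverage
    of the proof state space.
    """
    seen = set()
    pairs = []
    for steps in proofs:
        for state, tactic in steps:
            key = (state, tactic)
            if key in seen:  # keep each (state, tactic) pair once
                continue
            seen.add(key)
            pairs.append({"prompt": state, "completion": tactic})
    return pairs
```

Deduplicating on the full (state, tactic) pair, rather than on the state alone, keeps distinct tactics for the same intermediate state, which matches the stated goal of producing diverse tactics per state.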

📝 Abstract
Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving, and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of the LLM as the policy model. We also propose an adaptive beam size strategy, which takes full advantage of our data synthesis method and strikes a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining average pass rates of 60.74% on MiniF2F and 21.18% on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM-based theorem proving via scalable synthetic data generation
Improving proof-state exploration for diverse tactic synthesis
Optimizing tree search with adaptive beam size strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel proof-state exploration for data synthesis
Adaptive beam size strategy for tree search
One-shot fine-tuning of LLM policy model
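The adaptive beam size idea listed above can be sketched generically. The paper does not spell out its control rule here, so the following is a hypothetical illustration: the beam widens when candidate scores are close (uncertainty calls for exploration) and narrows when one candidate clearly dominates (confidence calls for exploitation). All names, the spread heuristic, and the threshold are assumptions for illustration only:

```python
def adaptive_beam_search(root, expand, score, is_goal,
                         base_width=4, max_width=16, max_depth=10):
    """Toy beam search with an adaptive beam width.

    `expand(state)` yields successor states, `score(state)` ranks
    them (higher is better), and `is_goal(state)` tests success.
    The width shrinks to `base_width` when the best candidate
    clearly beats the median (exploit) and grows to `max_width`
    when scores are tightly clustered (explore).
    """
    beam = [root]
    for _ in range(max_depth):
        candidates = []
        for state in beam:
            for child in expand(state):
                if is_goal(child):
                    return child
                candidates.append((score(child), child))
        if not candidates:
            return None  # search space exhausted
        candidates.sort(key=lambda p: p[0], reverse=True)
        # spread between best and median score drives the width
        spread = candidates[0][0] - candidates[len(candidates) // 2][0]
        width = base_width if spread > 0.5 else max_width
        beam = [c for _, c in candidates[:width]]
    return None  # depth budget exhausted
```

In the theorem-proving setting, states would be proof states, `expand` would sample tactics from the fine-tuned policy model, and `score` would be the model's log-probability for each tactic.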
Junyu Lai
State Key Laboratory for Novel Software Technology, Nanjing University, China
Jiakun Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Shuo Xu
State Key Laboratory for Novel Software Technology, Nanjing University, China
Taolue Chen
School of Computing and Mathematical Sciences, Birkbeck, University of London
Zihang Wang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Yao Yang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Jiarui Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, China
Chun Cao
Nanjing University
Jingwei Xu
State Key Laboratory for Novel Software Technology, Nanjing University, China