🤖 AI Summary
This work addresses the challenge of weak generalization and inefficient search in LLM-based automated theorem proving, which stems from scarce high-quality training data. We propose a proof-state-space sampling method for synthetic data generation that systematically covers diverse intermediate proof states and tactic combinations, and we introduce an adaptive beam-width control mechanism that dynamically balances exploration and exploitation during tree search. The generated data enables one-shot fine-tuning of the policy model, eliminating the need for iterative refinement or reinforcement learning. Our approach achieves 60.74% Pass@1 on MiniF2F and 21.18% on ProofNet, substantially outperforming state-of-the-art baselines. Key contributions are: (1) a scalable, semantically rich paradigm for constructing synthetic theorem-proving data; and (2) a data-driven framework that jointly optimizes search and training.
📝 Abstract
Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving, and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of an LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and strikes a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of $60.74\%$ on MiniF2F and $21.18\%$ on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.
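To make the adaptive beam size idea concrete, below is a minimal illustrative sketch of a beam search over proof states whose width adapts at each depth. All names (`adaptive_beam_width`, `expand`, `is_proved`) and the specific control rule — widen the beam when the top candidate scores are close (high uncertainty), narrow it when one tactic clearly dominates — are assumptions for illustration; the abstract does not specify the paper's actual mechanism.

```python
import heapq

def adaptive_beam_width(scores, base_width=2, max_width=4, margin=0.2):
    """Hypothetical control rule: widen the beam when the top two
    candidate scores are within `margin` of each other (explore),
    otherwise keep the narrower base width (exploit)."""
    top = sorted(scores, reverse=True)
    if len(top) < 2 or top[0] - top[1] < margin:
        return max_width   # scores are close: keep more candidates
    return base_width      # one tactic dominates: prune harder

def beam_search(initial_state, expand, is_proved, max_depth=8):
    """Tree search over proof states. `expand(state)` returns
    (step_score, next_state) pairs, one per candidate tactic
    (e.g. tactics sampled from an LLM policy); cumulative scores
    rank the candidates retained at each depth."""
    beam = [(0.0, initial_state)]
    for _ in range(max_depth):
        candidates = []
        for score, state in beam:
            if is_proved(state):
                return state
            for step_score, nxt in expand(state):
                candidates.append((score + step_score, nxt))
        if not candidates:
            return None  # search space exhausted
        width = adaptive_beam_width([s for s, _ in candidates])
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return None
```

In this sketch, a closed-goal proof state is detected by `is_proved`, and the beam width is recomputed at every depth rather than fixed in advance, which is one simple way to trade exploration against exploitation during search.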