🤖 AI Summary
This work addresses the low search efficiency and high computational overhead of tree search in large-scale theorem proving with Lean 4. It proposes lightweight Best-First Search (BFS) as a viable alternative to more complex tree-search methods such as MCTS. Methodologically, it (1) fine-tunes the policy with Direct Preference Optimization (DPO) driven by compiler error feedback; (2) applies a difficulty-aware dynamic data filtering mechanism; and (3) adopts length normalization to encourage exploration of deeper proof paths. On the MiniF2F benchmark, the approach achieves a score of 71.31, comparable to state-of-the-art complex methods, while improving search cost-effectiveness and scalability. This is the first systematic demonstration of BFS's efficiency in formal reasoning. The work also establishes an expert iteration training framework tailored to theorem proving.
📝 Abstract
Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean 4, where effective tree search methods are crucial for navigating proof search spaces. While existing approaches primarily rely on value functions and Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Search (BFS) remains underexplored. This paper investigates whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present \texttt{BFS-Prover}, a scalable expert iteration framework featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. \texttt{BFS-Prover} achieves a score of $71.31$ on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled.
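The abstract does not give the scoring formula, so the following is only a minimal sketch of how length normalization can be combined with best-first search. It assumes a node's priority is the cumulative tactic log-probability divided by `depth ** alpha`; the parameter `alpha`, the callables `expand` and `is_proved`, and the toy proof graph are hypothetical stand-ins for the LLM policy and the Lean compiler, not the authors' implementation.

```python
import heapq
import itertools
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

State = Hashable
# One candidate step: (tactic text, log-probability under the policy, resulting state).
Expansion = Tuple[str, float, State]


def normalized_score(logprob_sum: float, depth: int, alpha: float) -> float:
    """Cumulative log-probability divided by depth**alpha (assumed form; root scores 0)."""
    if depth == 0:
        return 0.0
    return logprob_sum / (depth ** alpha)


def best_first_search(
    root: State,
    expand: Callable[[State], Iterable[Expansion]],
    is_proved: Callable[[State], bool],
    alpha: float = 0.5,
    max_expansions: int = 1000,
) -> Optional[List[str]]:
    """Always expand the frontier node with the highest length-normalized score."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal priority
    # heapq is a min-heap, so the priority is the negated score.
    frontier = [(0.0, next(tie_breaker), root, 0.0, [])]
    visited = {root}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, logprob_sum, path = heapq.heappop(frontier)
        if is_proved(state):
            return path
        for tactic, tactic_logprob, next_state in expand(state):
            if next_state in visited:
                continue
            visited.add(next_state)
            new_sum = logprob_sum + tactic_logprob
            new_path = path + [tactic]
            priority = -normalized_score(new_sum, len(new_path), alpha)
            heapq.heappush(
                frontier, (priority, next(tie_breaker), next_state, new_sum, new_path)
            )
    return None


# Hypothetical toy proof graph: tactic "a" looks likely but leads to a dead end.
TOY_GRAPH: Dict[str, List[Expansion]] = {
    "s0": [("a", -0.5, "dead"), ("b", -1.0, "s1")],
    "s1": [("c", -0.1, "s2")],
    "s2": [("d", -0.1, "done")],
}

proof = best_first_search(
    root="s0",
    expand=lambda s: TOY_GRAPH.get(s, []),
    is_proved=lambda s: s == "done",
)
print(proof)
```

With `alpha = 0`, the score reduces to the raw cumulative log-probability, which can only decrease as a path grows and therefore favors shallow nodes. Larger values of `alpha` shrink that penalty, which is one way to realize the "exploration of deeper proof paths" the abstract describes.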