🤖 AI Summary
This work addresses a limitation of current large language model approaches to automated theorem proving in Lean 4: step-level supervision provides dense local signals but can fragment coherent proofs, whole-proof generation preserves global structure but requires complex end-to-end generation, and the granularity between the two has been largely overlooked. The authors treat supervision granularity as a data construction problem over proof trajectories and propose segment-level supervision, which extracts locally coherent segments from complete proofs to train policy models. At inference time, they reuse the same strategy as a goal-aware rollout that triggers short rollouts for existing step-level models. The work positions segment-level supervision as an intermediate paradigm between step-level and whole-proof supervision that aligns both training and inference with proof structure. Policy models trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, outperforming both step-level and whole-proof baselines. Goal-aware rollout raises the success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33% while reducing inference cost.
📝 Abstract
Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.
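To make the three supervision granularities concrete, the following minimal sketch builds step-level, segment-level, and whole-proof training examples from a single toy proof trajectory. It is an illustration under stated assumptions, not the authors' pipeline: the `Step` record, the function names, and in particular the segmentation rule (close a segment when a step reduces the number of open goals, or after `max_len` tactics) are hypothetical, since the abstract does not specify how segments are delimited.

```python
from dataclasses import dataclass
from typing import List, Tuple

Example = Tuple[str, str]  # (model input, target tactic text)


@dataclass
class Step:
    goals: List[str]  # open goals before the tactic is applied
    tactic: str       # tactic applied at this step


def step_level(traj: List[Step]) -> List[Example]:
    # One example per tactic: dense local signal, no multi-step context.
    return [("\n".join(s.goals), s.tactic) for s in traj]


def whole_proof(theorem: str, traj: List[Step]) -> List[Example]:
    # One example per theorem: the full tactic script as a single target.
    return [(theorem, "\n".join(s.tactic for s in traj))]


def segment_level(traj: List[Step], max_len: int = 3) -> List[Example]:
    # ASSUMED rule (not from the paper): a segment ends when the step
    # reduces the number of open goals, or when it reaches max_len tactics.
    examples: List[Example] = []
    start = 0
    for i, step in enumerate(traj):
        goals_after = len(traj[i + 1].goals) if i + 1 < len(traj) else 0
        closes_goal = goals_after < len(step.goals)
        if closes_goal or (i - start + 1) >= max_len:
            seg = traj[start : i + 1]
            examples.append(
                ("\n".join(seg[0].goals), "\n".join(s.tactic for s in seg))
            )
            start = i + 1
    if start < len(traj):  # flush any trailing steps
        seg = traj[start:]
        examples.append(
            ("\n".join(seg[0].goals), "\n".join(s.tactic for s in seg))
        )
    return examples


# Toy trajectory for proving p ∧ q from hypotheses hp : p and hq : q.
traj = [
    Step(goals=["p ∧ q"], tactic="constructor"),
    Step(goals=["p", "q"], tactic="exact hp"),
    Step(goals=["q"], tactic="exact hq"),
]

step_examples = step_level(traj)
segment_examples = segment_level(traj)
proof_examples = whole_proof("theorem t (hp : p) (hq : q) : p ∧ q", traj)
```

On this toy trajectory the three constructions yield three, two, and one training examples respectively, with the first segment pairing the initial goal with the two tactics that split the conjunction and discharge its first component. Any real boundary criterion would need to be taken from the released code rather than from this sketch.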