🤖 AI Summary
In formal verification, LLM-generated initial proofs frequently contain errors, and existing correction methods rely on fixed strategies that cannot adapt to diverse error types, which hinders automation and scalability. To address this, we propose Adapt, the first framework to leverage LLM-driven dynamic strategy selection: it adaptively schedules multiple correction strategies based on real-time proof states, contextual information, and fine-grained error diagnostics, enabling closed-loop refinement. Adapt generalizes across models without manual strategy configuration. Evaluated on two mainstream theorem-proving benchmarks (MiniF2F and ProofNet), Adapt achieves absolute improvements of 16.63% and 18.58% in theorem-proving success rate over the strongest baselines, respectively, and ablation studies confirm the necessity and efficacy of each component. These results demonstrate substantial gains in the robustness and practical utility of LLMs for formal verification.
📝 Abstract
Formal verification via theorem proving enables the expressive specification and rigorous proof of software correctness, but it is difficult to scale due to the significant manual effort and expertise required. While Large Language Models (LLMs) show potential in proof generation, they frequently produce incorrect proofs on the first attempt and require additional strategies for iterative refinement. However, existing approaches employ fixed refinement strategies and cannot dynamically choose an effective strategy based on the particular issues in a generated proof, which limits their performance. To overcome this limitation, we introduce Adapt, a novel proof refinement framework that leverages an LLM-guided decision-maker to dynamically select a suitable refinement strategy according to the state of the proof assistant and the available context of an incorrect proof. We evaluate Adapt on two benchmarks against four existing methods and find that it significantly outperforms the best baseline on both by proving 16.63% and 18.58% more theorems, respectively. Furthermore, we demonstrate Adapt's generalizability by evaluating it across five different LLMs. We also conduct ablation studies to measure the contribution of each component and compare the trade-offs of alternative decision-maker designs.
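The closed-loop refinement described above can be sketched in miniature. This is an illustrative mock only, assuming a simple interface: all names (`ProofState`, `select_strategy`, `refine`, the two toy strategies, and the mock `check` function) are hypothetical stand-ins, not the paper's API, and the decision-maker here routes on the error string where Adapt would consult an LLM.

```python
# Hypothetical sketch of a closed-loop proof-refinement loop with dynamic
# strategy selection. Names and logic are illustrative assumptions, not the
# paper's implementation; real strategies and the decision-maker would call
# an LLM, and `check` would invoke an actual proof assistant.
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ProofState:
    proof: str
    error: Optional[str]  # None means the proof assistant accepted the proof


# Toy refinement strategies (stand-ins for LLM-backed strategies).
def repair_tactic(state: ProofState) -> str:
    # Locally patch a failing tactic.
    return state.proof.replace("auto", "simp")


def regenerate(state: ProofState) -> str:
    # Discard the attempt and produce a fresh proof.
    return "by simp"


STRATEGIES: dict = {
    "repair_tactic": repair_tactic,
    "regenerate": regenerate,
}


def select_strategy(state: ProofState) -> str:
    """Toy decision-maker: route on the error diagnostic.
    In the framework described, an LLM makes this choice from the
    proof-assistant state and surrounding context."""
    if state.error and "unknown tactic" in state.error:
        return "repair_tactic"
    return "regenerate"


def check(proof: str) -> Optional[str]:
    """Mock proof assistant: accepts only 'by simp', else reports an error."""
    return None if proof == "by simp" else "unknown tactic 'auto'"


def refine(initial: str, budget: int = 3) -> Tuple[str, bool]:
    """Iterate: diagnose, select a strategy, apply it, re-check."""
    state = ProofState(initial, check(initial))
    for _ in range(budget):
        if state.error is None:
            return state.proof, True
        strategy = STRATEGIES[select_strategy(state)]
        candidate = strategy(state)
        state = ProofState(candidate, check(candidate))
    return state.proof, state.error is None
```

The loop is what makes the approach "closed": each strategy choice is conditioned on the latest error diagnostic rather than following a fixed schedule, e.g. `refine("by auto")` repairs the tactic and succeeds within the budget.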