🤖 AI Summary
Human-written proofs are scarce, which limits the training data available to large language models (LLMs) for formal verification. To address this, the paper proposes SAFE, a Symbolic-Verifier-Driven Autonomous Proof Evolution framework. SAFE combines Rust semantic modeling, Z3-based symbolic verification, feedback-driven self-debugging, and fine-tuning on synthetic data to generate proofs for Rust programs automatically, without human annotation. Its core innovations are reusing erroneous proofs to train the model's self-debugging capability and using the symbolic verifier as an annotation-free oracle that separates correct proofs from incorrect ones. On an expert-constructed benchmark, SAFE reaches 52.52% accuracy, compared with 14.39% for GPT-4o, markedly improving the proof-writing ability of open-source models that started with no exposure to formal verification.
📝 Abstract
Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but it demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle is a severe lack of data: there are far fewer proofs than code snippets for Large Language Models (LLMs) to train on. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation for Rust code. SAFE establishes a self-evolving cycle in which data synthesis and fine-tuning collaborate to enhance the model's capability, leveraging the definitive power of a symbolic verifier to tell correct proofs from incorrect ones. SAFE also repurposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier's feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. The result is 52.52% accuracy on a benchmark crafted by human experts, a substantial leap over GPT-4o's 14.39%.
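The self-evolving cycle described above can be illustrated with a minimal sketch of one data-synthesis round. This is not the authors' implementation. The names `Verdict`, `Datasets`, `evolve_round`, `toy_generate`, and `toy_verify` are hypothetical, the model and the verifier are replaced by toy stand-ins, and pairing each rejected proof with an accepted proof for the same program is an assumption about how repair examples might be built.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Outcome of checking one candidate proof (hypothetical type)."""
    ok: bool
    feedback: str = ""


@dataclass
class Datasets:
    """Training data collected in one round (hypothetical layout)."""
    # (program, verified proof) pairs for proof-generation fine-tuning
    correct: List[Tuple[str, str]] = field(default_factory=list)
    # (program, rejected proof, verifier feedback, verified proof) tuples
    # for self-debugging fine-tuning
    repairs: List[Tuple[str, str, str, str]] = field(default_factory=list)


def evolve_round(
    programs: List[str],
    generate: Callable[[str, int], str],
    verify: Callable[[str, str], Verdict],
    samples: int = 4,
) -> Datasets:
    """Sample candidate proofs and let the verifier label them."""
    data = Datasets()
    for prog in programs:
        checked = []
        for i in range(samples):
            proof = generate(prog, i)
            checked.append((proof, verify(prog, proof)))
        good = [p for p, v in checked if v.ok]
        bad = [(p, v.feedback) for p, v in checked if not v.ok]
        data.correct.extend((prog, p) for p in good)
        # Assumption: a rejected proof becomes a repair example only when
        # some verified proof exists for the same program.
        if good:
            data.repairs.extend((prog, p, fb, good[0]) for p, fb in bad)
    return data


def toy_generate(prog: str, i: int) -> str:
    """Stand-in for an LLM: every other sample omits the invariant."""
    if i % 2 == 0:
        return f"invariant i <= n  // proof for {prog}"
    return f"// empty proof for {prog}"


def toy_verify(prog: str, proof: str) -> Verdict:
    """Stand-in for a symbolic verifier: accepts proofs with an invariant."""
    if "invariant" in proof:
        return Verdict(ok=True)
    return Verdict(ok=False, feedback="error: loop invariant missing")


round_data = evolve_round(["sum", "max"], toy_generate, toy_verify)
print(len(round_data.correct), "verified proofs")
print(len(round_data.repairs), "repair examples")
```

In a full cycle, the two datasets would be used to fine-tune the model, and the fine-tuned model would serve as `generate` in the next round. The verifier alone decides what enters the training set, which is why no human labels are needed.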