🤖 AI Summary
Problem: Large language models (LLMs) are markedly less effective at generating formal verification proofs for Rust programs than at generating code.
Method: We propose the first LLM-driven, three-stage automated proof generation framework (draft generation, prompt-guided refinement, and error-driven debugging), built on a multi-agent architecture that tightly integrates feedback from the Verus verifier. Our approach combines Rust syntax-aware generation strategies with iterative prompting informed by verification feedback.
Contribution/Results: We introduce the first nontrivial Rust proof benchmark comprising 150 verification tasks. Experiments show our framework achieves >90% verification pass rate on this benchmark; over 50% of tasks yield correct, verifiable proofs within 30 seconds or ≤3 LLM invocations. This work substantially advances the practical applicability of LLMs in formal verification of systems programming languages.
📝 Abstract
Generative AI has shown its value for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLMs to automatically generate correctness proofs for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proofs for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.
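The three phases described above can be pictured as a simple control loop. The following Rust sketch is illustrative only and is not AutoVerus's implementation: the `Verifier` and `ProofAgent` traits, the `generate_proof` function, and the `ToyVerifier` and `ToyAgent` stand-ins are all hypothetical names introduced here. It assumes that the verifier returns either success or a list of error messages, and that each agent call counts as one LLM call.

```rust
/// Result of one verifier run (hypothetical; a real setup would invoke Verus
/// and parse its diagnostics).
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    Verified,
    Errors(Vec<String>),
}

/// Hypothetical interface to the verifier.
pub trait Verifier {
    fn check(&self, program: &str) -> Verdict;
}

/// Hypothetical interface to the LLM agents, one method per phase.
pub trait ProofAgent {
    fn draft(&mut self, code: &str) -> String;
    fn refine(&mut self, proof: &str, tip: &str) -> String;
    fn debug(&mut self, proof: &str, error: &str) -> String;
}

pub struct Outcome {
    pub proof: Option<String>, // Some(..) only if the verifier accepted it
    pub llm_calls: usize,
}

pub fn generate_proof<V: Verifier, A: ProofAgent>(
    verifier: &V,
    agent: &mut A,
    code: &str,
    tips: &[&str],
    max_debug_rounds: usize,
) -> Outcome {
    // Phase 1: preliminary proof generation.
    let mut proof = agent.draft(code);
    let mut llm_calls = 1;
    if verifier.check(&proof) == Verdict::Verified {
        return Outcome { proof: Some(proof), llm_calls };
    }
    // Phase 2: refinement guided by generic tips, re-checking after each one.
    for tip in tips {
        proof = agent.refine(&proof, tip);
        llm_calls += 1;
        if verifier.check(&proof) == Verdict::Verified {
            return Outcome { proof: Some(proof), llm_calls };
        }
    }
    // Phase 3: debugging guided by the first reported verification error.
    for _ in 0..max_debug_rounds {
        let errors = match verifier.check(&proof) {
            Verdict::Verified => return Outcome { proof: Some(proof), llm_calls },
            Verdict::Errors(errors) => errors,
        };
        match errors.first() {
            Some(error) => {
                proof = agent.debug(&proof, error);
                llm_calls += 1;
            }
            None => break,
        }
    }
    let verified = verifier.check(&proof) == Verdict::Verified;
    Outcome { proof: if verified { Some(proof) } else { None }, llm_calls }
}

/// Toy verifier: accepts any text that mentions an invariant and an assertion.
pub struct ToyVerifier;

impl Verifier for ToyVerifier {
    fn check(&self, program: &str) -> Verdict {
        let mut errors = Vec::new();
        if !program.contains("invariant") {
            errors.push("missing loop invariant".to_string());
        }
        if !program.contains("assert") {
            errors.push("postcondition not satisfied".to_string());
        }
        if errors.is_empty() { Verdict::Verified } else { Verdict::Errors(errors) }
    }
}

/// Toy agent: drafts an invariant, ignores tips, and patches reported errors.
pub struct ToyAgent;

impl ProofAgent for ToyAgent {
    fn draft(&mut self, code: &str) -> String {
        format!("{}\n// invariant i <= n", code)
    }
    fn refine(&mut self, proof: &str, _tip: &str) -> String {
        proof.to_string()
    }
    fn debug(&mut self, proof: &str, error: &str) -> String {
        if error.contains("invariant") {
            format!("{}\n// invariant i <= n", proof)
        } else {
            format!("{}\n// assert(result == expected)", proof)
        }
    }
}
```

Calling `generate_proof(&ToyVerifier, &mut ToyAgent, "fn f() {}", &["keep invariants simple"], 3)` walks through all three phases with the toy stand-ins. Returning as soon as the verifier accepts a proof is one way to keep the number of LLM calls low on easy tasks; whether AutoVerus structures its loop this way is not stated in the abstract.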