🤖 AI Summary
This work addresses the limited adoption of formal verification, which often stems from the need for expert-written annotations such as preconditions, postconditions, and loop invariants. To lower this barrier, the authors use large language models (LLMs) to automatically generate Dafny verification annotations from conventional code accompanied by natural-language comments, while assertions from test cases serve as static oracles to validate the generated pre/postconditions. The method iteratively refines candidate annotations over multiple rounds guided by verifier feedback, combining multi-model LLM collaboration with this closed-loop feedback mechanism. A VS Code plugin was developed to support practical deployment. Evaluated on 110 Dafny programs, the approach achieves 98.2% annotation correctness within at most eight repair iterations. The results indicate that proof-helper annotations remain a key source of difficulty for current LLMs, while user feedback on the plugin was notably positive.
📝 Abstract
Recent verification tools aim to make formal verification more accessible to software engineers by automating most of the verification process. However, annotating conventional programs with the formal specification and verification constructs (preconditions, postconditions, loop invariants, auxiliary predicates and functions, and proof helpers) required to prove their correctness still demands significant manual effort and expertise. This paper investigates how LLMs can automatically generate such annotations for programs written in Dafny, a verification-aware programming language, starting from conventional code accompanied by natural-language specifications (in comments) and test code. In experiments on 110 Dafny programs, a multi-model approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations, using verifier feedback. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs. Assertions in the test cases served as static oracles to automatically validate the generated pre/postconditions. We also compare generated and manual solutions and present a Visual Studio Code extension that integrates automatic generation into the IDE, with encouraging usability feedback.
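The generate-verify-repair loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `call_llm` and `run_verifier` are hypothetical stand-ins for the actual model API and the Dafny verifier invocation, and the 8-iteration cap mirrors the repair budget reported above.

```python
MAX_ITERATIONS = 8  # the paper reports success within at most 8 repair rounds


def annotate_with_repair(program_src, call_llm, run_verifier,
                         max_iters=MAX_ITERATIONS):
    """Ask an LLM for Dafny annotations, then iterate on verifier feedback.

    Returns (annotated_program, repair_rounds_used), or (None, max_iters)
    if the verifier never accepts a candidate.
    """
    # Initial generation from code + natural-language comments.
    prompt = ("Add preconditions, postconditions, loop invariants and "
              "proof helpers to this Dafny program:\n" + program_src)
    candidate = call_llm(prompt)
    for round_no in range(max_iters):
        ok, feedback = run_verifier(candidate)
        if ok:
            return candidate, round_no  # verified; rounds of repair used
        # Closed loop: feed the verifier errors back to the model.
        prompt = ("The Dafny verifier reported:\n" + feedback +
                  "\nRepair the annotations in:\n" + candidate)
        candidate = call_llm(prompt)
    return None, max_iters  # repair budget exhausted
```

In the paper's multi-model setting, `call_llm` would route between models (e.g. one generating, another repairing), and the test-case assertions would add a second acceptance check alongside `run_verifier` to catch pre/postconditions that verify but contradict the intended behavior.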