🤖 AI Summary
This work investigates how effective large language models (LLMs) are at assisting interactive theorem proving, particularly in real-world, industrial-scale formal verification tasks. Method: We conduct a systematic evaluation on two mature verification projects, hs-to-coq and Verdi, built on the Rocq proof assistant, combining quantitative metrics (success rate, proof length, error rate) with qualitative analysis (tactic reasonableness, technique reusability). Contribution/Results: LLMs generate concise, high-quality formal proofs in a classical proof style and scale from small to large proofs. Performance depends critically on the quality of external dependency information and contextual modeling, and varies markedly across projects. Occasional anomalous errors occur but are substantially mitigated by context enhancement. To our knowledge, this is the first empirical study to characterize LLM capabilities and key limiting factors, such as dependency awareness and context fidelity, in realistic, production-grade formal verification settings, providing foundational evidence and practical guidance for LLM-augmented trustworthy software verification.
📝 Abstract
Large language models (LLMs) can potentially help with verification using proof assistants by automating proofs. However, it is unclear how effective LLMs are at this task. In this paper, we perform a case study based on two mature Rocq projects: the hs-to-coq tool and Verdi. We evaluate the effectiveness of LLMs in generating proofs through both quantitative and qualitative analysis. Our study finds that: (1) external dependencies and context in the same source file can significantly help proof generation; (2) LLMs perform well on small proofs and can also generate large ones; (3) LLM performance varies across verification projects; and (4) LLMs can generate concise and clever proofs and apply classical techniques to new definitions, but can also make odd mistakes.
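To illustrate the kind of concise, classically styled proof the abstract refers to, here is a hypothetical Rocq lemma (not taken from hs-to-coq or Verdi) whose proof reuses standard library lemmas in a short tactic script, the pattern the study observes LLMs producing:

```coq
(* Hypothetical illustrative lemma; the names app_length and Nat.add_comm
   come from the Rocq/Coq standard library. *)
Require Import List Arith.
Import ListNotations.

Lemma app_length_comm : forall (A : Type) (xs ys : list A),
  length (xs ++ ys) = length (ys ++ xs).
Proof.
  intros A xs ys.
  rewrite !app_length.   (* reduce both sides to sums of lengths *)
  apply Nat.add_comm.    (* close the goal with commutativity of + *)
Qed.
```

A proof like this depends on knowing which library lemmas exist (`app_length`, `Nat.add_comm`), which is exactly the dependency-awareness factor the study identifies as critical for LLM performance.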