A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the effectiveness of large language models (LLMs) in assisting interactive theorem proving, particularly within real-world, industrial-scale formal verification tasks. Method: We conduct a systematic evaluation on two authentic verification projects—hs-to-coq and Verdi—using the Rocq proof assistant, employing both quantitative metrics (success rate, proof length, error rate) and qualitative analysis (tactic reasonableness, technical reusability). Contribution/Results: LLMs demonstrate strong capability in generating concise, high-quality formal proofs aligned with classical proof styles, scaling effectively from small to large proofs. Performance critically depends on external dependency information and contextual modeling quality, exhibiting marked heterogeneity across projects. While rare anomalous errors occur, they are substantially mitigated via context enhancement. To our knowledge, this is the first empirical study to characterize LLM capabilities and key limiting factors—such as dependency awareness and context fidelity—in realistic, production-grade formal verification settings, thereby providing foundational evidence and practical guidance for LLM-augmented trustworthy software verification.

📝 Abstract
Large language models (LLMs) can potentially help with verification using proof assistants by automating proofs. However, it is unclear how effective LLMs are in this task. In this paper, we perform a case study based on two mature Rocq projects: the hs-to-coq tool and Verdi. We evaluate the effectiveness of LLMs in generating proofs by both quantitative and qualitative analysis. Our study finds that: (1) external dependencies and context in the same source file can significantly help proof generation; (2) LLMs perform well on small proofs but can also generate large proofs; (3) LLMs perform differently on different verification projects; and (4) LLMs can generate concise and smart proofs, apply classical techniques to new definitions, but can also make odd mistakes.
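To give a sense of the kind of small Rocq/Coq proofs the study evaluates, here is a hypothetical example (not taken from hs-to-coq or Verdi) where a classical technique, induction on lists, discharges a simple lemma in a few tactics:

```coq
(* Hypothetical illustration of a small, classically styled proof;
   the lemma and proof are not drawn from the evaluated projects. *)
Require Import Coq.Lists.List.
Import ListNotations.

Lemma app_nil_r' : forall (A : Type) (l : list A), l ++ [] = l.
Proof.
  induction l as [| x xs IH].
  - reflexivity.                 (* base case: [] ++ [] = [] *)
  - simpl. rewrite IH. reflexivity.  (* inductive step uses the hypothesis *)
Qed.
```

Proofs of this shape (a short induction with `simpl`/`rewrite`/`reflexivity`) are representative of the "small proofs" regime where the abstract reports LLMs perform well.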
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM effectiveness in proof assistant verification
Assessing LLM performance across different verification projects
Identifying factors influencing LLM proof generation success
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate proof generation in verification
External dependencies and context aid proof generation
LLMs perform variably across different verification projects
Barış Bayazıt
University of Toronto
Yao Li
Portland State University
Xujie Si
University of Toronto & Mila
Programming Languages · Automated Reasoning · AI4Math · XAI · AI Alignment