🤖 AI Summary
This work investigates whether large language models (LLMs) can autonomously generate separation logic specifications for C programs that are verifiable by VeriFast, thereby improving the efficiency and practicality of static program verification. We conduct the first systematic evaluation of GPT-4o's ability to produce ownership-aware, verifiable separation logic assertions, using both zero-shot and chain-of-thought prompting, across diverse C program inputs. Results show that while the generated specifications generally preserve functional fidelity, only a small fraction pass VeriFast verification, and those that do often contain redundant assertions. Our key contribution is exposing a critical disconnect between *semantic correctness* and *formal verifiability* in LLM-generated specifications. We also establish the first empirical benchmark and failure-mode analysis for LLM-based specification generation targeting separation logic, providing foundational insights for future research at the intersection of LLMs and formal methods.
📝 Abstract
Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap-manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can generate specifications based on an ownership logic, such as separation logic. To address this gap, this paper explores the effectiveness of OpenAI's GPT-4o model in generating specifications for C programs that are verifiable with VeriFast, a separation logic based static verifier. Our experiment employs three different types of user inputs as well as basic and Chain-of-Thought (CoT) prompting to assess GPT-4o's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior but are rarely verifiable, and those that are verifiable contain redundancies. We discuss future directions for improving performance.
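To make the task concrete, here is a minimal, hypothetical illustration (not taken from the paper) of the kind of VeriFast specification the study asks GPT-4o to generate. The `//@` lines are separation logic annotations that VeriFast checks statically; at runtime they are ordinary comments. `integer(p, ?v)` claims ownership of the int cell at `p` holding value `v`, and `&*&` is the separating conjunction.

```c
#include <limits.h>

/* Hypothetical VeriFast-annotated function: the precondition demands
 * ownership of the cell at p (bound to ?v) and rules out overflow;
 * the postcondition returns ownership with the updated value. */
void increment(int *p)
//@ requires integer(p, ?v) &*& v < INT_MAX;
//@ ensures integer(p, v + 1);
{
    *p = *p + 1;
}
```

A specification like this can be semantically faithful to the function's behavior yet still fail verification if, for example, it omits the `integer` ownership chunk or the overflow bound, which is the gap between semantic correctness and formal verifiability that the experiment measures.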