🤖 AI Summary
This work presents the first systematic investigation into the capability of large language models (LLMs) to generate program specifications involving higher-order logical constructs, which are essential for expressing complex verification properties yet remain beyond the reach of existing LLM-based approaches, which predominantly handle basic syntactic constructs. The authors design four syntactic configurations spanning different levels of abstraction and establish a comprehensive evaluation framework to assess a range of representative LLMs on standard verification benchmarks. Experimental results demonstrate that LLMs can effectively produce valid higher-order logical expressions; moreover, combining logical constructs with basic syntax significantly enhances verification capability and robustness without substantially increasing verification overhead. The study also reveals the distinct advantages of two refinement paradigms in specification generation.
📝 Abstract
Formal specifications play a pivotal role in accurately characterizing program behaviors and ensuring software correctness. In recent years, leveraging large language models (LLMs) for the automatic generation of program specifications has emerged as a promising avenue for enhancing verification efficiency. However, existing research has been predominantly confined to generating specifications based on basic syntactic constructs, falling short of meeting the demands for high-level abstraction in complex program verification. To address this, we propose incorporating logical constructs into existing LLM-based specification generation frameworks. Nevertheless, there remains a lack of systematic investigation into whether LLMs can effectively generate such complex constructs. To this end, we conduct an empirical study exploring the impact of various types of syntactic constructs on specification generation. Specifically, we define four syntactic configurations with varying levels of abstraction and perform extensive evaluations on mainstream program verification datasets, employing a diverse set of representative LLMs. Experimental results first confirm that LLMs are capable of generating valid logical constructs. Further analysis reveals that the synergistic use of logical constructs and basic syntactic constructs improves both verification capability and robustness, without significantly increasing verification overhead. Additionally, we uncover the distinct advantages of two refinement paradigms. To the best of our knowledge, this is the first systematic work exploring the feasibility of utilizing LLMs to generate high-level logical constructs, providing an empirical basis and guidance for the future construction of automated program verification frameworks with enhanced abstraction capabilities.