🤖 AI Summary
This work addresses the underexplored challenge of automatically generating formal specifications, expressed in first-order logic, from software comments and documentation using large language models (LLMs).
Method: We conduct the first systematic evaluation of 13 state-of-the-art LLMs (e.g., Codex, Llama, PaLM) against traditional approaches on three public benchmarks under few-shot settings. We introduce a cross-model failure diagnosis framework, establish a reproducible evaluation benchmark, and propose a taxonomy of failure modes.
Contribution/Results: Experiments reveal that certain LLMs achieve performance comparable to or exceeding traditional tools in specific scenarios; however, semantic abstraction, context sensitivity, and logical rigor remain critical bottlenecks. Our analysis uncovers complementary strengths between LLMs and classical methods, providing empirical foundations and concrete directions for advancing LLM-augmented formal methods. The benchmark, taxonomy, and diagnostic framework are publicly released to support reproducible research.
📝 Abstract
Software specifications are essential for many Software Engineering (SE) tasks such as bug detection and test generation. Many approaches have been proposed to translate specifications written in natural language (e.g., comments) into a formal, machine-readable form (e.g., first-order logic). However, existing approaches suffer from limited generalizability and require manual effort. The recent emergence of Large Language Models (LLMs), which have been successfully applied to numerous SE tasks, offers a promising avenue for automating this process. In this paper, we conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation. We evaluate LLMs' performance with Few-Shot Learning (FSL) and compare 13 state-of-the-art LLMs against traditional approaches on three public datasets. In addition, we conduct a comparative diagnosis of the failure cases from both LLMs and traditional methods, identifying their unique strengths and weaknesses. Our study offers valuable insights for future research to improve specification generation.
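To make the few-shot setup concrete, the following is a minimal sketch of how a prompt for comment-to-specification translation could be assembled. The demonstration pairs, prompt wording, and `build_prompt` helper are hypothetical illustrations, not the paper's actual benchmark data or prompting code.

```python
# Hypothetical few-shot prompt construction: a few (comment, specification)
# demonstration pairs are concatenated before the query comment, and the
# model is expected to complete the final "Specification:" line.
# The example pairs below are invented for illustration.

FEW_SHOT_EXAMPLES = [
    (
        "@throws NullPointerException if the key is null",
        "key == null ==> throws(NullPointerException)",
    ),
    (
        "@return true if the list contains the element",
        "result == true <==> contains(list, element)",
    ),
]

def build_prompt(query_comment: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt mapping comments to formal specifications."""
    parts = ["Translate each comment into a first-order-logic specification.\n"]
    for comment, spec in examples:
        parts.append(f"Comment: {comment}\nSpecification: {spec}\n")
    # The query comment is appended last, leaving the specification blank
    # for the model to fill in.
    parts.append(f"Comment: {query_comment}\nSpecification:")
    return "\n".join(parts)

prompt = build_prompt("@throws IllegalArgumentException if size is negative")
print(prompt)
```

The resulting string would be sent to an LLM as-is; in the study's setting, the number and choice of demonstration pairs are the main few-shot knobs being evaluated.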