🤖 AI Summary
When validating formal specifications, manually crafting test cases is time-consuming, error-prone, and often neglected. This paper introduces the first large language model (LLM)-based automated test generation method for Alloy specifications, leveraging state-of-the-art models (e.g., GPT-5) to synthesize structured positive and negative test cases directly from natural-language requirements. The approach requires no manual prompt engineering or fine-tuning and natively supports dual validation (syntactic correctness and logical validity) against the Alloy Analyzer. Empirical evaluation demonstrates that GPT-5-generated test cases achieve high parsing and execution success rates in Alloy and effectively detect numerous bugs in manually written specifications, substantially improving specification reliability and development efficiency. This work provides the first systematic investigation into the capabilities and limitations of cutting-edge LLMs for test generation in formal verification, establishing a foundational framework for LLM-augmented formal methods.
📝 Abstract
Validation is a central activity when developing formal specifications. As in coding, one possible validation technique is to define, up front, test cases or scenarios that a future specification should or should not satisfy. Unfortunately, specifying such test cases is burdensome and error-prone, which can lead users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-the-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or do not satisfy) the given requirement, and that can detect many wrong specifications written by humans.
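To make the idea of positive and negative test cases concrete, here is a minimal sketch in Python rather than Alloy. It is an illustrative analogue only, not the paper's method or tooling: the file/directory domain, the requirement "every file is contained in exactly one directory", and all names (`Instance`, `correct_spec`, `wrong_spec`, `passes_tests`) are assumptions introduced for this example. A positive test is an instance the requirement should accept, a negative test is one it should reject, and a candidate specification passes only if it agrees with both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A tiny hypothetical domain-model instance."""
    files: frozenset
    dirs: frozenset
    contains: frozenset  # set of (directory, file) pairs


def correct_spec(inst: Instance) -> bool:
    """Requirement: every file is contained in exactly one directory."""
    return all(
        sum(1 for (_, f) in inst.contains if f == file) == 1
        for file in inst.files
    )


def wrong_spec(inst: Instance) -> bool:
    """A plausible mistake: 'at least one' instead of 'exactly one'."""
    return all(
        any(f == file for (_, f) in inst.contains)
        for file in inst.files
    )


# Positive test case: one file in one directory (should be accepted).
positive = Instance(
    files=frozenset({"f1"}),
    dirs=frozenset({"d1"}),
    contains=frozenset({("d1", "f1")}),
)

# Negative test case: the same file in two directories (should be rejected).
negative = Instance(
    files=frozenset({"f1"}),
    dirs=frozenset({"d1", "d2"}),
    contains=frozenset({("d1", "f1"), ("d2", "f1")}),
)

# Each test pairs an instance with the verdict the requirement demands.
TESTS = [(positive, True), (negative, False)]


def passes_tests(spec) -> bool:
    """A specification passes if it agrees with every expected verdict."""
    return all(spec(inst) == expected for inst, expected in TESTS)


if __name__ == "__main__":
    print("correct_spec passes:", passes_tests(correct_spec))
    print("wrong_spec passes:", passes_tests(wrong_spec))
```

In this sketch the negative test is what exposes the faulty specification: `wrong_spec` accepts the positive instance but also accepts the instance with a doubly contained file, so it disagrees with the expected verdict. In Alloy, the analogous check would be expressed as commands run by the Alloy Analyzer rather than as Python predicates.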