Validating Formal Specifications with LLM-generated Test Cases

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

When validating formal specifications, manually crafting test cases is time-consuming and error-prone, so the task is often skipped. This paper presents an LLM-based method for automatically generating test cases for Alloy specifications, described as the first of its kind. It uses state-of-the-art models (e.g., GPT-5) to synthesize structured positive and negative test cases directly from natural-language requirements. The approach relies on pre-trained models without fine-tuning, and each generated test case is checked in two ways against the Alloy analyzer: for syntactic correctness and for logical validity with respect to the requirement. The empirical evaluation shows that GPT-5-generated test cases parse and execute in Alloy at a high rate and detect many bugs in manually written specifications, which can improve both specification reliability and development efficiency. The work offers a systematic investigation of the capabilities and limitations of current LLMs for test generation in formal specification, and a starting point for LLM-augmented formal methods.

📝 Abstract

Validation is a central activity when developing formal specifications. As in coding, one possible validation technique is to define upfront test cases or scenarios that a future specification should or should not satisfy. Unfortunately, specifying such test cases is burdensome and error-prone, which can lead users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-the-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.
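The idea of positive and negative test cases can be illustrated with a minimal Python sketch. This is not the paper's method and does not use Alloy: the requirement ("every file is in exactly one directory"), the `Instance` and `TestCase` types, and both specification functions are hypothetical, chosen only to show how a test suite can expose a wrong specification.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Instance:
    """A tiny relational structure (hypothetical, for illustration only)."""
    files: FrozenSet[str]
    dirs: FrozenSet[str]
    contains: FrozenSet[Tuple[str, str]]  # (directory, file) pairs


@dataclass(frozen=True)
class TestCase:
    instance: Instance
    expected: bool  # True = positive test, False = negative test


Spec = Callable[[Instance], bool]


def parents(inst: Instance, file: str) -> int:
    """Count the directories that contain the given file."""
    return sum(1 for (_, f) in inst.contains if f == file)


def correct_spec(inst: Instance) -> bool:
    # Assumed requirement: every file is in exactly one directory.
    return all(parents(inst, f) == 1 for f in inst.files)


def buggy_spec(inst: Instance) -> bool:
    # A plausible human mistake: "at most one" instead of "exactly one".
    return all(parents(inst, f) <= 1 for f in inst.files)


def failing_tests(spec: Spec, tests: List[TestCase]) -> List[TestCase]:
    """Return the tests on which the spec disagrees with the expected verdict."""
    return [t for t in tests if spec(t.instance) != t.expected]


def make(files, dirs, contains, expected) -> TestCase:
    return TestCase(
        Instance(frozenset(files), frozenset(dirs), frozenset(contains)),
        expected,
    )


tests = [
    # Positive: the single file sits in exactly one directory.
    make({"f1"}, {"d1"}, {("d1", "f1")}, True),
    # Negative: the file is shared by two directories.
    make({"f1"}, {"d1", "d2"}, {("d1", "f1"), ("d2", "f1")}, False),
    # Negative: the file is in no directory at all.
    make({"f1"}, {"d1"}, set(), False),
]

correct_failures = failing_tests(correct_spec, tests)
buggy_failures = failing_tests(buggy_spec, tests)
```

Here the buggy specification passes the positive test and the first negative test, and only the orphan-file negative test reveals the mistake, which is why both kinds of test cases matter.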

Problem

Research questions and friction points this paper is trying to address.

Automating test case generation from natural language requirements

Validating formal specifications using LLM-generated test cases

Evaluating GPT-5 effectiveness for Alloy specification testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate test case generation from requirements

GPT-5 generates correct positive and negative test cases

Generated test cases detect errors in human specifications

🔎 Similar Papers

SpecGen: Automated Generation of Formal Program Specifications via Large Language Models