From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
The technology-neutral language of legal texts makes it difficult for software engineers to translate them efficiently and accurately into executable compliance requirements; manual translation is time-consuming, error-prone, and heavily reliant on domain experts. Method: This paper proposes an automated approach that uses large language models (LLMs), specifically Claude and Llama, to map food-safety regulations into Gherkin-formatted behavioral specifications, enabling compliance engineering and testing through Behavior-Driven Development (BDD). Contribution/Results: Through a human-centered quasi-experiment, the first systematic evaluation of LLM-generated behavioral specifications, the study reports high ratings for relevance, clarity, and completeness. Empirical results show a significant reduction in requirement-translation time and broad developer acceptance of the method's engineering utility. The core contribution is an end-to-end generative framework that bridges regulatory texts and testable behavioral specifications, empirically validated for effectiveness and feasibility.

📝 Abstract
Context: Laws and regulations increasingly affect software design and quality assurance, but legal texts are written in technology-neutral language. This creates challenges for engineers who must develop compliance artifacts such as requirements and acceptance criteria. Manual creation is labor-intensive, error-prone, and requires domain expertise. Advances in Generative AI (GenAI), especially Large Language Models (LLMs), offer a way to automate deriving such artifacts. Objective: We present the first systematic human-subject study of LLMs' ability to derive behavioral specifications from legal texts using a quasi-experimental design. These specifications translate legal requirements into a developer-friendly form. Methods: Ten participants evaluated specifications generated from food-safety regulations by Claude and Llama. Using Gherkin, a structured BDD language, 60 specifications were produced. Each participant assessed 12 across five criteria: Relevance, Clarity, Completeness, Singularity, and Time Savings. Each specification was reviewed by two participants, yielding 120 assessments. Results: For Relevance, 75% of ratings were highest and 20% second-highest. Clarity reached 90% highest. Completeness: 75% highest, 19% second. Singularity: 82% highest, 12% second. Time Savings: 68% highest, 24% second. No lowest ratings occurred. Mann-Whitney U tests showed no significant differences across participants or models. Llama slightly outperformed Claude in Clarity, Completeness, and Time Savings, while Claude was stronger in Singularity. Feedback noted hallucinations and omissions but confirmed the utility of the specifications. Conclusion: LLMs can generate high-quality Gherkin specifications from legal texts, reducing manual effort and providing structured artifacts useful for implementation, assurance, and test generation.
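The abstract does not reproduce a generated specification, but a Gherkin scenario derived from a food-safety rule might look like the following. This is a hypothetical illustration: the regulation clause, temperature threshold, and wording are invented, not taken from the study.

```gherkin
Feature: Cold-chain temperature compliance
  # Hypothetical clause: chilled food must be stored at or below 5 °C

  Scenario: Refrigerated storage exceeds the permitted temperature
    Given a batch of chilled food is held in refrigerated storage
    When the storage temperature rises above 5 degrees Celsius
    Then the batch must be flagged as non-compliant
    And a corrective action must be recorded
```

The Given/When/Then structure is what makes such specifications developer-friendly: each scenario can be wired to automated acceptance tests via standard BDD tooling.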
Problem

Research questions and friction points this paper is trying to address.

Automating legal compliance artifact generation from regulations
Evaluating LLM-generated behavioral specifications quality
Translating legal texts into developer-friendly Gherkin language
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate Gherkin from legal texts
Human evaluation confirms high specification quality
Automated translation reduces manual compliance effort
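One of the paper's rating criteria, Singularity, asks whether each scenario expresses a single behavior. That property can be approximated with a simple structural check over the Gherkin text. The sketch below is our illustrative heuristic (one `When` step per scenario), not the evaluation procedure used in the study; the function name and sample spec are invented.

```python
def check_scenario_singularity(gherkin_text: str) -> list[tuple[str, bool]]:
    """Heuristic: treat a scenario as 'singular' if it has exactly one When step.

    Returns (scenario name, is_singular) pairs. An illustrative approximation
    of the paper's Singularity criterion, not its actual method.
    """
    results: list[tuple[str, bool]] = []
    current_name, when_count = None, 0
    for line in gherkin_text.splitlines():
        line = line.strip()
        if line.startswith("Scenario:"):
            # Close out the previous scenario before starting a new one.
            if current_name is not None:
                results.append((current_name, when_count == 1))
            current_name = line.removeprefix("Scenario:").strip()
            when_count = 0
        elif line.startswith("When "):
            when_count += 1
    if current_name is not None:
        results.append((current_name, when_count == 1))
    return results

spec = """\
Scenario: Temperature limit exceeded
  Given chilled food in storage
  When the temperature rises above 5 degrees Celsius
  Then the batch is flagged as non-compliant

Scenario: Combined check
  Given chilled food in storage
  When the temperature rises above 5 degrees Celsius
  When the humidity exceeds the permitted range
  Then the batch is flagged as non-compliant
"""

print(check_scenario_singularity(spec))
# → [('Temperature limit exceeded', True), ('Combined check', False)]
```

Checks like this could complement, but not replace, the human ratings the study relies on: structural singularity is machine-checkable, while relevance and completeness require domain judgment.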