🤖 AI Summary
A semantic gap persists between food-safety regulations and the software systems that implement them, hindering the automated translation of legal provisions into executable software requirements.
Method: Drawing on a Grounded Theory analysis of legal texts, we propose the first regulatory-compliance-oriented conceptual framework for food safety, systematically linking legal provisions to software requirements. Leveraging BERT and GPT-series models, we empirically evaluate clause-level classification of provisions against the framework's requirements-related concepts. We further compare few-shot prompting with supervised fine-tuning, examining both cross-jurisdictional (Canada-to-U.S.) generalization and the trade-off between the two adaptation strategies.
Contribution/Results: Fine-tuned GPT-4o achieves 89% average precision and 87% average recall; few-shot prompting raises recall to 97% at the cost of precision (65%). Both approaches significantly outperform LSTM-based and keyword-extraction baselines. The framework advances automated derivation of software requirements from regulations, and the findings offer empirical guidance on LLM adaptation strategies for legal compliance engineering.
📝 Abstract
As Industry 4.0 transforms the food industry, the role of software in achieving compliance with food-safety regulations is becoming increasingly critical. Food-safety regulations, like those in many legal domains, have largely been articulated in a technology-independent manner to ensure their longevity and broad applicability. However, this approach leaves a gap between the regulations and the modern systems and software increasingly used to implement them. In this article, we pursue two main goals. First, we conduct a Grounded Theory study of food-safety regulations and develop a conceptual characterization of food-safety concepts that closely relate to systems and software requirements. Second, we examine the effectiveness of two families of large language models (LLMs) -- BERT and GPT -- in automatically classifying legal provisions based on requirements-related food-safety concepts. Our results show that: (a) when fine-tuned, the accuracy differences between the best-performing models in the BERT and GPT families are relatively small. Nevertheless, the most powerful model in our experiments, GPT-4o, still achieves the highest accuracy, with an average Precision of 89% and an average Recall of 87%; (b) few-shot learning with GPT-4o increases Recall to 97% but decreases Precision to 65%, suggesting a trade-off between fine-tuning and few-shot learning; (c) despite our training examples being drawn exclusively from Canadian regulations, LLM-based classification performs consistently well on test provisions from the US, indicating a degree of generalizability across regulatory jurisdictions; and (d) for our classification task, LLMs significantly outperform simpler baselines constructed using long short-term memory (LSTM) networks and automatic keyword extraction.
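The few-shot setup described above amounts to assembling a prompt from a handful of labeled provisions before querying the model. The sketch below illustrates that prompt-assembly step only; the example clauses, concept labels, and function names are illustrative assumptions, not the authors' actual dataset or pipeline, and the resulting prompt would be sent to a GPT-style completion API.

```python
# Minimal sketch of few-shot prompt assembly for clause-level
# classification of food-safety provisions. Example clauses and
# concept labels are hypothetical, not drawn from the paper.

FEW_SHOT_EXAMPLES = [
    ("Every operator shall maintain records of sanitation activities.",
     "record-keeping"),
    ("Food contact surfaces must be cleaned before each use.",
     "sanitation"),
]

CONCEPTS = ["record-keeping", "sanitation", "traceability", "none"]

def build_few_shot_prompt(provision: str) -> str:
    """Assemble a few-shot classification prompt for one provision."""
    lines = [
        "Classify each legal provision by the requirements-related "
        f"food-safety concept it expresses ({', '.join(CONCEPTS)}).",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Provision: {text}")
        lines.append(f"Concept: {label}")
        lines.append("")
    # The unlabeled target provision goes last; the model is expected
    # to complete the final "Concept:" line with one label.
    lines.append(f"Provision: {provision}")
    lines.append("Concept:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Each lot of imported produce shall be traceable to its supplier.")
print(prompt)
```

In a fine-tuning setup, the same (provision, concept) pairs would instead be supplied as training examples, and the deployed model would be queried with the provision alone.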