AI Summary
This paper addresses the high manual effort and poor generalizability involved in establishing traceability between software requirements and legal regulations (e.g., GDPR). We propose automated legal traceability solutions and conduct the first systematic comparison of a classification-based approach (Kashif, built on fine-tuned Sentence-BERT) and a generation-based approach (Rice, a prompt engineering framework applied to a recent generative large language model) on legal traceability tasks. The results demonstrate that the generation-based approach substantially outperforms the fine-tuned classifier: Kashif achieves 67% recall on the benchmark dataset (a 54-percentage-point improvement over the baseline) but drops to 15% recall on real-world GDPR documents, whereas the Rice-based solution attains 84% recall on those same documents, 69 percentage points higher than Kashif. Our core contribution is empirical evidence that prompt-engineering-driven generative approaches exhibit superior zero-shot generalization and greater practical deployability for legal traceability tasks.
Abstract
New regulations are continuously introduced to ensure that software development complies with ethical concerns and prioritizes public safety. A prerequisite for demonstrating compliance is tracing software requirements to legal provisions. Requirements traceability is a fundamental task in which requirements engineers must analyze technical requirements against target artifacts, often under a limited time budget. Performing this analysis manually for complex systems with hundreds of requirements is infeasible. The legal dimension introduces additional challenges that further exacerbate the manual effort. In this paper, we investigate two automated solutions based on large language models (LLMs) to predict trace links between requirements and legal provisions. The first solution, Kashif, is a classifier that leverages sentence transformers. The second solution prompts a recent generative LLM following Rice, a prompt engineering framework. On a benchmark dataset, we empirically evaluate Kashif and compare it against a baseline classifier from the literature. Kashif identifies trace links with an average recall of ~67%, outperforming the baseline with a substantial gain of 54 percentage points (pp) in recall. However, on unseen, more complex requirements documents traced to the European General Data Protection Regulation (GDPR), Kashif performs poorly, yielding an average recall of only 15%. On the same documents, our Rice-based solution yields an average recall of 84%, a remarkable gain of about 69 pp over Kashif. Our results suggest that requirements traceability in the legal context cannot be addressed simply by building classifiers, since such solutions do not generalize and fail to perform well on complex regulations and requirements. Resorting to generative LLMs, with careful prompt engineering, is thus a more promising alternative.
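To make the classification-based setup concrete, below is a minimal sketch of how a similarity-based trace-link classifier operates: embed a requirement and candidate legal provisions, then flag pairs whose cosine similarity clears a decision threshold. This is an illustration only; the actual Kashif system uses a fine-tuned Sentence-BERT model, and the vectors, provision names, and threshold here are hypothetical placeholders.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence-transformer embeddings (illustrative,
# not produced by any real model).
requirement_vec = [0.8, 0.1, 0.3]
provisions = {
    "GDPR Art. 5(1)(e) storage limitation": [0.7, 0.2, 0.4],
    "GDPR Art. 32 security of processing": [0.1, 0.9, 0.1],
}

THRESHOLD = 0.9  # hypothetical decision boundary learned during fine-tuning

# Predict a trace link wherever similarity clears the threshold.
links = {name: cosine(requirement_vec, vec) >= THRESHOLD
         for name, vec in provisions.items()}
print(links)
```

The generation-based alternative replaces this fixed decision boundary with a prompted LLM that judges each requirement-provision pair, which is what the paper finds generalizes better to unseen, complex documents.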