🤖 AI Summary
This study investigates how requirements smells, such as ambiguity and inconsistency in requirements specifications, affect the performance of large language models (LLMs) in requirement-to-code traceability tasks. Method: We conduct controlled experiments on two state-of-the-art LLMs, integrating established software engineering techniques for requirements smell detection and traceability evaluation. Contribution/Results: Our empirical analysis reveals, for the first time, that requirements smells have a small but statistically significant (p < 0.05) negative effect on LLM accuracy in existence judgment tasks (i.e., determining whether a requirement is implemented in a given piece of code); however, they have no statistically significant effect on line-level traceability. This indicates that the impact of requirements quality on LLM behavior is task-dependent. The findings provide empirical evidence to inform prompt engineering and requirements quality assurance practices in AI-driven software engineering.
📝 Abstract
Large language models (LLMs) are increasingly used to generate software artifacts, such as source code, tests, and trace links. Requirements play a central role in shaping the input prompts that guide LLMs, as they are often included in the prompts used to synthesize these artifacts. However, the impact of requirements formulation on LLM performance remains unclear. In this paper, we investigate the role of requirements smells (indicators of potential issues such as ambiguity and inconsistency) when used in prompts for LLMs. We conducted experiments with two LLMs, focusing on automated trace link generation between requirements and code. Our results show mixed outcomes: while requirements smells had a small but significant effect when predicting whether a requirement was implemented in a piece of code (i.e., whether a trace link exists), no significant effect was observed when tracing requirements to the associated lines of code. These findings suggest that requirements smells can affect LLM performance in certain SE tasks but may not affect all tasks uniformly. We highlight the need for further research to understand these nuances and propose future work toward developing guidelines for mitigating the negative effects of requirements smells in AI-driven SE processes.