🤖 AI Summary
This study addresses the inefficiency of traditional requirements engineering approaches that rely on large amounts of manually labeled training data to extract usability requirements from user reviews. To overcome this limitation, the authors propose a prompt engineering method that uses large language models (LLMs) guided by coding guidelines derived from Nielsen’s ten usability heuristics. They develop an initial prompt for filtering usability-relevant reviews over two prompt engineering iterations, and construct a fully coded dataset of 300 user reviews from three types of apps, labeled by two human raters and by an LLM. The results indicate that LLMs are generally able to identify usability as a non-functional requirement in user reviews, as measured by F-score, which supports the feasibility of a quick and low-cost workflow. Performance and reliability, however, depend strongly on the prompt design.
📝 Abstract
User-centered approaches to requirements engineering generally lead to products that are better suited to their end users. LLM4RE (large language models for requirements engineering) provides promising approaches to support the requirements elicitation process (e.g., classification of requirements). Previous approaches focus on Machine-Learning (ML) or Deep-Learning (DL) techniques, which require intensive training with a large amount of manually labeled data. LLMs, on the other hand, are pre-trained on large amounts of user-generated text data, enabling a user-centric workflow for analyzing requirements. In this paper, we explore whether the improved natural language understanding of LLMs, rather than strict ML classification, can be combined with the mass extraction of user reviews, and we analyze whether the performance of LLMs in understanding user reviews is comparable to the performance of human raters. This would enable a quick and cheap workflow for development teams to gather and process their users' requirements. This paper provides three major contributions: (1) We provide a completely coded dataset of 300 user reviews containing usability-relevant aspects from three different types of apps, which were labeled by two human raters and by an LLM. (2) We build an initial prompt, based on two prompt engineering iterations and specifically developed coding guidelines derived from the 10 Nielsen Usability Heuristics, for LLMs to filter usability-relevant user reviews. (3) We determine that LLMs are generally able to recognize usability as a non-functional requirement in user reviews, in terms of their F-score, but their performance and reliability are strongly dependent on the prompt.
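To make the evaluation criterion concrete, the following is a minimal sketch of how an F-score could be computed when LLM labels are compared against human labels. It assumes a binary labeling (usability-relevant or not) and treats the human labels as the reference. The function name `f_score` and the toy label lists are hypothetical illustrations and are not taken from the paper or its dataset.

```python
from typing import Sequence


def f_score(reference: Sequence[bool], predicted: Sequence[bool], beta: float = 1.0) -> float:
    """F-beta score of predicted labels against reference labels.

    True means "usability-relevant". With beta = 1 this is the usual F1,
    the harmonic mean of precision and recall.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # both say relevant
    fp = sum(1 for r, p in pairs if not r and p)      # only the prediction says relevant
    fn = sum(1 for r, p in pairs if r and not p)      # only the reference says relevant

    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy labels for six reviews (assumption, not data from the paper).
human_labels = [True, True, False, True, False, False]
llm_labels = [True, False, False, True, True, False]

score = f_score(human_labels, llm_labels)
print(f"F1 of LLM labels against human labels: {score:.3f}")
```

With two human raters, one option is to compute the same score between the two raters as a baseline, so that the LLM's agreement with a human can be read against human-to-human agreement. Whether the paper proceeds this way is not stated in the abstract.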