From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of leveraging unstructured and noisy app store reviews for agile software development by systematically evaluating the capability of leading large language models (LLMs) to automatically generate high-quality user stories from real-world health-related app reviews. Employing zero-shot to two-shot prompting strategies, the generated user stories are rigorously assessed through human evaluation and a RoBERTa-based classifier fine-tuned on the UStAI benchmark. Results demonstrate that, under few-shot settings, models such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct produce user stories whose fluency and adherence to standard formatting meet or exceed human performance, offering developers an efficient and actionable pathway for requirement elicitation. However, the study also reveals persistent limitations in ensuring story independence and uniqueness.
📝 Abstract
App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
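The prompting setup described above (zero-shot through two-shot) can be sketched as follows. This is a minimal illustration, not the authors' actual prompt: the example reviews and user stories are invented for demonstration and are not drawn from the Mini-BAR dataset.

```python
# Hedged sketch of assembling a zero-, one-, or two-shot prompt that asks an
# LLM to convert a raw app review into a standard-format user story.
# The (review, story) pairs below are hypothetical placeholders.

FEW_SHOT_EXAMPLES = [
    (
        "The app keeps logging me out every time I switch tabs.",
        "As a user, I want to stay logged in while navigating the app, "
        "so that I do not lose my session when switching tabs.",
    ),
    (
        "Please add a dark mode, the white screen hurts at night.",
        "As a user, I want a dark mode option, "
        "so that I can use the app comfortably at night.",
    ),
]

def build_prompt(review: str, shots: int = 2) -> str:
    """Build a prompt with 0, 1, or 2 in-context examples."""
    parts = [
        "Convert the app review into a user story of the form "
        "'As a <role>, I want <goal>, so that <benefit>.'"
    ]
    # Prepend the requested number of worked examples (few-shot setting).
    for example_review, example_story in FEW_SHOT_EXAMPLES[:shots]:
        parts.append(f"Review: {example_review}\nUser story: {example_story}")
    # The target review is appended last, leaving the story for the model.
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)
```

The resulting string would be sent to a chat-completion endpoint of any of the evaluated models; with `shots=0` it degenerates to the zero-shot instruction plus the target review.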
Problem

Research questions and friction points this paper is trying to address.

app store reviews
user stories
software requirements
agile backlog
natural language processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
User Story Generation
App Store Reviews
Agile Requirements
Few-shot Prompting
Shadman Sakib
Department of Computer Science and Engineering, Islamic University of Technology (IUT), Gazipur, Bangladesh
Oishy Fatema Akhand
Department of Computer Science and Engineering, Islamic University of Technology (IUT), Gazipur, Bangladesh
Tasnia Tasneem
Department of Computer Science and Engineering, Islamic University of Technology (IUT), Gazipur, Bangladesh
Shohel Ahmed
Assistant Professor of CSE, Islamic University of Technology (IUT)