🤖 AI Summary
This work addresses the challenge of evaluating the fidelity of user stories automatically generated from stakeholder interview transcripts. Methodologically, we propose the first text-to-story alignment assessment framework, introducing two complementary metrics (correctness and completeness) and leveraging large language models for fine-grained semantic matching, augmented by embedding models for efficient candidate pruning and segment-level alignment modeling. Our key contribution lies in reframing requirements validation as a quantifiable, scalable, structured alignment task. Experimental evaluation across four real-world datasets demonstrates that our approach achieves a macro-F1 score of 0.86, significantly outperforming baseline methods, and enables rigorous quality comparison between generated and manually authored user stories. This establishes a novel paradigm for automated, alignment-based verification in requirements engineering.
📝 Abstract
Large language models (LLMs) can automate the generation of software requirements from natural language inputs such as transcripts of elicitation interviews. However, evaluating whether the derived requirements faithfully reflect stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that quantify the extent to which requirements (in the form of user stories) match the actual needs expressed by elicitation session participants. Given an interview transcript and a set of user stories, our metrics quantify (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of the transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments on four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone lag behind but enable effective blocking. Finally, we show how our metrics enable comparison across sets of stories (e.g., human-authored vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
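To make the two metrics concrete: once a matcher (LLM-based or otherwise) has decided, for every chunk-story pair, whether the chunk supports the story, correctness and completeness reduce to simple coverage ratios over the resulting boolean match matrix. The sketch below illustrates that computation; the function and variable names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of correctness/completeness over a chunk-story match
# matrix, assuming the matching step has already been run.
# matches[i][j] is True if transcript chunk i supports user story j.

def correctness(matches: list[list[bool]]) -> float:
    """Proportion of stories supported by at least one transcript chunk."""
    n_stories = len(matches[0])
    supported = sum(
        any(matches[i][j] for i in range(len(matches)))
        for j in range(n_stories)
    )
    return supported / n_stories

def completeness(matches: list[list[bool]]) -> float:
    """Proportion of transcript chunks supported by at least one story."""
    covered = sum(any(row) for row in matches)
    return covered / len(matches)

# Toy example: 3 transcript chunks, 2 user stories.
m = [
    [True, False],   # chunk 0 supports story 0
    [False, False],  # chunk 1 is unmatched (a missed need)
    [True, True],    # chunk 2 supports both stories
]
print(correctness(m))   # 1.0  (both stories are grounded in the transcript)
print(completeness(m))  # 0.666...  (2 of 3 chunks covered by some story)
```

In this toy case the stories are all correct but incomplete: chunk 1 expresses a need that no story captures, which is exactly the kind of gap the completeness metric is meant to surface.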