🤖 AI Summary
This study systematically evaluates the applicability of mainstream large language models (LLMs)—namely GPT-3.5 Turbo, GPT-4 Turbo, and Val—to Scrum iteration planning, focusing on three core tasks: user story estimation, task decomposition, and sprint goal generation. Using a manually curated, real-world annotated dataset, we employ a mixed qualitative and quantitative evaluation to empirically assess LLM outputs across accuracy, consistency, and operational feasibility—the first such investigation into engineering-grade usability in agile planning. Results indicate that current LLM outputs do not yet meet the threshold for direct production deployment. Key contributions include: (1) an empirical delineation of LLM capabilities and limitations in Scrum planning; (2) a practical hybrid enhancement framework combining rule-based engines with lightweight fine-tuning; and (3) the first empirically grounded benchmark and improvement roadmap for LLMs in Scrum iteration planning—advancing the integration of foundation models into software engineering practice.
📝 Abstract
Planning for an upcoming project iteration (sprint) is one of the key activities in Scrum. In this paper, we present our work in progress on exploring the applicability of Large Language Models (LLMs) to this problem. We conducted case studies with manually created data sets to investigate the applicability of OpenAI models for supporting sprint planning activities. In our experiments, we applied three models provided by OpenAI: GPT-3.5 Turbo, GPT-4 Turbo, and Val. The experiments demonstrated that the results produced by the models are not of acceptable quality for direct use in Scrum projects.