🤖 AI Summary
This study addresses the challenge of story point estimation in agile software development, where traditional methods rely heavily on large volumes of labeled data and struggle in new or data-scarce projects. The authors investigate the use of large language models (LLMs) to directly predict story points for user stories under zero-shot and few-shot settings, demonstrating for the first time that LLMs can outperform deep learning models trained on 80% of the available data, even without any task-specific training. The work proposes a novel few-shot prompting strategy that incorporates relative effort comparisons as in-context examples, significantly boosting prediction accuracy. Comprehensive experiments across 16 software projects with four prominent LLMs show that zero-shot LLMs already achieve competitive performance, while the proposed comparison-based few-shot approach further improves results, offering a promising new paradigm for agile estimation in low-resource scenarios.
📄 Abstract
This study investigates the use of large language models (LLMs) for story point estimation. Story points are unitless, project-specific effort estimates that help developers on a scrum team forecast which product backlog items they can complete in a sprint. To facilitate this process, machine learning models, especially deep neural networks, have been applied to predict story points from the title and description of each item. However, such models require sufficient training data (with ground-truth story points annotated by human developers) from the same software project to achieve decent prediction performance. This motivated us to explore whether LLMs are capable of predicting story points (RQ1) without any training data or (RQ2) with only a few training data points. Our empirical results with four LLMs on 16 software projects show that, without any training data (zero-shot prompting), LLMs can predict story points better than supervised deep learning models trained on 80% of the data. The prediction performance of LLMs can be further improved with a few training examples (few-shot prompting). In addition, a recent study explored the use of comparative judgments (i.e., given a pair of items, which one requires more effort to implement) instead of directly annotated story points, to reduce the cognitive burden on developers. Therefore, this study also explores (RQ3) whether comparative judgments are easier for LLMs to predict than story points and (RQ4) whether comparative judgments can serve as few-shot examples that improve LLMs' story point predictions. Empirical results suggest that predicting comparative judgments is not easier for LLMs than directly estimating story points, but comparative judgments can serve as few-shot examples that improve the LLMs' prediction performance as effectively as human-annotated story points do.
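To make the two prompting settings concrete, the sketch below shows how a zero-shot estimation prompt and a comparison-based few-shot prompt could be assembled. This is a minimal illustration assuming plain-text prompts; the wording, example format, and function names are hypothetical and not taken from the study itself.

```python
# Hypothetical sketch of the two prompting strategies described above.
# The prompt wording and data format are assumptions for illustration.

def zero_shot_prompt(title: str, description: str) -> str:
    """Ask an LLM to estimate story points with no in-context examples."""
    return (
        "Estimate the story points for the following backlog item.\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
        "Answer with a single number."
    )

def comparison_few_shot_prompt(target, comparisons) -> str:
    """Prepend comparative judgments (pairs of items plus a relative-effort
    verdict) as in-context examples before the estimation question."""
    lines = ["Relative effort comparisons from this project:"]
    for item_a, item_b, harder in comparisons:
        easier = item_b if harder == item_a else item_a
        lines.append(f'- "{harder}" requires more effort than "{easier}".')
    lines.append("")  # blank line before the actual question
    lines.append(zero_shot_prompt(*target))
    return "\n".join(lines)

# Example usage with made-up backlog items:
prompt = comparison_few_shot_prompt(
    ("Add login page", "Implement OAuth-based login for the web app."),
    [("Add login page", "Fix typo in footer", "Add login page")],
)
print(prompt)
```

In the zero-shot setting only `zero_shot_prompt` is sent to the model; in the comparison-based few-shot setting, the relative-effort judgments precede the same question, so the model can calibrate its estimate against project-specific effort relations.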