Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether small language models (SLMs) can effectively replace large language models (LLMs) for classification tasks in requirements engineering, balancing performance, efficiency, and security. We systematically evaluate eight models (three LLMs and five SLMs) across three benchmark datasets—PROMISE, PROMISE Reclass, and SecReq—covering a 300× parameter range, multilingual settings, and multiple metrics (e.g., F1-score, recall). Results show that SLMs achieve comparable performance: their average F1-score is only 2% lower than that of LLMs, a difference that is not statistically significant (p > 0.05), and they even outperform LLMs in recall on PROMISE Reclass. Crucially, model performance is driven primarily by data characteristics rather than scale. This work provides the first empirical evidence of SLMs' strong competitiveness in requirements classification, establishing them as viable, lightweight, controllable, and deployable alternatives for practical RE tooling.

📝 Abstract
[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is hampered by high computational cost, data-sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs on RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs nearly match LLM performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.
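The paper's core comparison—per-model precision, recall, and F1 on a binary requirements classification task—can be illustrated with a minimal sketch. All labels and predictions below are made-up placeholders for illustration, not data from the study:

```python
# Minimal sketch of the F1/recall comparison described in the abstract.
# Gold labels and model predictions here are hypothetical, not study data.

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels (1 = security-relevant requirement, 0 = not)
gold = [1, 0, 1, 1, 0, 1, 0, 0]
# Hypothetical predictions from a large and a small model
llm_pred = [1, 0, 1, 0, 0, 1, 0, 0]
slm_pred = [1, 0, 1, 1, 0, 1, 1, 0]

for name, pred in [("LLM", llm_pred), ("SLM", slm_pred)]:
    p, r, f1 = precision_recall_f1(gold, pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy setup the SLM trades a little precision for higher recall, mirroring the paper's observation that a smaller model can exceed LLM recall while keeping F1 within a narrow margin.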
Problem

Research questions and friction points this paper is trying to address.

Comparing small and large language models for requirements classification accuracy
Evaluating performance differences between SLMs and LLMs using multiple datasets
Investigating whether dataset characteristics impact performance more than model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLMs provide a lightweight, locally deployable alternative to LLMs
SLMs match LLM performance despite being up to 300 times smaller
Dataset characteristics outweigh model size in determining performance
Mohammad Amin Zadenoori
University of Padova, Italy
Vincenzo De Martino
Software Engineering (SeSa) Lab, University of Salerno, Italy
Jacek Dabrowski
Lero, the Research Ireland Centre for Software, University of Limerick, Ireland
Xavier Franch
Universitat Politècnica de Catalunya, Spain
Alessio Ferrari
Lecturer, UCD; Senior Research Scientist, ISTI CNR
Natural Language Processing · Requirements Engineering · Requirements Elicitation · Formal Methods