🤖 AI Summary
This study addresses two problems in study screening for systematic literature reviews: screening is costly and omitted studies threaten a review's validity, and the performance and variability of large language models (LLMs) on this task have not been systematically evaluated. It presents the first systematic quantification of screening effectiveness across 12 prominent LLMs (from OpenAI, Google Gemini, Anthropic, and Llama) and four classical classifiers on two real-world software engineering reviews. The work examines the impact of input metadata combinations, model non-determinism (even at temperature zero), and performance relative to traditional methods. Findings reveal substantial LLM variability despite deterministic settings, the decisive role of abstracts (adding titles or keywords yields no robust gains), and no consistent superiority of LLMs over conventional classifiers, challenging assumptions of their universal advantage.
📝 Abstract
Context: Study screening in systematic literature reviews is costly, inconsistency-prone, and risk-asymmetric, since false negatives can compromise validity. Despite rapid uptake of Large Language Models (LLMs), there is limited evidence on how such models behave during the study screening phase, particularly regarding the choice of specific LLMs and their comparison with classical models. Objective: To assess LLM performance and variability in screening, quantify the impact of input metadata (abstract, title, keywords), and compare LLMs with classical classifiers under a shared protocol. Methods: We analyzed 12 LLMs from 4 providers (OpenAI, Google Gemini, Anthropic, Llama) and 4 classical models (Logistic Regression, Support Vector Classification, Random Forest, and Naive Bayes) on 2 real Systematic Literature Reviews (SLRs), totaling 518 papers. The experimental design investigated 3 critical dimensions: (i) LLM performance variability, (ii) the impact of input feature composition (abstract, title, and keywords) on LLM performance, and (iii) the real gain of using LLMs instead of more traditional classification models. Results: LLMs exhibited substantial heterogeneity and residual non-determinism even at temperature zero. Abstract availability was decisive: removing it consistently degraded performance, while adding title and/or keywords to the abstract yielded no robust gains. Compared to classical models, performance differences were not consistent enough to support generalizable LLM superiority. Discussion: LLM adoption should be justified by operational and governance constraints (reproducibility, cost, metadata availability), supported by pilot validation and explicit reporting of variability and input configuration.
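The pilot validation recommended in the Discussion could start from a sketch like the one below. It is illustrative only and not the paper's protocol: the function names, the prompt layout, and the choice to enumerate every non-empty subset of the three metadata fields are all assumptions. It shows how to build inputs for each metadata configuration, score recall (the metric tied to false negatives), and measure how often repeated runs of the same model disagree on a paper.

```python
from itertools import combinations

# Metadata fields named in the abstract; the exact configurations the
# authors tested are not listed, so all non-empty subsets are assumed here.
FIELDS = ("abstract", "title", "keywords")


def feature_configurations(fields=FIELDS):
    """Return every non-empty combination of metadata fields."""
    return [
        combo
        for size in range(1, len(fields) + 1)
        for combo in combinations(fields, size)
    ]


def build_input(paper, config):
    """Join the selected, non-empty fields into one screening input (hypothetical layout)."""
    return "\n".join(
        f"{name.capitalize()}: {paper[name]}" for name in config if paper.get(name)
    )


def recall(labels, predictions):
    """Share of truly relevant papers that were included; false negatives lower it."""
    relevant = [pred for label, pred in zip(labels, predictions) if label]
    return sum(relevant) / len(relevant) if relevant else 0.0


def unstable_fraction(runs):
    """Fraction of papers whose include/exclude decision differs across repeated runs."""
    per_paper = list(zip(*runs))
    return sum(len(set(decisions)) > 1 for decisions in per_paper) / len(per_paper)
```

Here `runs` is a list of decision lists, one per repeated execution of the same model on the same papers at the same settings. Reporting `unstable_fraction` alongside recall for each configuration is one way to make variability and input configuration explicit.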