🤖 AI Summary
There is a lack of standardized benchmarks to evaluate the effectiveness of large language models (LLMs) in title-abstract screening for software engineering systematic reviews (SRs).
Method: We construct the first high-quality, task-specific benchmark dataset comprising 24 SRs and 34,528 manually labeled publications. Using nine state-of-the-art LLMs, we conduct multi-round experiments assessing precision, recall, and screening cost (under $40 per SR, even for the most expensive model).
Contribution/Results: Results reveal substantial performance variation across SRs, exceeding inter-model differences and highlighting task heterogeneity as a key challenge. No current LLM achieves both high recall (>95%) and acceptable precision (>70%), so fully automated screening is not yet feasible. This work establishes the first dedicated benchmark for AI-assisted screening in software engineering SRs, providing a reproducible, scalable evaluation framework and empirical evidence to guide future research and practical deployment.
📝 Abstract
Background: The use of large language models (LLMs) in the title-abstract screening process of systematic reviews (SRs) has shown promising results but suffers from limited performance evaluation.

Aims: Create a benchmark dataset to evaluate the performance of LLMs in the title-abstract screening process of SRs, and provide evidence on whether using LLMs for title-abstract screening in software engineering is advisable.

Method: We start with 169 SR research artifacts and find 24 of them suitable for inclusion in the dataset. Using the dataset, we benchmark title-abstract screening with nine LLMs.

Results: We present the SESR-Eval (Software Engineering Systematic Review Evaluation) dataset containing 34,528 labeled primary studies, sourced from 24 secondary studies published in software engineering (SE) journals. Most LLMs performed similarly, and the differences in screening accuracy between secondary studies are greater than the differences between LLMs. The cost of using an LLM is relatively low: less than $40 per secondary study, even for the most expensive model.

Conclusions: Our benchmark enables monitoring AI performance on the screening task of SRs in software engineering. At present, LLMs are not yet recommended for automating the title-abstract screening process, since accuracy varies widely across secondary studies and no LLM achieved high recall with reasonable precision. In future work, we plan to investigate the factors that influence LLM screening performance across studies.
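To make the evaluation concrete, here is a minimal sketch of how per-review precision and recall can be computed from LLM include/exclude decisions against the human labels. The helper function and toy data below are illustrative assumptions, not the paper's actual code or dataset.

```python
# Sketch: per-secondary-study screening metrics from binary
# include/exclude decisions. All data here are invented.

def screening_metrics(gold, predicted):
    """Compute (precision, recall) for include/exclude labels.

    gold, predicted: equal-length lists of booleans (True = include).
    """
    tp = sum(g and p for g, p in zip(gold, predicted))        # correctly included
    fp = sum((not g) and p for g, p in zip(gold, predicted))  # wrongly included
    fn = sum(g and (not p) for g, p in zip(gold, predicted))  # wrongly excluded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: six labeled primary studies from one secondary study.
gold = [True, True, True, False, False, False]   # human labels
pred = [True, True, False, True, False, False]   # LLM decisions
p, r = screening_metrics(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

Recall matters most here: a wrongly excluded primary study (a false negative) is lost from the review entirely, whereas a false positive only costs extra manual checking, which is why the summary requires high recall before automation is viable.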