Fine-Tuning A Large Language Model for Systematic Review Screening

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the labor-intensive and time-consuming nature of title and abstract screening in systematic reviews, a task for which existing large language models (LLMs) exhibit unstable performance when relying solely on prompting. For the first time, we demonstrate the effectiveness of supervised fine-tuning of a small-scale open-source LLM (1.2B parameters) for this task, using over 8,500 human-annotated samples to achieve domain adaptation. The fine-tuned model achieves an 80.79% improvement in weighted F1 score, showing 86.40% agreement with human coders across 8,277 studies, with a true positive rate of 91.18% and a true negative rate of 86.38%. Critically, the model's inference results are fully reproducible and significantly outperform prompt-only approaches.
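The page does not describe the training stack, so the sketch below is only a rough illustration of one plausible supervised fine-tuning setup for include/exclude screening with a ~1.2B-parameter open-weight model. The base checkpoint name, hyperparameters, and data format are assumptions, not details reported by the study.

```python
# Hypothetical sketch: fine-tuning a small open-weight LLM as a binary
# include/exclude classifier for title-and-abstract screening.
# Model name, hyperparameters, and record format are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "EleutherAI/pythia-1.4b"  # stand-in for the ~1.2B open-weight model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Each record pairs a screened title + abstract with the human label.
records = [
    {"text": "Title: ... Abstract: ...", "label": 1},  # 1 = include, 0 = exclude
    # ... ~8,500 human-annotated samples in the actual study
]
dataset = Dataset.from_list(records)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

args = TrainingArguments(
    output_dir="screening-ft",
    per_device_train_batch_size=8,
    num_train_epochs=3,   # assumed; not reported on this page
    learning_rate=2e-5,   # assumed
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorWithPadding(tokenizer),
).train()
```

Framing screening as sequence classification is only one option; the authors may instead have fine-tuned the model as a causal LM on prompt-response pairs.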

πŸ“ Abstract
Systematic reviews have traditionally taken considerable amounts of human time and energy to complete, in part due to the extensive number of titles and abstracts that must be reviewed for potential inclusion. Recently, researchers have begun to explore how to use large language models (LLMs) to make this process more efficient. However, research to date has shown inconsistent results. We posit this is because prompting alone may not provide sufficient context for the model(s) to perform well. In this study, we fine-tune a small 1.2 billion parameter open-weight LLM specifically for study screening in the context of a systematic review in which humans rated more than 8,500 titles and abstracts for potential inclusion. Our results showed strong performance improvements from the fine-tuned model, with the weighted F1 score improving 80.79% compared to the base model. When run on the full dataset of 8,277 studies, the fine-tuned model had 86.40% agreement with the human coder, a 91.18% true positive rate, an 86.38% true negative rate, and perfect agreement across multiple inference runs. Taken together, our results show that there is promise for fine-tuning LLMs for title and abstract screening in large-scale systematic reviews.
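For readers who want to see how the reported quantities relate to a confusion matrix, a minimal sketch is below. The label arrays are hypothetical placeholders, not the study's data; the metric definitions (coder agreement, true positive rate, true negative rate, weighted F1) follow standard usage.

```python
# Minimal sketch of the reported evaluation metrics from human labels vs.
# model predictions. Labels here are illustrative only.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

human = [1, 0, 1, 0, 0, 1, 0, 0]   # hypothetical include (1) / exclude (0) labels
model = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(human, model).ravel()

agreement = accuracy_score(human, model)   # % agreement with the human coder
tpr = tp / (tp + fn)                       # true positive rate (sensitivity)
tnr = tn / (tn + fp)                       # true negative rate (specificity)
weighted_f1 = f1_score(human, model, average="weighted")

print(f"agreement={agreement:.2%}, TPR={tpr:.2%}, TNR={tnr:.2%}, F1={weighted_f1:.4f}")
```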
Problem

Research questions and friction points this paper is trying to address.

systematic review
study screening
large language models
title and abstract screening
human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-tuning
large language models
systematic review
study screening
title and abstract screening
🔎 Similar Papers
No similar papers found.
Kweku Yamoah
University of Florida, Gainesville, FL, 32611
Noah Schroeder
University of Florida, Gainesville, FL, 32611
Emmanuel Dorley
University of Florida, Gainesville, FL, 32611
Neha Rani
Assistant Instructional Professor at University of Florida
Human-Computer Interaction · AI in Education · Context-Aware Recommender Systems · Smart Wearables
Caleb Schutz
University of Florida, Gainesville, FL, 32611