LLM-Assisted Abstract Screening with OLIVER: Evaluating Calibration and Single-Model vs. Actor-Critic Configurations in Literature Reviews

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior LLM-assisted screening studies rely on outdated models, Cochrane datasets, and fixed configurations, overlooking calibration, generalizability, and configuration sensitivity. To address these limitations, we propose OLIVER, an open-source framework designed for real-world, non-Cochrane systematic reviews. OLIVER is the first to empirically expose severe miscalibration (quantified by Expected Calibration Error, ECE) in mainstream LLMs for screening tasks. Methodologically, it introduces a lightweight Actor-Critic dual-model architecture that combines prompt engineering with multi-strategy ensemble scoring (majority voting, confidence-weighted aggregation, and threshold-based fusion) to improve both discriminative accuracy and confidence reliability. Evaluated on two real-world systematic review projects, OLIVER reduces ECE by 37–62%, improves AUC by 0.08–0.15, and maintains high specificity and scalability, demonstrating robust performance beyond synthetic or domain-restricted benchmarks.
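The summary's headline metric, Expected Calibration Error, measures the gap between a model's stated confidence and its actual accuracy. A minimal sketch of the standard binned ECE computation (the binning scheme here is the common equal-width variant, not necessarily the exact one used in the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the proportion-weighted gap between mean confidence
    and empirical accuracy within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

For example, a screener that reports 90% confidence but is right only half the time has an ECE of 0.4; a 37–62% reduction, as reported for OLIVER, would bring such a model's confidences much closer to its true accuracy.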

📝 Abstract
Introduction: Recent work suggests large language models (LLMs) can accelerate screening, but prior evaluations focus on earlier LLMs, standardized Cochrane reviews, single-model setups, and accuracy as the primary metric, leaving generalizability, configuration effects, and calibration largely unexamined. Methods: We developed OLIVER (Optimized LLM-based Inclusion and Vetting Engine for Reviews), an open-source pipeline for LLM-assisted abstract screening. We evaluated multiple contemporary LLMs across two non-Cochrane systematic reviews, assessing performance at both the full-text screening and final inclusion stages using accuracy, AUC, and calibration metrics. We further tested an actor-critic screening framework combining two lightweight models under three aggregation rules. Results: Across individual models, performance varied widely. In the smaller Review 1 (821 abstracts, 63 final includes), several models achieved high sensitivity for final includes but at the cost of substantial false positives and poor calibration. In the larger Review 2 (7741 abstracts, 71 final includes), most models were highly specific but struggled to recover true includes, with prompt design influencing recall. Calibration was consistently weak across single-model configurations despite high overall accuracy. Actor-critic screening improved discrimination and markedly reduced calibration error in both reviews, yielding higher AUCs. Discussion: LLMs may eventually accelerate abstract screening, but single-model performance is highly sensitive to review characteristics and prompt design, and calibration is limited. An actor-critic framework improves classification quality and confidence reliability while remaining computationally efficient, enabling large-scale screening at low cost.
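The abstract names three aggregation rules for combining the two lightweight models but does not spell them out. A hedged sketch of how such rules might combine an actor's and a critic's inclusion probabilities; the function name, parameters, and the exact form of each rule are illustrative assumptions, not the paper's implementation:

```python
def aggregate(p_actor, p_critic, rule="confidence_weighted", w=0.5, t=0.5):
    """Combine actor and critic inclusion probabilities under one of three
    aggregation rules. All parameter choices here are illustrative."""
    if rule == "majority":
        # With only two models, "majority voting" reduces to agreement:
        # include only when both models vote include (assumption).
        return float(p_actor >= t and p_critic >= t)
    if rule == "confidence_weighted":
        # Weighted average of the two confidence scores.
        return w * p_actor + (1 - w) * p_critic
    if rule == "threshold":
        # Threshold-based fusion: the critic vetoes low-confidence
        # includes proposed by the actor (assumed form of the rule).
        return p_actor if p_critic >= t else min(p_actor, p_critic)
    raise ValueError(f"unknown rule: {rule}")
```

Confidence-weighted aggregation keeps a continuous score (useful for AUC and calibration), while the majority and threshold rules act as conservative filters that can suppress the false positives single models produce.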
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM calibration and performance in abstract screening
Compares single-model versus actor-critic configurations for reviews
Assesses generalizability across non-Cochrane systematic literature reviews
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source pipeline OLIVER for LLM-assisted abstract screening
Evaluated multiple LLMs using accuracy, AUC, and calibration metrics
Actor-critic framework with lightweight models improves classification and calibration