🤖 AI Summary
Large language models (LLMs) are increasingly deployed to automate methodological information extraction from full-text scientific papers in systematic reviews, yet their reliability—particularly for tasks requiring causal reasoning—remains poorly understood.
Method: We evaluated state-of-the-art LLMs on 180 empirical papers, using expert human annotation as the gold standard, and benchmarked performance on two task types: identification of explicitly stated methodology and assessment of causal mediation analyses whose details require inference from implicit information.
Contribution/Results: LLMs achieve near-human performance on explicit method recognition (F1 correlation = 0.97) but underperform expert reviewers by up to 15% on causal-reasoning tasks and degrade markedly on longer texts. Critically, errors stem from overreliance on superficial linguistic cues rather than engagement with the underlying methodological logic. This study provides the first systematic empirical characterization of LLMs' reasoning bottlenecks in methodological assessment, offering foundational evidence and concrete directions for developing trustworthy AI-assisted systematic review tools.
📝 Abstract
Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues: for instance, models frequently misinterpreted keywords such as "longitudinal" or "sensitivity" as automatic evidence of a rigorous methodological approach, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus offers a promising way to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
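To make the reported agreement metrics concrete, below is a minimal sketch (not the authors' code) of one way a per-criterion F1 correlation between an LLM and a human reviewer could be computed, assuming both produce binary per-paper labels that are scored against the expert gold standard. The criterion names and label values are illustrative placeholders, not data from the study.

```python
# Sketch: per-criterion F1 for an LLM and a human reviewer, each scored
# against an expert gold standard, then correlated across criteria.
# All labels below are toy placeholders for illustration only.
from sklearn.metrics import f1_score
from scipy.stats import pearsonr

# Binary labels per paper for each (hypothetical) methodological criterion.
gold = {
    "mediator_defined":     [1, 1, 0, 1, 0, 1],
    "temporal_ordering":    [0, 1, 1, 0, 1, 1],
    "sensitivity_analysis": [1, 0, 0, 1, 1, 0],
}
llm_pred = {
    "mediator_defined":     [1, 1, 0, 1, 1, 1],
    "temporal_ordering":    [0, 1, 0, 0, 1, 1],
    "sensitivity_analysis": [1, 1, 0, 1, 1, 0],
}
human_pred = {
    "mediator_defined":     [1, 1, 0, 1, 0, 1],
    "temporal_ordering":    [0, 1, 1, 0, 1, 0],
    "sensitivity_analysis": [1, 0, 0, 1, 1, 1],
}

def per_criterion_f1(pred, gold):
    """F1 for each criterion, scoring predictions against the gold standard."""
    return [f1_score(gold[c], pred[c]) for c in gold]

llm_f1 = per_criterion_f1(llm_pred, gold)
human_f1 = per_criterion_f1(human_pred, gold)

# Correlation of the two F1 profiles across criteria.
r, _ = pearsonr(llm_f1, human_f1)
print(f"F1 correlation across criteria: r = {r:.2f}")
```

In the actual study the comparison spans 180 papers and a fuller set of methodological criteria; this sketch only shows the shape of such a computation, not the paper's evaluation pipeline.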