Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis

📅 2025-10-12
📈 Citations: 0
Influential: 0

📝 Abstract
Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
Problem

Research questions and friction points this paper is trying to address.

Automating methodological assessment in systematic reviews using LLMs
Benchmarking LLM performance against human experts on full-text articles
Addressing LLM limitations in complex inference-intensive methodological evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs benchmarked against human reviewers for mediation analysis
Models excel at identifying explicit methodological features
Integration of automated extraction with expert review
Wenqing Zhang
Washington University in St. Louis
AI/ML, Computer Vision, Autonomous Driving
Trang Nguyen
Technical Staff, MIT Lincoln Laboratory
Natural Language Processing, Large Language Models, Explainable AI, Cyber Analytics
Elizabeth A. Stuart
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, United States; Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, United States
Yiqun T. Chen
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, United States; Department of Computer Science, Johns Hopkins Whiting School of Engineering, Baltimore, MD 21218, United States