🤖 AI Summary
Large language models (LLMs) are increasingly deployed to automate methodological information extraction from full-text scientific papers in systematic reviews, yet their reliability—particularly for tasks requiring causal reasoning—remains poorly understood.
Method: We evaluated state-of-the-art LLMs on 180 empirical papers, using expert human annotation as the gold standard, and benchmarked performance on two task types: identification of explicitly stated methodology and assessment of causal mediation analyses whose details require inference from implicit information.
Contribution/Results: LLMs achieve near-human performance on explicit method recognition (F1 correlation = 0.97) but underperform expert reviewers by up to 15% on causal-reasoning tasks and degrade markedly on longer texts. Critically, errors stem from overreliance on superficial linguistic cues rather than engagement with the underlying methodological logic. This study provides the first systematic empirical characterization of LLMs' reasoning bottlenecks in methodological assessment, offering foundational evidence and concrete directions for developing trustworthy AI-assisted systematic review tools.
📝 Abstract
Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues: for instance, models frequently misinterpreted keywords such as "longitudinal" or "sensitivity" as automatic evidence of a rigorous methodological approach, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus offers a promising way to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.
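To make the reported agreement metrics concrete, below is a minimal sketch (not the authors' code) of one way a per-criterion F1 correlation between an LLM and a human reviewer could be computed, assuming both produce binary per-paper labels that are scored against the expert gold standard. The criterion names and label values are illustrative placeholders, not data from the study.

```python
# Sketch: per-criterion F1 for an LLM and a human reviewer, each scored
# against an expert gold standard, then correlated across criteria.
# All labels below are toy placeholders for illustration only.
from sklearn.metrics import f1_score
from scipy.stats import pearsonr

# Binary labels per paper for each (hypothetical) methodological criterion.
gold = {
    "mediator_defined":     [1, 1, 0, 1, 0, 1],
    "temporal_ordering":    [0, 1, 1, 0, 1, 1],
    "sensitivity_analysis": [1, 0, 0, 1, 1, 0],
}
llm_pred = {
    "mediator_defined":     [1, 1, 0, 1, 1, 1],
    "temporal_ordering":    [0, 1, 0, 0, 1, 1],
    "sensitivity_analysis": [1, 1, 0, 1, 1, 0],
}
human_pred = {
    "mediator_defined":     [1, 1, 0, 1, 0, 1],
    "temporal_ordering":    [0, 1, 1, 0, 1, 0],
    "sensitivity_analysis": [1, 0, 0, 1, 1, 1],
}

def per_criterion_f1(pred, gold):
    """F1 for each criterion, scoring predictions against the gold standard."""
    return [f1_score(gold[c], pred[c]) for c in gold]

llm_f1 = per_criterion_f1(llm_pred, gold)
human_f1 = per_criterion_f1(human_pred, gold)

# Correlation of the two F1 profiles across criteria.
r, _ = pearsonr(llm_f1, human_f1)
print(f"F1 correlation across criteria: r = {r:.2f}")
```

In the actual study the comparison spans 180 papers and a fuller set of methodological criteria; this sketch only shows the shape of such a computation, not the paper's evaluation pipeline.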