MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in medical LLM evaluation: existing benchmarks mostly test factual recall, leaving higher-order reasoning, such as synthesizing evidence from multiple studies, under-explored. The authors introduce MedMeta, which they describe as the first benchmark for generating meta-analysis conclusions from the abstracts of cited studies. It comprises 81 meta-analyses from PubMed (2018–2025) and two evaluation workflows: Golden-RAG, which supplies the ground-truth abstracts, and Parametric-only, which relies on the model's internal knowledge. The LLM-as-a-judge protocol is validated against human expert ratings using Pearson correlation (r = 0.81) and Bland-Altman analysis, which shows negligible systematic bias. Results show that Golden-RAG consistently outperforms the Parametric-only approach, that domain-specific fine-tuning yields only marginal gains, and that in stress tests all models fail to identify and reject negated evidence. Even under ideal RAG conditions, current LLMs score only about 2.7/5.0, underscoring the importance of information grounding for clinical applications.
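As a rough illustration of how the two workflows differ, the sketch below builds one prompt per workflow for the same item. It is not the authors' implementation: the `MetaAnalysisItem` record, the `build_prompt` function, the prompt wording, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetaAnalysisItem:
    """Hypothetical record: one meta-analysis topic plus the abstracts of its cited studies."""
    topic: str
    cited_abstracts: List[str]


def build_prompt(item: MetaAnalysisItem, workflow: str) -> str:
    """Return a prompt for the given workflow (prompt wording is an assumption)."""
    task = f"Write the conclusion of a meta-analysis on: {item.topic}"
    if workflow == "golden_rag":
        # Golden-RAG: the ground-truth abstracts are placed directly in the prompt.
        evidence = "\n".join(
            f"[{i}] {abstract}"
            for i, abstract in enumerate(item.cited_abstracts, start=1)
        )
        return f"{task}\n\nUse only the study abstracts below.\n{evidence}"
    if workflow == "parametric":
        # Parametric-only: no external material, the model relies on internal knowledge.
        return f"{task}\n\nNo abstracts are provided; rely on your own knowledge."
    raise ValueError(f"unknown workflow: {workflow}")


# Made-up example item, for illustration only.
example = MetaAnalysisItem(
    topic="Drug X versus placebo for condition Y",
    cited_abstracts=[
        "Trial A reported a reduction in symptoms with Drug X.",
        "Trial B found no significant difference between arms.",
    ],
)
print(build_prompt(example, "golden_rag"))
print(build_prompt(example, "parametric"))
```

Keeping the task line identical across both branches means any score difference comes from the presence or absence of the abstracts rather than from differences in instructions.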

📝 Abstract

Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018–2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing that our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by a high Pearson correlation (r = 0.81) and a Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
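The agreement check between judge and human ratings can be sketched as follows. This is a minimal standard-library version of Pearson's r and the usual Bland-Altman quantities (mean difference as bias, and limits of agreement at bias ± 1.96 standard deviations of the differences); the function names and the ratings are made up for illustration and are not the paper's data or code.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for the paired differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)   # systematic offset between the two raters
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up 1-5 ratings for the same set of generated conclusions.
judge_scores = [3.0, 2.5, 4.0, 1.5, 3.5, 2.0]
human_scores = [3.0, 2.0, 4.5, 1.5, 3.0, 2.5]

r = pearson_r(judge_scores, human_scores)
bias, lower, upper = bland_altman(judge_scores, human_scores)
print(f"Pearson r = {r:.2f}")
print(f"Bland-Altman bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

The two statistics answer different questions: a high r shows that the judge ranks outputs similarly to the experts, while a bias near zero shows that it does not systematically score higher or lower than they do.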

Problem

Research questions and friction points this paper is trying to address.

meta-analysis

evidence synthesis

large language models

medical reasoning

retrieval-augmented generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

meta-analysis synthesis

retrieval-augmented generation

LLM-as-a-judge

evidence grounding

medical reasoning benchmark
