MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

πŸ“… 2026-04-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the lack of benchmarks for evaluating large language models' ability to infer scientific conclusions from structured biomedical evidence. The authors introduce the first large-scale benchmark dataset for biomedical conclusion generation, comprising 5.7 million structured PubMed abstracts in which the background, methods, and results sections are paired with the author-written conclusions, augmented with journal metadata to enable cross-domain analysis. Combining automatic metrics with multi-dimensional LLM-as-a-judge evaluations, the study reveals fundamental differences between conclusion generation and general summarization, shows that current strong models converge on automatic metrics, and finds that evaluator identity substantially shifts absolute scores. This resource provides reusable infrastructure for advancing research on scientific reasoning.
πŸ“ Abstract
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
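The abstract describes each instance as a pairing of an abstract's non-conclusion sections with the author-written conclusion, scored with reference-based metrics. A minimal sketch of that setup is below; the field names and the unigram-F1 scorer are illustrative assumptions, not the dataset's actual schema or the paper's metrics.

```python
# Minimal sketch of one MedConclusion-style instance and a simple
# reference-based score. Field names are illustrative assumptions.

def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference conclusion."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Count each token at most as often as it appears in both texts.
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical instance: non-conclusion sections plus the reference conclusion.
instance = {
    "background": "Statins are widely prescribed ...",
    "methods": "We conducted a retrospective cohort study ...",
    "results": "Treatment was associated with lower risk ...",
    "conclusion": "Statin use was associated with reduced risk.",  # reference
}

generated = "Statin use was linked with reduced risk."
score = token_f1(generated, instance["conclusion"])
```

In practice the paper pairs such reference-based scoring with LLM-as-a-judge evaluation, since surface overlap alone cannot capture whether a conclusion is actually supported by the results section.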
Problem

Research questions and friction points this paper is trying to address.

biomedical conclusion generation
scientific reasoning
structured abstracts
large language models
evidence-to-conclusion inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

biomedical conclusion generation
structured abstracts
evidence-to-conclusion reasoning
large language models
scientific text benchmark
πŸ”Ž Similar Papers
No similar papers found.
Weiyue Li
Harvard AI and Robotics Lab, Harvard Medical School
Ruizhi Qian
University of Southern California
Yi Li
Carnegie Mellon University
Yongce Li
Stanford University
Yunfan Long
Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Jiahui Cai
Harvard AI and Robotics Lab, Harvard Medical School
Yan Luo
Harvard University
Computer Vision, Machine Learning, Biomedical Imaging, AI for Medicine
Mengyu Wang
Assistant Professor, Harvard Medical School
Artificial Intelligence, Machine Learning, Ophthalmology, Glaucoma, Computational Mechanics