🤖 AI Summary
This work addresses the lack of benchmarks for evaluating large language models' ability to infer scientific conclusions from structured biomedical evidence. The authors introduce the first large-scale benchmark dataset for biomedical conclusion generation, comprising 5.7 million structured PubMed abstracts in which background, methods, and results sections are paired with author-written conclusions, augmented with journal metadata to enable cross-domain analysis. Through a combination of automatic metrics and multi-dimensional LLM-as-a-judge evaluations, the study reveals fundamental differences between conclusion generation and general summarization, demonstrates performance convergence among current strong models on automatic metrics, and shows that the choice of judge model can substantially shift absolute scores. This resource provides a reusable infrastructure for advancing research in scientific reasoning.
📄 Abstract
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
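The dataset pairs the non-conclusion sections of each abstract with its author-written conclusion and journal metadata. A minimal sketch of what one such instance and a conclusion-generation prompt might look like is below; the field names and prompt wording are illustrative assumptions, not MedConclusion's documented schema.

```python
# Hypothetical sketch of one MedConclusion-style instance.
# Field names are illustrative assumptions, not the dataset's actual schema.
instance = {
    "pmid": "12345678",                       # PubMed identifier
    "background": "Hypertension is ...",      # non-conclusion sections serve
    "methods": "We conducted a ...",          # as the evidence input
    "results": "Treatment reduced ...",
    "conclusion": "These findings suggest ...",  # author-written target
    "journal_category": "Cardiology",         # journal-level metadata
    "journal_sjr": 2.31,                      # SCImago Journal Rank
}

def build_prompt(example: dict) -> str:
    """Assemble the evidence sections into a conclusion-generation prompt."""
    evidence = "\n\n".join(
        f"{section.upper()}: {example[section]}"
        for section in ("background", "methods", "results")
    )
    return f"{evidence}\n\nWrite the conclusion:"

print(build_prompt(instance))
```

The author-written `conclusion` field is held out as the reference for reference-based metrics, while the assembled prompt is what a model under evaluation would see.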