EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

📅 2025-04-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of automatically extracting the evidence required for hypothesis verification in biomedical literature by introducing EvidenceBench, the first benchmark dedicated to hypothesis-verification evidence extraction. Methodologically, it proposes an expert-driven, end-to-end annotation pipeline: domain-knowledge-informed generation of verifiable hypotheses, followed by multi-round expert-consensus annotation that labels each sentence of a paper as relevant evidence or not. The resulting large-scale training set, EvidenceBench-100k, comprises fully annotated full texts from 107,461 biomedical papers. Key contributions include (i) an annotation paradigm strictly grounded in expert judgment, (ii) a domain-aware hypothesis-generation strategy, and (iii) an evaluation of a diverse set of language models and retrieval systems on the benchmark. Experiments reveal a substantial performance gap between current models and human experts. Both datasets are publicly released, establishing foundational resources for hypothesis-driven AI research in biomedicine.
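
To make the task concrete, here is a minimal sketch of what a hypothesis-verification evidence-extraction instance and a recall-style metric could look like. The schema (`hypothesis`, `sentences`, `evidence_indices`) and the metric are illustrative assumptions, not EvidenceBench's actual data format or official evaluation measure.

```python
from dataclasses import dataclass

@dataclass
class EvidenceInstance:
    # Hypothetical schema: a verifiable hypothesis paired with a paper's
    # sentences and the indices of expert-annotated evidence sentences.
    hypothesis: str
    sentences: list[str]        # full text, split sentence by sentence
    evidence_indices: set[int]  # sentences marked as relevant evidence

def recall_at_k(predicted_ranking: list[int], gold: set[int], k: int) -> float:
    """Fraction of gold evidence sentences found in the top-k predictions."""
    if not gold:
        return 0.0
    hits = len(set(predicted_ranking[:k]) & gold)
    return hits / len(gold)

# Example: a model ranks all sentences by relevance to the hypothesis, and we
# check how many expert-marked evidence sentences appear in the top k.
instance = EvidenceInstance(
    hypothesis="Drug X reduces tumor growth in mouse models.",
    sentences=["intro sentence", "result sentence", "method sentence"],
    evidence_indices={1, 2},
)
print(recall_at_k([2, 0, 1], instance.evidence_indices, k=3))  # -> 1.0
```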

πŸ“ Abstract
We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluate a diverse set of language models and retrieval systems on the benchmark and find that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create the larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
Problem

Research questions and friction points this paper is trying to address.

Automatically finding evidence for biomedical hypotheses
Measuring model performance on evidence extraction
Creating scalable datasets for model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel pipeline for biomedical evidence extraction
Hypothesis generation and sentence annotation
Scalable dataset of 107,461 fully annotated papers (see the baseline sketch after this list)
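
As referenced above, the following is a minimal sketch of the kind of retrieval baseline evaluated on benchmarks like this one: ranking a paper's sentences by TF-IDF similarity to the hypothesis. It is a generic illustration, not one of the systems the paper actually evaluates, and the example hypothesis and sentences are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(hypothesis: str, sentences: list[str]) -> list[int]:
    """Rank sentence indices by TF-IDF cosine similarity to the hypothesis."""
    vectorizer = TfidfVectorizer()
    # Fit on the hypothesis plus sentences so they share one vocabulary.
    matrix = vectorizer.fit_transform([hypothesis] + sentences)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(range(len(sentences)), key=lambda i: sims[i], reverse=True)

sentences = [
    "Mice were randomized into treatment and control groups.",
    "Tumor volume decreased significantly under Drug X (p < 0.01).",
    "Funding was provided by the national research council.",
]
print(rank_sentences("Drug X reduces tumor growth in mouse models.", sentences))
```

A stronger baseline would swap the TF-IDF vectors for dense sentence embeddings, but the ranking-and-scoring structure stays the same.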
👥 Authors
Jianyou Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Weili Cao (Laboratory for Emerging Intelligence, University of California, San Diego)
Kaicheng Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Xiaoyue Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Ashish Dalvi (Laboratory for Emerging Intelligence, University of California, San Diego)
Gino Prasad (Laboratory for Emerging Intelligence, University of California, San Diego)
Qishan Liang (Department of Cellular and Molecular Medicine, University of California, San Diego)
Hsuan-lin Her (Department of Cellular and Molecular Medicine, University of California, San Diego)
Ming Wang (Sichuan Cancer Hospital & Institute)
Qin Yang (The Third People's Hospital of Chengdu)
Gene W. Yeo (Department of Cellular and Molecular Medicine, University of California, San Diego)
David E. Neal (Elsevier)
Maxim Khan (Elsevier)
Christopher D. Rosin (Elsevier)
Ramamohan Paturi (Laboratory for Emerging Intelligence, University of California, San Diego)
Leon Bergen (Associate Professor, Computational Linguistics, UCSD)