EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

📅 2025-04-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of automatically extracting the evidence required for hypothesis verification in biomedical literature by introducing EvidenceBench, the first benchmark dedicated to hypothesis-verification evidence extraction. Methodologically, it proposes an expert-driven, end-to-end annotation pipeline: domain-knowledge-informed generation of verifiable hypotheses, followed by multi-round expert-consensus annotation that labels each sentence of a paper as relevant evidence or not. The resulting large-scale training set, EvidenceBench-100k, comprises fully annotated full texts from 107,461 biomedical papers. Key contributions include (i) an annotation paradigm strictly grounded in expert judgment, (ii) a domain-aware hypothesis-generation strategy, and (iii) an evaluation of a diverse set of language models and retrieval systems on the benchmark. Experiments reveal a substantial performance gap between current models and human experts. Both datasets are publicly released, establishing foundational resources for hypothesis-driven AI research in biomedicine.
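
To make the task concrete, here is a minimal sketch of what a hypothesis-verification evidence-extraction instance and a recall-style metric could look like. The schema (`hypothesis`, `sentences`, `evidence_indices`) and the metric are illustrative assumptions, not EvidenceBench's actual data format or official evaluation measure.

```python
from dataclasses import dataclass

@dataclass
class EvidenceInstance:
    # Hypothetical schema: a verifiable hypothesis paired with a paper's
    # sentences and the indices of expert-annotated evidence sentences.
    hypothesis: str
    sentences: list[str]        # full text, split sentence by sentence
    evidence_indices: set[int]  # sentences marked as relevant evidence

def recall_at_k(predicted_ranking: list[int], gold: set[int], k: int) -> float:
    """Fraction of gold evidence sentences found in the top-k predictions."""
    if not gold:
        return 0.0
    hits = len(set(predicted_ranking[:k]) & gold)
    return hits / len(gold)

# Example: a model ranks all sentences by relevance to the hypothesis, and we
# check how many expert-marked evidence sentences appear in the top k.
instance = EvidenceInstance(
    hypothesis="Drug X reduces tumor growth in mouse models.",
    sentences=["intro sentence", "result sentence", "method sentence"],
    evidence_indices={1, 2},
)
print(recall_at_k([2, 0, 1], instance.evidence_indices, k=3))  # -> 1.0
```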

πŸ“ Abstract
We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure model performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluate a diverse set of language models and retrieval systems on the benchmark and find that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create the larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
Problem

Research questions and friction points this paper is trying to address.

Automatically finding evidence for biomedical hypotheses
Measuring model performance on evidence extraction
Creating scalable datasets for model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel pipeline for biomedical evidence extraction
Hypothesis generation and sentence annotation
Scalable dataset of 107,461 fully annotated papers (see the baseline sketch after this list)
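
As referenced above, the following is a minimal sketch of the kind of retrieval baseline evaluated on benchmarks like this one: ranking a paper's sentences by TF-IDF similarity to the hypothesis. It is a generic illustration, not one of the systems the paper actually evaluates, and the example hypothesis and sentences are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(hypothesis: str, sentences: list[str]) -> list[int]:
    """Rank sentence indices by TF-IDF cosine similarity to the hypothesis."""
    vectorizer = TfidfVectorizer()
    # Fit on the hypothesis plus sentences so they share one vocabulary.
    matrix = vectorizer.fit_transform([hypothesis] + sentences)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(range(len(sentences)), key=lambda i: sims[i], reverse=True)

sentences = [
    "Mice were randomized into treatment and control groups.",
    "Tumor volume decreased significantly under Drug X (p < 0.01).",
    "Funding was provided by the national research council.",
]
print(rank_sentences("Drug X reduces tumor growth in mouse models.", sentences))
```

A stronger baseline would swap the TF-IDF vectors for dense sentence embeddings, but the ranking-and-scoring structure stays the same.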
👥 Authors
Jianyou Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Weili Cao (Laboratory for Emerging Intelligence, University of California, San Diego)
Kaicheng Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Xiaoyue Wang (Laboratory for Emerging Intelligence, University of California, San Diego)
Ashish Dalvi (Laboratory for Emerging Intelligence, University of California, San Diego)
Gino Prasad (Laboratory for Emerging Intelligence, University of California, San Diego)
Qishan Liang (Department of Cellular and Molecular Medicine, University of California, San Diego)
Hsuan-lin Her (Department of Cellular and Molecular Medicine, University of California, San Diego)
Ming Wang (Sichuan Cancer Hospital & Institute)
Qin Yang (The Third People's Hospital of Chengdu)
Gene W. Yeo (Department of Cellular and Molecular Medicine, University of California, San Diego)
David E. Neal (Elsevier)
Maxim Khan (Elsevier)
Christopher D. Rosin (Elsevier)
Ramamohan Paturi (Laboratory for Emerging Intelligence, University of California, San Diego)
Leon Bergen (Associate Professor, Computational Linguistics, UCSD)