MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

📅 2026-05-12

🤖 AI Summary
Existing biomedical question-answering benchmarks struggle to evaluate the multi-hop reasoning capabilities of large language models: multiple-choice formats can be solved by answer elimination, and widely circulated datasets are vulnerable to performance saturation and data contamination. To address these limitations, this work introduces MedHopQA, a disease-centric multi-hop QA benchmark in which each question requires synthesizing information from two distinct Wikipedia articles to produce an open-ended, free-text answer. The authors also propose a reusable dataset construction framework built on a structured validation pipeline that combines human annotation, triage, iterative verification, and LLM-as-a-judge validation. Gold answers are augmented with synonym sets from the MONDO, NCBI Gene, and NCBI Taxonomy ontologies to support both lexical and concept-level evaluation. To reduce contamination risk, the 1,000 expert-curated scored questions are embedded within a publicly downloadable set of 10,000 questions whose answers are withheld. The benchmark was introduced as a shared task at BioCreative IX and is designed to remain discriminative as model capabilities improve.
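The hidden-answer design can be pictured with a small sketch: a submission covers every public question, but only the questions that have a withheld gold answer contribute to the score. This is an illustration under stated assumptions, not the actual CodaBench scoring program; the function name `score_submission`, the question ids, and the toy answers are all hypothetical.

```python
def score_submission(predictions, gold_answers):
    """Return accuracy over the hidden scored subset only.

    predictions:  dict of question id -> predicted answer string,
                  expected to cover the full public question set.
    gold_answers: dict of question id -> gold answer string,
                  covering only the withheld scored subset.
    """
    if not gold_answers:
        return 0.0
    correct = 0
    for qid, gold in gold_answers.items():
        # A missing prediction counts as wrong.
        pred = predictions.get(qid, "")
        if pred.strip().lower() == gold.strip().lower():
            correct += 1
    return correct / len(gold_answers)


# Toy data (invented for illustration): four public questions,
# of which only q1 and q4 are secretly scored.
predictions = {
    "q1": "Cystic fibrosis",
    "q2": "BRCA1",
    "q3": "no idea",
    "q4": "wrong answer",
}
gold_answers = {"q1": "cystic fibrosis", "q4": "TP53"}

accuracy = score_submission(predictions, gold_answers)
print(accuracy)  # 0.5
```

Because participants cannot tell which questions are scored, tuning answers against leaderboard feedback for a known small test set becomes harder, which is the stated motivation for embedding the scored questions in a larger pool.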
📝 Abstract
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
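The difference between lexical and concept-level evaluation can be sketched as follows. This is a minimal illustration, assuming that a prediction counts as a concept-level match when it equals the gold answer or any of its synonyms after simple normalization; the paper's actual matching rules are not given here. The helper names (`normalize`, `lexical_match`, `concept_match`) and the synonym list are hypothetical and hand-written, not taken from MONDO.

```python
import re


def normalize(text):
    """Lowercase, turn punctuation into spaces, and trim whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def lexical_match(prediction, gold):
    """Match only if the prediction equals the gold string after normalization."""
    return normalize(prediction) == normalize(gold)


def concept_match(prediction, gold, synonyms):
    """Match if the prediction equals the gold answer or any listed synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


# Toy example: the synonym set is an assumption for illustration only.
gold = "Duchenne muscular dystrophy"
synonyms = ["DMD", "muscular dystrophy, Duchenne type"]
prediction = "DMD"

is_lexical = lexical_match(prediction, gold)
is_concept = concept_match(prediction, gold, synonyms)
print(is_lexical, is_concept)  # False True
```

In this sketch an open-ended answer that names the right concept with a different surface form is rejected by the lexical check but accepted by the concept-level check, which is the kind of case synonym-augmented gold annotations are meant to handle.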
Problem

Research questions and friction points this paper is trying to address.

biomedical question answering
multi-hop reasoning
large language models
benchmark evaluation
reasoning vs pattern matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
biomedical question answering
ontology-grounded evaluation
contamination-resistant benchmark
LLM-as-a-judge
Authors
Rezarta Islamaj
National Library of Medicine, National Institutes of Health
Natural language processing, text mining, machine learning, data mining
Robert Leaman
Staff Scientist, NCBI/NLM/NIH
Natural Language Processing, Machine Learning
Joey Chan
University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, IL, US
Nicholas Wan
University of Michigan Medical School, Ann Arbor, Michigan, US
Qiao Jin
National Library of Medicine, Division of Intramural Research, Bethesda, MD, US
Natalie Xie
National Library of Medicine, Division of Intramural Research, Bethesda, MD, US
John Wilbur
National Library of Medicine, Division of Intramural Research, Bethesda, MD, US
Shubo Tian
National Library of Medicine, Division of Intramural Research, Bethesda, MD, US
Lana Yeganova
Scientist, National Institutes of Health
Machine Learning, Text Mining, Artificial Intelligence
Po-Ting Lai
National Center for Biotechnology Information (NCBI)
Natural language processing, deep learning, text mining
Chih-Hsuan Wei
National Library of Medicine, Division of Intramural Research, Bethesda, MD, US
Yifan Yang
NCBI, NLM, NIH | University of Maryland, College Park
Yao Ge
National Institutes of Health (NIH)
Natural Language Processing, Information Extraction, Biomedical Informatics
Qingqing Zhu
NIH
Zhizheng Wang
Postdoc, Division of Intramural Research (DIR), NLM, NIH
Large Language Models, Representation Learning, Graph Data Mining, Bioinformatics
Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP, Biomedical Informatics, Medical AI, Artificial Intelligence