Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

📅 2026-05-12

🤖 AI Summary
This overview of the BioCreative IX MedHopQA shared task addresses multi-hop question answering that requires integrating biomedical information about diseases, genes, and chemicals, with particular emphasis on rare diseases. The organizers constructed a new dataset of 1,000 two-hop question-answer pairs, each requiring a system to synthesize information from two distinct Wikipedia pages, and ran an international evaluation campaign to benchmark the multi-hop reasoning of large language models. Submissions were scored with both surface string comparison (exact match) and a concept-level metric, the MedCPT score, which improves answer assessment when correct responses differ in surface form. Approaches covered in the study include retrieval-augmented generation (RAG), zero-shot reasoning, Wikipedia-based knowledge retrieval, and concept embedding matching. The top-ranked system achieved 89.30% MedCPT F1 and 87.30% exact match, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline, indicating that retrieval-based strategies were critical for strong performance.
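To make the two-hop requirement concrete, here is a minimal sketch of answering a question by chaining two page lookups. The `PAGES` dictionary, the regular expressions, and the function names are all hypothetical stand-ins for illustration; the page texts are toy sentences rather than quoted Wikipedia content, and this is not how any participating system worked.

```python
import re

# Toy stand-ins for two Wikipedia pages (illustrative text, not quoted).
PAGES = {
    "Marfan syndrome": "Marfan syndrome is caused by mutations in the FBN1 gene.",
    "FBN1": "The FBN1 gene is located on chromosome 15.",
}


def first_hop(disease):
    """Hop 1: read the disease page and extract the bridging gene symbol."""
    match = re.search(r"mutations in the (\w+) gene", PAGES.get(disease, ""))
    return match.group(1) if match else None


def second_hop(gene):
    """Hop 2: read the gene page and extract its chromosomal location."""
    match = re.search(r"located on (chromosome \w+)", PAGES.get(gene, ""))
    return match.group(1) if match else None


def answer_two_hop(disease):
    """Answer 'On which chromosome is the gene mutated in <disease> located?'"""
    gene = first_hop(disease)
    if gene is None:
        return None  # the bridging entity was not found, so hop 2 cannot run
    return second_hop(gene)


print(answer_two_hop("Marfan syndrome"))
```

Neither page alone contains the answer: the first supplies only the bridging entity, and the second is useful only once that entity is known.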
📝 Abstract
Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa; benchmark: https://www.codabench.org/competitions/7609/
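The difference between surface string comparison and concept-level scoring can be sketched as follows. This is an illustration under stated assumptions, not the track's official scorer: the normalization rules, the 0.9 threshold, the `concept_match` helper, and the hand-made `TOY_VECTORS` are all hypothetical, and a real setup would plug in an actual embedding model in place of `toy_embed`.

```python
import math
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred, gold):
    """Surface comparison: true only if the normalized strings are identical."""
    return normalize(pred) == normalize(gold)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_match(pred, gold, embed, threshold=0.9):
    """Concept-level comparison: embeddings must be close (assumed threshold)."""
    return cosine(embed(pred), embed(gold)) >= threshold


# Hypothetical hand-made vectors standing in for a real embedding model.
TOY_VECTORS = {
    "vitamin c": [0.90, 0.10, 0.00],
    "ascorbic acid": [0.88, 0.12, 0.02],
    "vitamin d": [0.10, 0.90, 0.00],
}


def toy_embed(text):
    return TOY_VECTORS.get(normalize(text), [0.0, 0.0, 0.0])


print(exact_match("Ascorbic acid", "Vitamin C"))               # False
print(concept_match("Ascorbic acid", "Vitamin C", toy_embed))  # True
print(concept_match("Vitamin D", "Vitamin C", toy_embed))      # False
```

In this toy setup a synonymous answer fails exact match but passes the concept-level check, while a different concept fails both. That is the kind of case in which concept-level evaluation changes the assessment.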
Problem

Research questions and friction points this paper is trying to address.

multi-hop question answering
biomedical domain
information integration
medical question answering
complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
retrieval-augmented generation
biomedical question answering
concept-level evaluation
rare diseases
Authors

Rezarta Islamaj
National Library of Medicine, National Institutes of
Natural language processing, text mining, machine learning, data mining

Joey Chan
National Library of Medicine (NLM), National Institutes of Health (NIH), MD, 20894, Bethesda, USA; University of Illinois at Urbana Champaign

Robert Leaman
Staff Scientist, NCBI/NLM/NIH
Natural Language Processing, Machine Learning

Jongmyung Jung
Korea University

Hyeongsoon Hwang
Korea University

Quoc-An Nguyen
VNU University of Engineering and Technology, Hanoi, Vietnam

Hoang-Quynh Le
VNU University of Engineering and Technology, Hanoi, Vietnam

Harikrishnan Gurushankar Saisudha
Concordia University, Montreal, QC, CA

Ganesh Chandrasekar
Concordia University, Montreal, QC, CA

Rustam R. Taktashov
Institute of Biomedical Chemistry (IBMC), 10 bld. 8, Pogodinskaya str., 119121 Moscow, Russia

Nadezhda Yu. Bizyukova
Institute of Biomedical Chemistry (IBMC), 10 bld. 8, Pogodinskaya str., 119121 Moscow, Russia

Sofia I. R. Conceição
LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal

Paulo R. C. Lopes
LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisbon, Portugal

Reem Abdel Salam
Faculty of Engineering, Computer Engineering Department Cairo University; CaresAI, Australia

Mary Adewunmi
Menzies School of Health Research, Charles Darwin University, NT, Australia; CaresAI, Australia

Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP, Biomedical Informatics, Medical AI, Artificial Intelligence