Ontology-Guided Query Expansion for Biomedical Document Retrieval using Large Language Models

📅 2025-08-15
🤖 AI Summary
To address retrieval challenges in biomedical question answering caused by the lexical diversity and semantic ambiguity of domain-specific terminology, this paper proposes BMQExpander, an unsupervised, ontology-aware query expansion method. It feeds structured ontological knowledge (concept definitions and semantic relationships) from the UMLS Metathesaurus into large language models (LLMs) so that query rewriting and expansion stay grounded and produce fewer hallucinations, and it supports both sparse and dense retrievers. Its main contribution is the explicit incorporation of a canonical biomedical ontology into an LLM-based query expansion framework, balancing factual accuracy with generative flexibility. Evaluated on NFCorpus, TREC-COVID, and SciFact, the method improves NDCG@10 by up to 22.1% over sparse baselines and by up to 6.5% over the strongest baseline; under query perturbation, it outperforms the strongest baseline by up to 15.7%. The authors also publish their paraphrased versions of the benchmarks.
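For readers unfamiliar with the metric behind these numbers, NDCG@10 can be sketched as below. This is a minimal illustration, not the paper's evaluation code: it assumes the linear-gain form of DCG (some toolkits use an exponential gain instead), and it assumes the reported percentages are relative improvements, which this page does not state. The function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumption: linear gain (rel / log2(rank + 1), with 1-based ranks).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: labels of the retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

def relative_improvement(new_score, base_score):
    """Relative gain in percent (assumed interpretation of the reported numbers)."""
    return 100.0 * (new_score - base_score) / base_score
```

In practice the score is averaged over all queries in a benchmark before systems are compared.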

📝 Abstract
Question Answering (QA) over large biomedical document collections depends on effective document retrieval, which remains challenging because of domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge (definitions and relationships) from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander achieves superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks (NFCorpus, TREC-COVID, and SciFact), with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, in contrast to supervised baselines, BMQExpander generalizes robustly under query perturbation, achieving up to 15.7% improvement over the strongest baseline in that setting. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander produces fewer hallucinations than other LLM-based query expansion baselines.
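The general shape of an ontology-aware expansion step can be sketched as follows. This is a rough illustration under stated assumptions, not BMQExpander itself: the abstract does not describe the concept linker, the prompt, or how the LLM output is merged with the query, so `TOY_ONTOLOGY`, `link_concepts`, `build_prompt`, and `expand_query` are all hypothetical, and the toy dictionary stands in for a real UMLS Metathesaurus lookup.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Concept:
    name: str
    definition: str
    related: list = field(default_factory=list)

# Hypothetical stand-in for UMLS: maps a surface term to a concept record.
TOY_ONTOLOGY = {
    "mi": Concept(
        name="myocardial infarction",
        definition="Necrosis of heart muscle caused by interrupted blood supply.",
        related=["heart attack", "coronary thrombosis"],
    ),
}

def link_concepts(query, ontology):
    """Naive concept linking: exact match on lowercased alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [ontology[t] for t in tokens if t in ontology]

def build_prompt(query, concepts):
    """Put ontology definitions and relations in the prompt to ground the LLM."""
    lines = ["Expand the search query using only the facts listed below.", f"Query: {query}"]
    for c in concepts:
        lines.append(f"Concept: {c.name}")
        lines.append(f"Definition: {c.definition}")
        lines.append(f"Related terms: {', '.join(c.related)}")
    return "\n".join(lines)

def expand_query(query, ontology, llm: Callable[[str], str]):
    """Append the LLM's grounded expansion to the original query (assumed merge strategy)."""
    prompt = build_prompt(query, link_concepts(query, ontology))
    return f"{query} {llm(prompt)}".strip()
```

Here `llm` is any callable that maps a prompt string to generated text, so the sketch can be exercised with a stub before a real model is plugged in. The expanded string could then be passed to a sparse retriever such as BM25 or encoded by a dense retriever.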
Problem

Research questions and friction points this paper is trying to address.

Enhancing biomedical document retrieval effectiveness using ontology and LLMs
Addressing domain-specific vocabulary and semantic ambiguity in queries
Improving retrieval performance on biomedical IR benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ontology-aware query expansion pipeline
Combines UMLS knowledge with LLMs
Enhances biomedical document retrieval performance
Zabir Al Nazi
University of California Riverside
Information Retrieval, Machine Learning, Medical AI
Vagelis Hristidis
University of California Riverside, Riverside, California, USA
Aaron Lawson McLean
Friedrich Schiller University Jena, Jena, Germany
Jannat Ara Meem
University of California Riverside, Riverside, California, USA
Md Taukir Azam Chowdhury
University of California Riverside, Riverside, California, USA