🤖 AI Summary
To address retrieval challenges in biomedical question answering caused by the lexical diversity and semantic ambiguity of domain-specific terminology, this paper proposes BMQExpander, an unsupervised, semantics-driven query expansion method. It integrates structured ontological knowledge - definitions and semantic relationships - from the UMLS Metathesaurus into large language models (LLMs) to enable controllable, low-hallucination query rewriting and expansion, seamlessly supporting both sparse and dense retrievers. Its key innovation lies in the first explicit incorporation of a canonical biomedical ontology into an LLM-based query expansion framework, balancing factual accuracy with generative flexibility. Evaluated on NFCorpus, TREC-COVID, and SciFact, the method achieves up to a 22.1% improvement in NDCG@10 over sparse baselines and a 6.5% gain over the strongest baseline; under query perturbations, it achieves up to a 15.7% improvement over the strongest baseline. The authors also publicly release their paraphrased benchmark datasets.
📝 Abstract
Effective Question Answering (QA) over large biomedical document collections depends on strong document retrieval, which remains challenging due to the domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge - definitions and relationships - from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander achieves superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks - NFCorpus, TREC-COVID, and SciFact - with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, BMQExpander generalizes robustly under query perturbation settings, in contrast to supervised baselines, achieving up to 15.7% improvement over the strongest baseline. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander produces fewer hallucinations than other LLM-based query expansion baselines.
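To make the core idea concrete, below is a minimal, hypothetical sketch of ontology-grounded query expansion. It is not the authors' implementation: a tiny in-memory dictionary stands in for a UMLS Metathesaurus lookup, and simple string templating stands in for the LLM rewriting step; all names here are illustrative assumptions.

```python
# Illustrative sketch of ontology-aware query expansion (NOT the paper's code).
# MINI_ONTOLOGY is a hypothetical stand-in for a UMLS concept lookup:
# term -> (definition, related terms / synonyms).
MINI_ONTOLOGY = {
    "myocardial infarction": (
        "necrosis of heart muscle caused by loss of blood supply",
        ["heart attack", "MI", "coronary thrombosis"],
    ),
}

def expand_query(query: str, ontology: dict) -> str:
    """Append definitions and synonyms of any ontology terms found in the
    query. Grounding the expansion in canonical definitions is what keeps
    a downstream LLM rewrite (omitted here) close to factual vocabulary."""
    expansions = []
    for term, (definition, related) in ontology.items():
        if term in query.lower():
            expansions.append(definition)
            expansions.extend(related)
    if not expansions:
        return query  # no known concepts: leave the query unchanged
    return f"{query} ({'; '.join(expansions)})"

print(expand_query("treatment after myocardial infarction", MINI_ONTOLOGY))
```

In the full pipeline described above, the grounded expansion text would be passed to an LLM prompt rather than concatenated verbatim, and the result fed to a sparse or dense retriever.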