🤖 AI Summary
Phenotype term standardization commonly relies on explicit lexical definitions, which incurs high computational overhead and generalizes poorly. Method: This paper proposes a lightweight, definition-free retrieval-augmented approach: context-aware word embeddings are generated with BioBERT; candidate matches are retrieved via vector similarity search against the Human Phenotype Ontology (HPO); and final standardization is performed by prompting a large language model (LLM). Contribution/Results: To our knowledge, this is the first method to eliminate dependence on term definitions entirely, significantly reducing retrieval complexity and enabling direct transfer to other biomedical terminology standardization tasks. Evaluated on the OMIM clinical synopsis dataset, LLM standardization accuracy improves from 62.3% to 90.3%, an approximately 75% relative reduction in error rate, demonstrating both effectiveness and practical utility.
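The retrieval step described above can be sketched as a cosine-similarity search over precomputed embeddings. This is a minimal, illustrative sketch, not the paper's implementation: the toy vectors and HPO labels below are placeholders, where the real pipeline would use BioBERT embeddings of HPO term names and of the query term in context.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_candidates(query_vec, index, k=3):
    """Rank (hpo_id, embedding) pairs by similarity to query_vec
    and return the k best HPO identifiers."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [hpo_id for hpo_id, _ in ranked[:k]]

# Toy index: three HPO terms with made-up 4-d embeddings
# (stand-ins for BioBERT vectors of the terms' names).
index = [
    ("HP:0001250", [0.9, 0.1, 0.0, 0.1]),  # Seizure
    ("HP:0000252", [0.1, 0.8, 0.2, 0.0]),  # Microcephaly
    ("HP:0001263", [0.0, 0.2, 0.9, 0.1]),  # Global developmental delay
]
query = [0.85, 0.15, 0.05, 0.1]  # stand-in for the query term's embedding
print(top_k_candidates(query, index, k=2))  # → ['HP:0001250', 'HP:0000252']
```

The top-k candidates are then passed to the LLM, which selects the final normalization.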
📝 Abstract
Large language models (LLMs) have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances LLM accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM), we demonstrate that the normalization accuracy of a state-of-the-art LLM increases from a baseline of 62.3% without augmentation to 90.3% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
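The augmentation step, where retrieved candidates are handed to the LLM, amounts to assembling a prompt that lists the candidates alongside the term and its clinical context. A hypothetical sketch follows; the paper's exact prompt wording is not given here, so the template text is an assumption.

```python
def build_normalization_prompt(term, context, candidates):
    """Assemble an LLM prompt that asks for the best HPO match.

    candidates: list of (hpo_id, label) pairs produced by the retriever.
    The prompt wording below is illustrative, not the paper's actual prompt.
    """
    lines = [
        f"Normalize the phenotype term '{term}' to an HPO identifier.",
        f"Clinical context: {context}",
        "Candidate HPO terms suggested by the retriever:",
    ]
    lines += [f"- {hpo_id}: {label}" for hpo_id, label in candidates]
    lines.append("Answer with the single best HPO identifier.")
    return "\n".join(lines)

prompt = build_normalization_prompt(
    term="fits",
    context="Episodes of generalized fits beginning in infancy.",
    candidates=[("HP:0001250", "Seizure"), ("HP:0007359", "Focal-onset seizure")],
)
print(prompt)
```

Without the candidate list, the model must recall HPO identifiers unaided (the 62.3% baseline); constraining it to retrieved candidates is what lifts accuracy to 90.3%.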