🤖 AI Summary
This study investigates the distinct mechanisms of generalization versus memorization in large language models (LLMs) for biomedical term normalization, i.e., mapping lexical terms to standardized ontology identifiers. Ontologies differ widely in lexical structure and identifier popularity (e.g., Gene Ontology [GO], Human Phenotype Ontology [HPO], gene–protein mappings), and we address this heterogeneity with an interpretable fine-tuning framework: Llama 3.1-8B is fine-tuned and benchmarked against GPT-4o. We conduct fine-grained evaluation across multi-source ontology data and disentangle semantic generalization from rote memorization via embedding-space analysis. Key findings: identifier popularity and term lexicalization strongly modulate fine-tuning efficacy; GO mappings show up to a 77% improvement in term-to-identifier accuracy (memorization) and gene–protein mappings a 13.9% generalization gain, whereas low-lexicalization ontologies such as HPO show limited improvement. Our work establishes theoretical foundations and practical guidelines for controllable LLM deployment in biomedical knowledge standardization.
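The memorization-versus-generalization split described above can be sketched as two accuracy measurements: one on terms seen during fine-tuning, one on held-out terms. The sketch below is illustrative only; `model` is a hypothetical stand-in for an LLM call, and the stub deliberately "memorizes" the training pairs to show how the two scores diverge (the GO identifiers are real, but the setup is a toy).

```python
# Hypothetical sketch: disentangling memorization (training-term accuracy)
# from generalization (validation-term accuracy) in term normalization.

def accuracy(pairs, normalize):
    """Fraction of (term, identifier) pairs the model maps correctly."""
    correct = sum(1 for term, ident in pairs if normalize(term) == ident)
    return correct / len(pairs) if pairs else 0.0

# Toy gold mappings (term -> GO identifier).
train_pairs = [("apoptotic process", "GO:0006915"),
               ("DNA repair", "GO:0006281")]
valid_pairs = [("cell adhesion", "GO:0007155")]

# A stub "model" that has rote-memorized the training set only.
memorized = dict(train_pairs)
def model(term):
    return memorized.get(term, "GO:0000000")

mem_acc = accuracy(train_pairs, model)  # high: pure recall
gen_acc = accuracy(valid_pairs, model)  # low: no semantic transfer
print(f"memorization={mem_acc:.2f} generalization={gen_acc:.2f}")
```

A gap between the two scores, as here, indicates rote learning; a fine-tuned model that truly generalizes would close it on held-out terms.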
📝 Abstract
Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers; this linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein–gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants across all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO and HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely to have been encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
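The lexicalization contrast from the embedding analyses can be illustrated with a toy similarity measure. The sketch below is an assumption-laden stand-in: character-bigram count vectors replace real LLM embeddings, but they make the same point, since a gene symbol shares surface form with its protein name while a GO accession shares nothing with its term.

```python
# Hypothetical sketch of the embedding alignment analysis: cosine
# similarity between a term's vector and its identifier's vector.
# Character-bigram counts stand in for real model embeddings.

from collections import Counter
import math

def embed(text):
    """Toy embedding: character-bigram count vector."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Lexicalized pair: gene symbol and protein name share surface form.
lexicalized = cosine(embed("TP53"), embed("tumor protein p53"))
# Arbitrary pair: a GO accession shares no lexical content with its term.
arbitrary = cosine(embed("apoptotic process"), embed("GO:0006915"))
print(f"lexicalized={lexicalized:.2f} arbitrary={arbitrary:.2f}")
```

Under this proxy, the lexicalized pair scores well above the arbitrary one, mirroring the paper's finding that non-lexicalized identifiers leave the model nothing to generalize from.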