🤖 AI Summary
This work addresses the challenge of evaluating the factual accuracy of knowledge generated by large language models (LLMs) for biomedical associations, particularly disease–drug, disease–gene, and disease–symptom relationships. We propose a generative–verificative two-stage framework that decouples knowledge generation from rigorous validation. Our method introduces disease-centric prompt engineering and integrates biomedical ontologies, including DOID, ChEBI, and GO, to enable terminology standardization, ontology alignment, and semantic-level verification. Experiments show high accuracy in identifying disease (88%–97%), drug (90%–91%), and gene (88%–98%) entities, with literature support rates of 89%–91% for disease–drug and disease–gene associations. In contrast, symptom identification accuracy is markedly lower (49%–61%), revealing a critical limitation of LLMs in symptom modeling. This study establishes an interpretable, reproducible, and quantitative paradigm for trustworthy biomedical knowledge generation and evaluation.
📝 Abstract
The generative capabilities of large language models (LLMs) present both opportunities for accelerating knowledge-intensive tasks and concerns about the authenticity of the knowledge they produce. To address these concerns, we present a computational approach that systematically evaluates the factual accuracy of biomedical knowledge that an LLM has been prompted to generate. Our approach comprises two processes: the generation of disease-centric associations and their verification against the semantic knowledge encoded in biomedical ontologies. Using ChatGPT as the selected LLM, we designed a set of prompt-engineering processes to generate linkages between diseases, drugs, symptoms, and genes, establishing the grounds for assessment. Experimental results demonstrate high accuracy in identifying disease terms (88%–97%), drug names (90%–91%), and genetic information (88%–98%), as verified against the DOID, ChEBI, SYMPTOM, and GO ontologies, respectively. Symptom term identification accuracy was notably lower (49%–61%). Verification of the associations reveals literature coverage rates of 89%–91% for disease–drug and disease–gene associations, while the low identification accuracy for symptom terms also limited the verification of symptom-related associations (49%–62%).
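The verification stage described above can be sketched minimally: normalize each LLM-generated term and check whether it resolves to an entry in the relevant ontology, then report the fraction that match as the identification accuracy. The label-to-ID maps, term lists, and function names below are illustrative stand-ins (the real pipeline would query DOID, ChEBI, SYMPTOM, and GO), not the paper's actual implementation.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so surface variants can match."""
    return " ".join(term.lower().split())

# Toy label->identifier maps standing in for ontology lookups.
# The labels and IDs here are examples, not verified ontology content.
DISEASE_LABELS = {"asthma": "DOID:2841", "type 2 diabetes mellitus": "DOID:9352"}
DRUG_LABELS = {"metformin": "CHEBI:6801", "salbutamol": "CHEBI:8746"}

def verify_terms(terms, label_map):
    """Split LLM-generated terms into (matched, unmatched) against an ontology map."""
    matched, unmatched = [], []
    for term in terms:
        (matched if normalize(term) in label_map else unmatched).append(term)
    return matched, unmatched

def identification_accuracy(terms, label_map):
    """Fraction of generated terms that resolve to an ontology entry."""
    matched, _ = verify_terms(terms, label_map)
    return len(matched) / len(terms) if terms else 0.0

if __name__ == "__main__":
    generated_drugs = ["Metformin", "salbutamol", "compound X"]
    acc = identification_accuracy(generated_drugs, DRUG_LABELS)
    print(f"drug term identification accuracy: {acc:.2f}")
```

In practice, exact string matching would be replaced by synonym-aware lookup against the full ontology (and literature search for association-level verification), but the accuracy metric is computed the same way: matched terms over generated terms.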