Large language models' management of medications: three performance analyses

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited understanding of general-purpose large language models’ (LLMs) reliability in clinical pharmacotherapy. We conduct the first systematic evaluation of GPT-4o on three critical drug management tasks: dosage form matching, drug–drug interaction (DDI) identification, and prescription instruction generation. Methodologically, we employ TF-IDF cosine similarity, normalized Levenshtein distance, ROUGE-1/L F1 scores, and rigorous clinical expert adjudication, augmented by web-search-based knowledge verification. Results reveal suboptimal performance: only 49% accuracy in dosage form matching, substantial inconsistency in DDI judgment, and a 65.8% error-free rate for prescription sentences. These findings underscore the insufficient clinical fidelity of current foundation LLMs in domain-specific medical reasoning. The study contributes empirical evidence highlighting the urgent need for clinical-domain fine-tuning and standardized, decision-oriented evaluation frameworks—thereby informing the development of trustworthy, clinically deployable LLMs.

📝 Abstract
Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests: mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and a web search, and preparing a medication order sentence when given the medication name.

Methods: Three experiments were completed using GPT-4o. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string, or by manual evaluation by clinicians.

Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug interactions, although it performed better with the search-augmented assessment than with its internal knowledge alone (69.2% vs. 54.7%, p=0.013). However, allowing a web search worsened performance when there was no drug-drug interaction (median % correct 100% without search vs. 40% with search, p<0.001). Finally, GPT-4o performed moderately at preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors.

Conclusions: Model performance was poor overall across all tests. This highlights the need for domain-specific training on clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-4o's accuracy in matching drug names to correct formulations
Testing GPT-4o's consistency in identifying potential drug-drug interactions
Assessing GPT-4o's ability to prepare accurate medication order sentences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated GPT-4o on medication benchmarking tests
Used TF-IDF cosine similarity, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 metrics
Applied web-search augmentation for drug interaction assessment
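The automated metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation, not the authors' code: the tokenization is an assumption (lowercased whitespace splitting), and the cosine here uses raw term-frequency vectors as a stand-in for a corpus-fitted TF-IDF weighting.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase whitespace tokenization (an assumption; the paper
    # does not specify its preprocessing).
    return text.lower().split()


def tf_cosine(a: str, b: str) -> float:
    # Cosine similarity on term-frequency vectors; a full TF-IDF
    # cosine would additionally reweight terms by corpus-wide IDF.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def levenshtein(a: str, b: str) -> int:
    # Character-level edit distance via the standard DP recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def norm_levenshtein(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]: 1 - distance / max length.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def rouge1_f1(candidate: str, reference: str) -> float:
    # ROUGE-1 F1: clipped unigram overlap between candidate and reference.
    cc, cr = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum(min(cc[t], cr[t]) for t in cc)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cc.values())
    recall = overlap / sum(cr.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair (not from the paper's dataset):
response = "take one tablet by mouth daily"
reference = "take one tablet by mouth once daily"
print(round(tf_cosine(response, reference), 3),
      round(norm_levenshtein(response, reference), 3),
      round(rouge1_f1(response, reference), 3))
# → 0.926 0.857 0.923
```

A high score on any one of these metrics does not imply clinical correctness (a single wrong dose digit barely moves Levenshtein similarity), which is presumably why the study pairs them with clinician adjudication.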
👥 Authors
Kelli Henry
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Steven Xu
Department of Computer Science, University of Georgia, Athens, GA, USA
Kaitlin Blotske
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Moriah Cargile
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Erin F. Barreto
Mayo Clinic, Rochester, MN, USA
Brian Murray
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Susan Smith
University of Georgia College of Pharmacy, Athens, GA, USA
Seth R. Bauer
Cleveland Clinic, Department of Pharmacy, Cleveland, OH, USA
Yanjun Gao
University of Colorado; University of Wisconsin–Madison
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Andrea Sikora
Clinical Associate Professor, The University of Georgia College of Pharmacy