Large language models' management of medications: three performance analyses

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited understanding of general-purpose large language models’ (LLMs) reliability in clinical pharmacotherapy. We conduct the first systematic evaluation of GPT-4o on three critical drug management tasks: dosage form matching, drug–drug interaction (DDI) identification, and prescription instruction generation. Methodologically, we employ TF-IDF cosine similarity, normalized Levenshtein distance, ROUGE-1/L F1 scores, and rigorous clinical expert adjudication, augmented by web-search-based knowledge verification. Results reveal suboptimal performance: only 49% accuracy in dosage form matching, substantial inconsistency in DDI judgment, and a 65.8% error-free rate for prescription sentences. These findings underscore the insufficient clinical fidelity of current foundation LLMs in domain-specific medical reasoning. The study contributes empirical evidence highlighting the urgent need for clinical-domain fine-tuning and standardized, decision-oriented evaluation frameworks—thereby informing the development of trustworthy, clinically deployable LLMs.

📝 Abstract
Background: Large language models (LLMs) can be useful in diagnosing medical conditions, but few studies have evaluated their consistency in recommending appropriate medication regimens. The purpose of this evaluation was to test GPT-4o on three medication benchmarking tests: mapping a drug name to its correct formulation, identifying drug-drug interactions using both its internal knowledge and a web search, and preparing a medication order sentence when given the medication name.

Methods: Three experiments were completed using GPT-4o. Accuracy was quantified by computing cosine similarity on TF-IDF vectors, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 between each response and its reference string, or by manual evaluation by clinicians.

Results: GPT-4o performed poorly on drug-formulation matching, with frequent omissions of available drug formulations (mean 1.23 per medication) and hallucinations of formulations that do not exist (mean 1.14 per medication). Only 49% of tested medications were correctly matched to all available formulations. Accuracy decreased for medications with more formulations (p<0.0001). GPT-4o was also inconsistent at identifying drug-drug interactions, although it performed better with the search-augmented assessment than with its internal knowledge alone (69.2% vs. 54.7%, p=0.013). However, allowing a web search worsened performance when there was no drug-drug interaction (median % correct 100% without search vs. 40% with search, p<0.001). Finally, GPT-4o performed moderately at preparing a medication order sentence, with only 65.8% of medication order sentences containing no medication or abbreviation errors.

Conclusions: Model performance was poor overall across all tests. This highlights the need for domain-specific training on clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-4o's accuracy in matching drug names to correct formulations
Testing GPT-4o's consistency in identifying potential drug-drug interactions
Assessing GPT-4o's ability to prepare accurate medication order sentences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated GPT-4o on medication benchmarking tests
Used TF-IDF cosine similarity, normalized Levenshtein similarity, and ROUGE-1/ROUGE-L F1 metrics
Applied web-search augmentation for drug interaction assessment
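The automated metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation, not the authors' code: the tokenization is an assumption (lowercased whitespace splitting), and the cosine here uses raw term-frequency vectors as a stand-in for a corpus-fitted TF-IDF weighting.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase whitespace tokenization (an assumption; the paper
    # does not specify its preprocessing).
    return text.lower().split()


def tf_cosine(a: str, b: str) -> float:
    # Cosine similarity on term-frequency vectors; a full TF-IDF
    # cosine would additionally reweight terms by corpus-wide IDF.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def levenshtein(a: str, b: str) -> int:
    # Character-level edit distance via the standard DP recurrence.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def norm_levenshtein(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]: 1 - distance / max length.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def rouge1_f1(candidate: str, reference: str) -> float:
    # ROUGE-1 F1: clipped unigram overlap between candidate and reference.
    cc, cr = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum(min(cc[t], cr[t]) for t in cc)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cc.values())
    recall = overlap / sum(cr.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair (not from the paper's dataset):
response = "take one tablet by mouth daily"
reference = "take one tablet by mouth once daily"
print(round(tf_cosine(response, reference), 3),
      round(norm_levenshtein(response, reference), 3),
      round(rouge1_f1(response, reference), 3))
# → 0.926 0.857 0.923
```

A high score on any one of these metrics does not imply clinical correctness (a single wrong dose digit barely moves Levenshtein similarity), which is presumably why the study pairs them with clinician adjudication.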
👥 Authors
Kelli Henry
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Steven Xu
Department of Computer Science, University of Georgia, Athens, GA, USA
Kaitlin Blotske
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Moriah Cargile
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Erin F. Barreto
Mayo Clinic, Rochester, MN, USA
Brian Murray
University of Colorado Skaggs School of Pharmacy, Aurora, CO, USA
Susan Smith
University of Georgia College of Pharmacy, Athens, GA, USA
Seth R. Bauer
Cleveland Clinic, Department of Pharmacy, Cleveland, OH, USA
Yanjun Gao
University of Colorado; University of Wisconsin–Madison
Tianming Liu
Distinguished Research Professor of Computer Science, University of Georgia
Andrea Sikora
Clinical Associate Professor, The University of Georgia College of Pharmacy