RxSafeBench: Identifying Medication Safety Issues of Large Language Models in Simulated Consultation

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack rigorous evaluation of medication safety, owing to the scarcity of real-world clinical data and the absence of dedicated benchmarks. To address this, the authors introduce RxSafeBench, a comprehensive, drug-safety-focused evaluation benchmark. The method features: (1) a simulated clinical consultation framework that generates high-quality, risk-aware dialogues; (2) RxRisk DB, a medication safety knowledge base of 6,725 contraindications, 28,781 drug–drug interactions, and 14,906 indication–drug pairs; and (3) a two-stage filtering pipeline that ensures clinical realism and pharmacological accuracy. The final benchmark comprises 2,443 diverse, clinically grounded consultation scenarios. Empirical evaluation reveals that state-of-the-art LLMs struggle to integrate contraindication and drug-interaction knowledge, particularly when risks are implied rather than stated explicitly. The work establishes a systematic evaluation paradigm and provides infrastructure for advancing LLMs' medication safety capabilities.

📝 Abstract
Numerous medical systems powered by Large Language Models (LLMs) have achieved remarkable progress in diverse healthcare tasks. However, research on their medication safety remains limited due to the lack of real-world datasets, constrained by privacy and accessibility issues. Moreover, evaluation of LLMs in realistic clinical consultation settings, particularly regarding medication safety, is still underexplored. To address these gaps, we propose a framework that simulates and evaluates clinical consultations to systematically assess the medication safety capabilities of LLMs. Within this framework, we generate inquiry-diagnosis dialogues with embedded medication risks and construct a dedicated medication safety database, RxRisk DB, containing 6,725 contraindications, 28,781 drug interactions, and 14,906 indication-drug pairs. A two-stage filtering strategy ensures clinical realism and professional quality, resulting in the benchmark RxSafeBench with 2,443 high-quality consultation scenarios. We evaluate leading open-source and proprietary LLMs using structured multiple-choice questions that test their ability to recommend safe medications under simulated patient contexts. Results show that current LLMs struggle to integrate contraindication and interaction knowledge, especially when risks are implied rather than explicit. Our findings highlight key challenges in ensuring medication safety in LLM-based systems and provide insights into improving reliability through better prompting and task-specific tuning. RxSafeBench offers the first comprehensive benchmark for evaluating medication safety in LLMs, advancing safer and more trustworthy AI-driven clinical decision support.
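The multiple-choice protocol described above can be sketched as a simple evaluation loop: the model reads a simulated consultation, picks one lettered option, and accuracy is the fraction of scenarios where the chosen medication is the safe one. This is a minimal illustration, not the paper's actual harness; `query_llm` is a hypothetical stand-in for any chat-model API call, and the scenario schema is assumed.

```python
def query_llm(prompt: str) -> str:
    # Placeholder: a real harness would call an LLM API here.
    return "B"

def evaluate(scenarios):
    """Score MCQ accuracy over simulated consultation scenarios.

    Each scenario is a dict with 'dialogue' (the consultation text),
    'options' (letter -> drug name), and 'answer' (the letter of the
    safe medication).
    """
    correct = 0
    for s in scenarios:
        opts = "\n".join(f"{k}. {v}" for k, v in sorted(s["options"].items()))
        prompt = (
            f"{s['dialogue']}\n\n"
            f"Which medication is safe for this patient?\n{opts}\n"
            "Answer with a single letter."
        )
        # Take the first character of the reply as the chosen option.
        choice = query_llm(prompt).strip()[:1].upper()
        correct += choice == s["answer"]
    return correct / len(scenarios)
```

In a real run, `query_llm` would be swapped for the model under test, and the scenarios would come from the 2,443 RxSafeBench items.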
Problem

Research questions and friction points this paper is trying to address.

Evaluating the medication safety of LLMs in clinical consultations in the absence of real-world datasets
Assessing LLMs' ability to recommend safe medications with contraindication knowledge
Identifying challenges in integrating drug interaction knowledge within LLM systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulates clinical consultations to assess medication safety
Constructs RxRisk DB with contraindications and drug interactions
Uses two-stage filtering for realistic consultation scenarios
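The core safety check a database like RxRisk DB enables can be sketched as filtering candidate drugs against contraindication and drug-drug interaction pairs. The data structures and drug names below are illustrative assumptions, not the paper's actual schema:

```python
# Illustrative (condition, drug) pairs the patient must not receive.
CONTRAINDICATIONS = {
    ("pregnancy", "warfarin"),
    ("renal_impairment", "metformin"),
}

# Illustrative unordered drug-drug interaction pairs.
INTERACTIONS = {
    frozenset({"warfarin", "aspirin"}),
    frozenset({"simvastatin", "clarithromycin"}),
}

def safe_candidates(conditions, current_meds, candidates):
    """Return candidates with no contraindication against the patient's
    conditions and no interaction with their current medications."""
    safe = []
    for drug in candidates:
        # Exclude drugs contraindicated by any patient condition.
        if any((c, drug) in CONTRAINDICATIONS for c in conditions):
            continue
        # Exclude drugs that interact with any current medication.
        if any(frozenset({drug, m}) in INTERACTIONS for m in current_meds):
            continue
        safe.append(drug)
    return safe
```

The benchmark's hard cases arise when such facts must be inferred from the dialogue (e.g., pregnancy only implied), rather than looked up explicitly as in this sketch.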