RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

📅 2026-05-14

🤖 AI Summary
Existing benchmarks struggle to evaluate how well large language models make fine-grained prescription recommendations at specific time points in an evolving clinical course. To address this gap, this work introduces a prescription-level evaluation paradigm grounded in time-ordered clinical trajectories and builds a multiple-choice benchmark whose answer options are medication-dose-route triples. Drawing on real-world electronic health records, the benchmark mixes real prescriptions with patient-specific distractors generated through reasoning-chain perturbation, and comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 large language models yields F1 scores ranging from 45.18% to 77.10%, with the best Exact Match at only 46.10%, indicating that the benchmark is both challenging and discriminative.
📝 Abstract
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
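The abstract reports F1 and Exact Match over selected medication-dose-route triples but does not spell out how they are computed. The following is a minimal sketch under stated assumptions: each question is scored by set overlap between the chosen options and the real prescriptions, and both metrics are averaged over questions. The `Rx` type, the function names, and the toy prescriptions are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple, Set, Tuple


class Rx(NamedTuple):
    """One answer option: a medication-dose-route triple (illustrative field names)."""
    drug: str
    dose: str
    route: str


def f1_score(predicted: Set[Rx], gold: Set[Rx]) -> float:
    """Set-based F1 between the options a model selected and the real prescriptions."""
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: Set[Rx], gold: Set[Rx]) -> bool:
    """True only when the selected set equals the gold set exactly."""
    return predicted == gold


def evaluate(results: List[Tuple[Set[Rx], Set[Rx]]]) -> Tuple[float, float]:
    """Return (mean F1, Exact Match rate) as percentages over (predicted, gold) pairs."""
    n = len(results)
    mean_f1 = 100 * sum(f1_score(p, g) for p, g in results) / n
    em = 100 * sum(exact_match(p, g) for p, g in results) / n
    return mean_f1, em


# Toy data, invented for illustration only.
heparin = Rx("heparin", "5000 units", "SC")
furosemide_iv = Rx("furosemide", "40 mg", "IV")
furosemide_po = Rx("furosemide", "40 mg", "PO")  # distractor: right drug and dose, wrong route

results = [
    ({heparin, furosemide_po}, {heparin, furosemide_iv}),  # one of two triples correct
    ({heparin}, {heparin}),                                # fully correct
]
print(evaluate(results))
```

Under these assumptions, a single wrong route is enough to fail Exact Match on a question while still earning partial F1 credit, which is one plausible reason the reported Exact Match sits well below the reported F1.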
Problem

Research questions and friction points this paper is trying to address.

medication recommendation
LLM evaluation
prescription-level benchmark
inpatient prescribing
clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

prescription-level evaluation
medication recommendation
reasoning-chain perturbation
clinical trajectory modeling
LLM benchmarking