RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks cast medication recommendation as admission-level prediction over coarse drug codes, so they cannot test whether a large language model can make fine-grained prescribing decisions at a specific time point as a patient's condition evolves. This work introduces RxEval, a prescription-level multiple-choice benchmark in which each question presents a patient profile and a time-ordered clinical trajectory, and each option is a drug–dose–route triple. Correct options are real prescriptions drawn from electronic health records; distractors are patient-specific and generated through reasoning-chain perturbation. The benchmark contains 1,547 questions covering 584 patients, 18 diagnostic categories, and 969 unique medications. Across 16 LLMs, F1 ranges from 45.18% to 77.10% and the best Exact Match is only 46.10%, indicating that the benchmark is both difficult and discriminative.
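To make the question format concrete, here is a minimal Python sketch of how one such multiple-choice item might be represented. The class names, field names, and placeholder values are all hypothetical and chosen for illustration; the paper's actual data schema is not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Option:
    """One answer option: a medication-dose-route triple (hypothetical schema)."""
    drug: str
    dose: str
    route: str
    is_real: bool  # True if taken from the actual prescription, False for a distractor


@dataclass
class Question:
    """One benchmark item at a single prescribing time point (hypothetical schema)."""
    patient_profile: str
    trajectory: List[str]  # time-ordered clinical events before the decision
    options: List[Option]

    def correct_labels(self) -> Set[str]:
        # Options are labeled A, B, C, ... in list order.
        return {
            chr(ord("A") + i)
            for i, option in enumerate(self.options)
            if option.is_real
        }


# Placeholder content only; not real clinical data.
question = Question(
    patient_profile="<patient profile text>",
    trajectory=["<event 1>", "<event 2>"],
    options=[
        Option("drug_a", "dose_a", "route_a", is_real=True),
        Option("drug_b", "dose_b", "route_b", is_real=False),
        Option("drug_c", "dose_c", "route_c", is_real=True),
    ],
)
print(sorted(question.correct_labels()))  # ['A', 'C']
```

Keeping the real-versus-distractor flag on each option means the answer key is derived from the options themselves, so it cannot drift out of sync if options are reordered.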

📝 Abstract

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability through multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from among real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18% to 77.10% across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
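The gap between F1 and Exact Match is easier to read with the two metrics side by side. The sketch below assumes each question has a set of correct option labels, the model returns a set of selected labels, F1 is computed per question on those sets and then averaged, and Exact Match requires the two sets to be identical. These are assumptions for illustration; the abstract does not state how the paper aggregates its scores, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def question_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between selected and correct option labels."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(predictions: List[Set[str]], golds: List[Set[str]]) -> Dict[str, float]:
    """Average per-question F1 and Exact Match over a benchmark (assumed aggregation)."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equal in length")
    f1_values = [question_f1(p, g) for p, g in zip(predictions, golds)]
    exact_values = [1.0 if p == g else 0.0 for p, g in zip(predictions, golds)]
    count = len(golds)
    return {"f1": sum(f1_values) / count, "exact_match": sum(exact_values) / count}


# Two toy questions: the first is fully correct, the second misses one option.
result = score(
    predictions=[{"A", "C"}, {"B"}],
    golds=[{"A", "C"}, {"B", "D"}],
)
print(round(result["f1"] * 100, 2))           # 83.33
print(round(result["exact_match"] * 100, 2))  # 50.0
```

In the toy run, the partially correct second answer still earns F1 credit but counts as a miss for Exact Match, which is why a model can post a high F1 while its Exact Match stays much lower.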

Problem

Research questions and friction points this paper is trying to address.

medication recommendation

LLM evaluation

prescription-level benchmark

inpatient prescribing

clinical decision support

Innovation

Methods, ideas, or system contributions that make the work stand out.

prescription-level evaluation

medication recommendation

reasoning-chain perturbation