🤖 AI Summary
Existing benchmarks frame medication recommendation as admission-level prediction over coarse drug codes, so they cannot evaluate whether large language models make fine-grained prescribing decisions at specific time points as a patient's condition evolves. To address this gap, this work introduces RxEval, a prescription-level benchmark grounded in time-ordered clinical trajectories and posed as multiple-choice questions over medication-dose-route triples. Built from real prescriptions, the benchmark pairs each correct option with patient-specific distractors generated through reasoning-chain perturbation, and comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 large language models yields F1 scores ranging from 45.18% to 77.10%, with the best Exact Match at only 46.10%, indicating that the benchmark is both challenging and discriminative.
📝 Abstract
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability through multiple-choice questions: each question presents a detailed patient profile and a time-ordered clinical trajectory, and requires selecting specific medication-dose-route triples from a pool of real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18% to 77.10% across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
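The gap between F1 and Exact Match reported above is easier to read with a concrete scoring sketch. The abstract does not specify how the two metrics are computed, so the following is only an illustration under stated assumptions: each question has a set of correct option labels, a model answers with a set of selected labels, F1 is computed per question and then averaged, and Exact Match requires the selected set to equal the correct set. The function names and the option labels are hypothetical and are not taken from RxEval.

```python
from __future__ import annotations


def question_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 for one multi-select question (assumed metric form)."""
    if not predicted and not gold:
        return 1.0  # nothing to select and nothing selected
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty prediction or empty gold set
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(predictions: list[set[str]], golds: list[set[str]]) -> dict[str, float]:
    """Average per-question F1 and Exact Match, both as percentages."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("need one prediction per question and at least one question")
    f1_values = [question_f1(p, g) for p, g in zip(predictions, golds)]
    exact_hits = [p == g for p, g in zip(predictions, golds)]
    n = len(golds)
    return {
        "f1": 100 * sum(f1_values) / n,
        "exact_match": 100 * sum(exact_hits) / n,
    }


if __name__ == "__main__":
    # Hypothetical option labels; each label stands for one
    # medication-dose-route triple offered in a question.
    golds = [{"A", "C"}, {"B", "D"}, {"C"}]
    predictions = [{"A", "C"}, {"B"}, {"A", "B"}]
    result = score(predictions, golds)
    print({name: round(value, 2) for name, value in result.items()})
```

In this toy run the second answer is partially correct, so it contributes to F1 but not to Exact Match, which is one way a model can reach a comparatively high F1 while its Exact Match stays much lower.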