🤖 AI Summary
Existing LLM benchmarks focus predominantly on in-hospital diagnostic reasoning and neglect post-discharge patient education, a critical component of care continuity. Method: We introduce the first systematic benchmark for discharge communication, built on multi-turn, personalized dialogues between a DoctorAgent and a PatientAgent that simulate diverse clinical scenarios and patient profiles. Our evaluation framework combines structured health document generation, AHRQ guideline compliance checking, LLM-as-judge assessment, and multiple-choice comprehension testing, quantifying performance along dialogue quality, document quality, and patient comprehension. Contribution/Results: Experiments on 18 state-of-the-art LLMs reveal no significant positive correlation between model scale and educational effectiveness, exposing a fundamental trade-off between content prioritization and the application of communicative strategies. These findings highlight structural limitations of current LLMs in delivering personalized, clinically appropriate discharge instructions and underscore the need for task-specific architectural and training innovations.
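To make the simulation pipeline concrete, here is a minimal Python sketch of the multi-turn DoctorAgent/PatientAgent loop, structured topic by topic. Everything in it is an illustrative assumption rather than the benchmark's actual API: the identifiers (`PatientProfile`, `ChatFn`, `run_discharge_dialogue`), the prompts, and the placeholder topic list.

```python
# Illustrative sketch of a DoctorAgent/PatientAgent discharge dialogue loop.
# All names and prompts are hypothetical; they are NOT DischargeSim's API.
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A chat function maps a system prompt plus conversation history to a reply;
# in practice this would wrap an LLM call.
ChatFn = Callable[[str, List[Tuple[str, str]]], str]

@dataclass
class PatientProfile:
    """Psychosocial attributes, mirroring the profiles named in the abstract."""
    health_literacy: str  # e.g., "low", "high"
    education: str        # e.g., "primary school", "college"
    emotion: str          # e.g., "anxious", "calm"

# Placeholder stand-ins for the six clinically grounded discharge topics;
# the paper's exact topic list may differ.
DISCHARGE_TOPICS = [
    "diagnosis recap", "medications", "follow-up care",
    "warning signs", "lifestyle guidance", "patient questions",
]

def run_discharge_dialogue(chat_fn: ChatFn, profile: PatientProfile,
                           turns_per_topic: int = 2) -> List[Tuple[str, str]]:
    """Simulate one post-visit conversation, walking topic by topic."""
    doctor_sys = ("You are a discharge educator. Adapt explanations to a "
                  f"patient with {profile.health_literacy} health literacy.")
    patient_sys = (f"You are a patient with {profile.education} education "
                   f"who feels {profile.emotion}. Ask about anything unclear.")
    history: List[Tuple[str, str]] = []
    for topic in DISCHARGE_TOPICS:
        history.append(("system", f"Current topic: {topic}"))
        for _ in range(turns_per_topic):
            history.append(("doctor", chat_fn(doctor_sys, history)))
            history.append(("patient", chat_fn(patient_sys, history)))
    return history

if __name__ == "__main__":
    # Stub chat function so the sketch runs without an LLM backend.
    echo: ChatFn = lambda sys_prompt, hist: f"[reply after {len(hist)} msgs]"
    profile = PatientProfile("low", "primary school", "anxious")
    print(len(run_discharge_dialogue(echo, profile)), "messages generated")
```

The transcripts such a loop produces would then feed the downstream evaluation stages (document generation, guideline checks, and the comprehension exam).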
📝 Abstract
Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, larger models do not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.
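As one concrete reading of the third evaluation axis, the sketch below scores a simulated patient on a multiple-choice exam after the dialogue, taking accuracy as the education-effectiveness signal. The `MCQ` schema, `answer_fn` interface, and sample question are hypothetical illustrations, not the paper's data format.

```python
# Hypothetical sketch of the downstream comprehension exam (axis 3).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQ:
    question: str
    choices: List[str]  # e.g., ["A) With meals", "B) Only at bedtime"]
    answer_idx: int     # index of the correct choice

def comprehension_score(answer_fn: Callable[[str], int],
                        exam: List[MCQ]) -> float:
    """Fraction of exam questions the simulated patient answers correctly."""
    correct = 0
    for q in exam:
        prompt = q.question + "\n" + "\n".join(q.choices)
        correct += int(answer_fn(prompt) == q.answer_idx)  # agent picks an index
    return correct / len(exam)

if __name__ == "__main__":
    exam = [MCQ("When should you take the antibiotic?",
                ["A) With meals", "B) Only at bedtime"], 0)]
    always_a = lambda prompt: 0  # stub agent: always answers choice A
    print(comprehension_score(always_a, exam))  # -> 1.0
```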