🤖 AI Summary
Current AI-based clinical summarization models overemphasize biomedical information while neglecting patients' values, preferences, and concerns, undermining patient-centered care. To address this gap, we propose the first patient-centered clinical dialogue summarization benchmark, integrating patient and clinician perspectives through a mixed-methods framework and a rigorous annotation protocol. Using open-weight LLMs, including Llama-3.1-8B and Mistral-8B, with zero-shot and few-shot prompting, we evaluate outputs with ROUGE-L, BERTScore, and qualitative expert assessment. Results show that the best-performing models approach expert-level completeness and fluency, but correctness and the capture of patient-centered, subjective aspects still favor human-written summaries. This work establishes both a theoretical foundation and an implementation paradigm for trustworthy, human-centered AI in clinical summarization.
📝 Abstract
Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility, and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. The findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for the remaining 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot performance by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.
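As a concrete illustration of one of the automatic metrics above: ROUGE-L scores a candidate summary against a gold-standard reference by the length of their longest common subsequence (LCS). The sketch below is a minimal, self-contained version with naive whitespace tokenization; it is not the paper's evaluation pipeline, which would typically use an established package such as `rouge-score`.

```python
# Minimal ROUGE-L F-measure sketch (illustrative only, not the paper's
# evaluation code). Tokenization here is plain whitespace splitting;
# real pipelines usually add lowercasing and stemming.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f(reference, candidate):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref_toks, cand_toks = reference.split(), candidate.split()
    lcs = lcs_length(ref_toks, cand_toks)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_toks)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref_toks)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

# Example: a model summary that drops one detail from the reference.
ref = "the patient values daily walks with family"
cand = "the patient values walks with family"
print(round(rouge_l_f(ref, cand), 3))  # LCS = 6 tokens -> F1 = 12/13 ≈ 0.923
```

BERTScore, by contrast, compares contextual token embeddings rather than surface n-grams, which is why the two metrics can rank models differently, as in the results above.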