Rethinking Patient Education as Multi-turn Multi-modal Interaction

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current patient education systems are largely confined to plain text and struggle to deliver evidence-based, accessible, and emotionally attuned explanations that integrate medical imaging within multi-turn dialogues. This work proposes MedImageEdu, the first radiology-oriented, multi-turn, multimodal, evidence-driven benchmark for patient education. It couples hidden patient profiles with a drawing tool to enable image-grounded visual reasoning and empathetic responses. The approach integrates vision-language models, a multi-agent dialogue framework, and alignment-aware drawing instruction generation, alongside a five-dimensional evaluation protocol. Experiments on 150 clinical cases show that existing models produce fluent language but fall short in visual grounding, safety, and handling high-emotion dialogues.

📝 Abstract

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.
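
The interaction loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual code: every class, function, and field name below (`PatientProfile`, `phrase_question`, `drawing_tool`, `needs_visual_support`, `doctor_turn`, `run_consultation`) is hypothetical, the drawing tool is a stub that returns a string label instead of an image, and the keyword check stands in for the decision a vision-language model would make.

```python
from dataclasses import dataclass, field


@dataclass
class PatientProfile:
    # Hidden from the doctor agent; field names are illustrative assumptions.
    education_level: str
    health_literacy: str
    personality: str


@dataclass
class Case:
    report_text: str
    image_ids: list


@dataclass
class Response:
    explanation: str
    images: list = field(default_factory=list)


def phrase_question(profile, question):
    # Patient side: the hidden profile shapes how the question is worded.
    if profile.personality == "anxious":
        return "I'm worried. " + question
    return question


def drawing_tool(image_id, instruction):
    # Stub for the benchmark-provided tool: returns a label, not pixels.
    return f"{image_id}+annotated[{instruction}]"


def needs_visual_support(question):
    # Toy heuristic (assumption); a real agent would decide this with a model.
    return any(word in question.lower() for word in ("where", "show", "see"))


def doctor_turn(case, question):
    # Doctor side: sees the report, images, and question, never the profile.
    images = []
    if needs_visual_support(question) and case.image_ids:
        instruction = f"highlight region relevant to: {question}"
        images.append(drawing_tool(case.image_ids[0], instruction))
    explanation = f"Plain-language answer grounded in the report: {case.report_text}"
    return Response(explanation=explanation, images=images)


def run_consultation(case, profile, questions):
    # Multi-turn loop: one patient question and one doctor response per turn.
    return [doctor_turn(case, phrase_question(profile, q)) for q in questions]


if __name__ == "__main__":
    case = Case("small nodule in the right lower lobe", ["img_0"])
    profile = PatientProfile("high school", "low", "anxious")
    for r in run_consultation(case, profile, ["Where is the nodule?", "Is it dangerous?"]):
        print(r.images, "|", r.explanation)
```

The design point the sketch tries to capture is the information asymmetry: the profile influences only the patient's wording, so the doctor side has to infer the patient's needs from the dialogue itself.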

Problem

Research questions and friction points this paper is trying to address.

patient education

multi-turn interaction

multi-modal interaction

visual grounding

radiology communication

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn interaction

multi-modal patient education

evidence-grounded explanation