PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

📅 2026-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of current medical AI diagnosis systems, which predominantly rely on imaging data while neglecting patient-reported symptoms, thereby constraining diagnostic accuracy. To bridge this gap, the authors propose a pre-consultation dialogue framework that synergistically integrates two vision–language models—DocVLM and PatientVLM—to simulate authentic physician–patient interactions through multi-turn visual–linguistic exchanges, effectively fusing radiological images with symptom information. The approach introduces, for the first time, a dual-VLM conversational mechanism that automatically generates symptom descriptions exhibiting high clinical fidelity and coverage, which are then used to construct synthetic consultation data for supervised fine-tuning of diagnostic models. Clinical evaluations confirm the superior quality of the generated symptoms, and diagnostic models fine-tuned within this framework significantly outperform image-only baselines in diagnostic performance.

📝 Abstract
Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. The resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
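The dialogue roll-out described in the abstract can be sketched as a simple loop: the DocVLM produces a follow-up question from the image and the dialogue history, the PatientVLM answers from a symptom profile derived from the ground-truth diagnosis, and the accumulated turns become fine-tuning data. A minimal sketch follows; the `doc_ask` and `patient_answer` functions are hypothetical stand-ins (the paper does not specify these interfaces), with keyword matching substituting for the actual VLM calls.

```python
def doc_ask(image, history):
    """Stand-in for DocVLM: pick a follow-up question given the image
    and dialogue history (here, just indexed by turn number)."""
    turn = len(history) // 2  # two entries (doctor, patient) per turn
    questions = [
        "What is your main complaint?",
        "How long have you had these symptoms?",
        "Do you have any fever or pain?",
    ]
    return questions[turn]

def patient_answer(symptom_profile, question):
    """Stand-in for PatientVLM: answer from a symptom profile derived
    from the ground-truth diagnosis."""
    for keyword, answer in symptom_profile:
        if keyword in question.lower():
            return answer
    return "No, nothing else to report."

def simulate_consultation(image, symptom_profile, max_turns=3):
    """Roll out a multi-turn DocVLM-PatientVLM dialogue; in the framework,
    such transcripts (paired with images and diagnoses) supervise the
    fine-tuning of the DocVLM."""
    history = []
    for _ in range(max_turns):
        q = doc_ask(image, history)
        a = patient_answer(symptom_profile, q)
        history += [("doctor", q), ("patient", a)]
    return history

# Hypothetical symptom profile for one case
profile = [("complaint", "I have a persistent cough."),
           ("long", "About two weeks."),
           ("fever", "Yes, a mild fever in the evenings.")]
dialogue = simulate_consultation(image="chest_xray.png", symptom_profile=profile)
```

In the actual framework, both roles are vision-language models and the DocVLM conditions on the radiological image at every turn; the loop structure, however, is the same.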
Problem

Research questions and friction points this paper is trying to address.

medical diagnosis
vision-language models
patient symptoms
diagnostic accuracy
pre-consultation dialogue
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-Consultation Dialogue
Vision-Language Models
Symptom Elicitation
Medical Diagnosis
Synthetic Patient Interaction
K. Lokesh
Indian Institute of Technology Jodhpur
A. S. Penamakuri
Indian Institute of Technology Jodhpur
Uday Agarwal
Indian Institute of Technology Jodhpur
A. Challa
All India Institute of Medical Sciences New Delhi
Shreya K. Gowda
All India Institute of Medical Sciences New Delhi
Somesh Gupta
All India Institute of Medical Sciences New Delhi
Anand Mishra
IIT Jodhpur
Computer Vision, Machine Learning