DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?

📅 2025-05-30
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study investigates whether vision-language models (VLMs) possess evidence-based, doctor-like clinical reasoning capabilities in medical image interpretation, beyond superficial pattern matching. Method: We introduce DrVD-Bench, the first multimodal benchmark for clinical visual reasoning, comprising 7,789 image-question pairs across 20 tasks, 17 diagnostic categories, and 5 medical imaging modalities. Clinical reasoning is formalized as three structured stages: modality identification, lesion localization, and diagnostic inference. A triple-layered evaluation framework, grounded in multi-granularity human annotations, assesses visual evidence understanding, reasoning trajectory fidelity, and report generation, enabling cross-model, cross-modal, and cross-task comparability. Contribution/Results: Experiments on 19 VLMs reveal a pronounced performance drop as reasoning complexity increases; while some models exhibit nascent human-like reasoning behaviors, widespread reliance on dataset shortcuts and insufficient visual-semantic grounding persist, undermining clinical reliability.
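For concreteness, here is a minimal sketch of how a single benchmark item might be represented, reflecting the multi-granularity annotations described above. `DrVDItem` and its field names are hypothetical illustrations, not the authors' released schema.

```python
# Hypothetical sketch of one DrVD-Bench record; the class and field names are
# illustrative, not the authors' released schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DrVDItem:
    image_path: str                       # CT, MRI, ultrasound, radiograph, or pathology image
    modality: str                         # ground-truth imaging modality
    task_type: str                        # one of the 20 task types
    diagnostic_category: str              # one of the 17 diagnostic categories
    reasoning_stage: str                  # "modality" | "lesion" | "diagnosis"
    question: str                         # the image-grounded question
    options: list[str] = field(default_factory=list)  # empty for open-ended tasks
    answer: str = ""                      # ground-truth answer string
    reference_report: Optional[str] = None  # human report, for report-generation items
```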

📝 Abstract
Vision-language models (VLMs) exhibit strong zero-shot generalization on natural images and show early promise in interpretable medical image analysis. However, existing benchmarks do not systematically evaluate whether these models truly reason like human clinicians or merely imitate superficial patterns. To address this gap, we propose DrVD-Bench, the first multimodal benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report Generation Evaluation, comprising a total of 7,789 image-question pairs. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is explicitly structured to reflect the clinical reasoning workflow, from modality recognition to lesion identification and diagnosis. We benchmark 19 VLMs, spanning general-purpose and medical-specific as well as open-source and proprietary models, and observe that performance drops sharply as reasoning complexity increases. While some models begin to exhibit traces of human-like reasoning, they often still rely on shortcut correlations rather than grounded visual understanding. DrVD-Bench offers a rigorous and structured evaluation framework to guide the development of clinically trustworthy VLMs.
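A minimal sketch of how the staged evaluation could be driven, assuming the hypothetical `DrVDItem` schema above and a generic `model.predict(image, question, options)` interface; the real benchmark's scoring, especially for report generation, is more involved than exact string matching.

```python
# Minimal evaluation sketch: accuracy per reasoning stage, making the drop from
# modality recognition to lesion identification to diagnosis directly visible.
# `model.predict` is a hypothetical stand-in for a real inference harness.
from collections import defaultdict

STAGES = ["modality", "lesion", "diagnosis"]  # ordered by reasoning complexity

def evaluate(model, items):
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model.predict(item.image_path, item.question, item.options)
        total[item.reasoning_stage] += 1
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.reasoning_stage] += 1
    # Per-stage accuracy; stages with no items score 0 rather than dividing by zero.
    return {s: correct[s] / max(total[s], 1) for s in STAGES}
```

Reporting accuracy per stage rather than a single aggregate is what exposes the sharp decline the paper observes as reasoning complexity grows.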
Problem

Research questions and friction points this paper is trying to address.

Evaluates whether VLMs reason like human doctors in medical image diagnosis
Assesses clinical visual reasoning across diverse medical imaging modalities
Identifies reliance on dataset shortcuts versus grounded visual understanding in VLMs (see the probe sketch below)
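One common way to probe shortcut reliance, not necessarily the paper's protocol, is an image-ablation check: re-score the same questions with the image withheld or blanked, and treat a small accuracy gap as evidence that the model answers from question text alone. A sketch, reusing the hypothetical `evaluate` helper above:

```python
# Hedged sketch of a text-shortcut probe (a common diagnostic, not necessarily
# the paper's protocol): if accuracy barely drops when the image is withheld,
# the model is answering from the question text, not the visual evidence.
from dataclasses import replace

def shortcut_gap(model, items, blank_image="blank.png"):  # blank_image is a placeholder path
    with_image = evaluate(model, items)  # `evaluate` from the sketch above
    blanked = [replace(it, image_path=blank_image) for it in items]
    without_image = evaluate(model, blanked)
    # Large positive gaps indicate genuinely image-grounded answers.
    return {s: with_image[s] - without_image[s] for s in STAGES}
```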
Innovation

Methods, ideas, or system contributions that make the work stand out.

First multimodal benchmark dedicated to clinical visual reasoning
Evaluates visual evidence comprehension, reasoning trajectory fidelity, and report generation
Covers 20 task types, 17 diagnostic categories, and five imaging modalities
👥 Authors
Tianhong Zhou (Tsinghua University)
Yin Xu (Beijing Jiaotong University)
Yingtao Zhu (Tsinghua University)
Chuxi Xiao (Tsinghua University)
Haiyang Bian (Tsinghua University)
Lei Wei (Tsinghua University)
Xuegong Zhang (Tsinghua University)