Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

📅 2026-05-11

🤖 AI Summary
Existing medical vision-language models are prone to hallucinations in multi-step diagnostic reasoning that evade detection by current benchmarks. To address this, this work proposes the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, decomposing clinical diagnosis into four expert-designed stages. Built on over 12,000 images and more than 1,000,000 image-statement pairs with clinician-verified annotations, it enables step-level assessment of both general-purpose and medical-specific models. The benchmark uncovers systematic errors masked by aggregate accuracy metrics and shows that models are highly susceptible to adversarial yet clinically plausible intermediate explanations, which amplify hallucinations even when the visual evidence contradicts them. It thereby offers a rigorous foundation for developing safer and more reliable medical vision-language models.
📝 Abstract
Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.
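To make the contrast between step-level and aggregate scoring concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each record is a `(case_id, stage, label, pred)` tuple for one image-statement pair; the stage names `stage_1` to `stage_4` are hypothetical placeholders, since the abstract states only that there are four expert-designed stages. The toy data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical stage identifiers; the abstract does not name the four stages.
STAGES = ["stage_1", "stage_2", "stage_3", "stage_4"]


def per_stage_accuracy(records):
    """Accuracy of true/false statement judgements, computed per stage."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _case_id, stage, label, pred in records:
        totals[stage] += 1
        hits[stage] += int(label == pred)
    return {s: hits[s] / totals[s] for s in STAGES if totals[s]}


def case_level_accuracy(records, stages):
    """Fraction of cases whose judgements are all correct on the given stages."""
    ok = {}
    for case_id, stage, label, pred in records:
        if stage in stages:
            ok[case_id] = ok.get(case_id, True) and (label == pred)
    if not ok:
        return 0.0
    return sum(ok.values()) / len(ok)


# Invented toy data: case B gets the final stage right despite early-stage errors.
toy = [
    ("A", "stage_1", True, True),
    ("A", "stage_2", False, False),
    ("A", "stage_3", True, True),
    ("A", "stage_4", True, True),
    ("B", "stage_1", True, False),
    ("B", "stage_2", True, False),
    ("B", "stage_3", False, False),
    ("B", "stage_4", True, True),
]

final_only = case_level_accuracy(toy, {"stage_4"})  # looks only at the last stage
all_steps = case_level_accuracy(toy, set(STAGES))   # requires every stage correct
stage_scores = per_stage_accuracy(toy)
```

In this toy setting the final-stage score is perfect while the all-steps score is not, which is the kind of gap a step-level evaluation is meant to expose.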
Problem

Research questions and friction points this paper is trying to address.

medical hallucination
vision-language models
clinical reasoning
PET/CT
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-wise hallucination detection
medical vision-language models
3D PET/CT
hierarchical clinical reasoning
adversarial intermediate explanations
Minh Khoi Nguyen
AI4LIFE, Hanoi University of Science and Technology, Vietnam
Dai Lam Le
AI4LIFE, Hanoi University of Science and Technology, Vietnam
Amir Reza Jafari
SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France
Tuan Dung Nguyen
University of Pennsylvania
Computational Social Science, AI For Science
Mai Hong Son
108 Military Central Hospital, Vietnam
Mai Huy Thong
108 Military Central Hospital, Vietnam
Quang Huy Nguyen
Hanoi Medical University
Thanh Trung Nguyen
Le Quy Don Technical University, Viet Nam
blockchain, end-to-end encryption, nosql, key-value, big data
Reza Farahbakhsh
PhD, Lead Data Scientist at TotalEnergies, Adjunct Associate Professor at IP-Paris SudParis
NLP/U/G, Language Modelling, Social Networks, IoT, Data Science
Noel Crespi
Professor @ Telecom SudParis, Institut Mines-Telecom, Institut Polytechnique de Paris
Edge Intelligence, IoT, Digital Twin, Artificial Intelligence, NLP
Phi Le Nguyen
AI4LIFE, Hanoi University of Science and Technology, Vietnam