🤖 AI Summary
This work addresses the misalignment between existing deep learning models and clinical guidelines in Alzheimer's disease diagnosis, as well as the lack of traceability from model decisions to anatomical evidence. To bridge this gap, the authors propose EMAD, a vision–language framework that establishes fine-grained associations among diagnostic statements, multimodal evidence, and anatomical structures through a hierarchical Sentence–Evidence–Anatomy (SEA) grounding mechanism. The framework integrates GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision, and an Executable-Rule GRPO strategy for reinforcement fine-tuning grounded in verifiable clinical rules, thereby enhancing model transparency and clinical consistency. Evaluated on the AD-MultiSense dataset, the method achieves state-of-the-art diagnostic accuracy while generating structured reports that are anatomically faithful and evidentially traceable.
📝 Abstract
Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.
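The core idea of Executable-Rule GRPO, rewarding sampled reports by how many machine-checkable clinical rules they satisfy and computing group-relative advantages, can be sketched as below. The rule names, report fields, and diagnosis labels are illustrative assumptions, not the paper's actual implementation:

```python
import re

# Hypothetical structured report: a diagnosis label, free-text reasoning,
# and a list of cited evidence spans. Each rule is an executable predicate;
# the reward is the fraction of rules satisfied.

def rule_diagnosis_present(report: dict) -> bool:
    # Protocol adherence: the report must commit to one allowed label
    # (label set is an assumption for illustration).
    return report.get("diagnosis") in {"CN", "MCI", "AD"}

def rule_reasoning_coherence(report: dict) -> bool:
    # Reasoning-diagnosis coherence: an AD diagnosis should be supported
    # by at least one atrophy-related finding in the reasoning text.
    if report.get("diagnosis") != "AD":
        return True
    return bool(re.search(r"atrophy|volume loss", report.get("reasoning", ""), re.I))

def rule_evidence_cited(report: dict) -> bool:
    # Clinical consistency: the report must cite at least one evidence span.
    return len(report.get("evidence", [])) > 0

RULES = [rule_diagnosis_present, rule_reasoning_coherence, rule_evidence_cited]

def verifiable_reward(report: dict) -> float:
    """Scalar reward in [0, 1]: fraction of executable rules satisfied."""
    return sum(rule(report) for rule in RULES) / len(RULES)

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: reward minus the group mean
    (std normalization omitted for brevity)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

In a GRPO-style loop, several reports would be sampled per scan, scored with `verifiable_reward`, and the resulting advantages used to weight the policy-gradient update; because every rule is an executable check rather than a learned reward model, the signal is verifiable by construction.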