🤖 AI Summary
This work addresses the limitations of existing deep learning approaches in longitudinal brain MRI analysis—namely, insufficient diagnostic grounding, poor interpretability, and hallucinatory predictions—which hinder their reliability in supporting cognitive prognosis for neurodegenerative diseases such as Alzheimer’s. To overcome these challenges, we propose a stepwise 3D vision-language model training framework that, for the first time, integrates regional brain volumetric quantification and longitudinal scan comparisons into a coherent reasoning chain. A clinician-weighted validator, requiring no manual annotations, drives direct preference optimization to enhance diagnostic trustworthiness and biological plausibility. Evaluated on the ADNI test set, our method achieves 93.7% three-class accuracy (a 34.8% improvement over baseline), 97.2% binary classification accuracy, and 82.6% regional anatomical classification accuracy, while demonstrating strong zero-shot transfer performance on the MIRIAD and AIBL datasets.
📝 Abstract
Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models that reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts a longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepwise pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucination. The training process introduces a clinically weighted Verifier that automatically scores candidate outputs against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% accuracy on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, demonstrating strong generalization across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
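The annotation-free preference-building step described above can be illustrated with a minimal sketch: a clinically weighted verifier scores each candidate report for label consistency and volumetric plausibility against normative references, and the higher-scoring candidate becomes the "chosen" response in a DPO preference pair. The region names, weights, tolerances, and scoring formula below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of verifier-driven DPO preference construction.
# Normative volumes and clinical weights are made-up placeholder values.
NORMATIVE_VOLUMES_ML = {"hippocampus": 3.5, "ventricles": 25.0, "whole_brain": 1100.0}
CLINICAL_WEIGHTS = {"hippocampus": 3.0, "ventricles": 1.5, "whole_brain": 1.0}

def verifier_score(candidate: dict, true_label: str) -> float:
    """Score a candidate report: label consistency plus clinically
    weighted plausibility of its claimed regional volumes."""
    score = 5.0 if candidate["label"] == true_label else 0.0
    for region, ref in NORMATIVE_VOLUMES_ML.items():
        claimed = candidate["volumes_ml"].get(region)
        if claimed is None:
            continue  # region not mentioned in this candidate
        rel_err = abs(claimed - ref) / ref
        # Reward volume claims within a 20% tolerance band,
        # weighted by the region's assumed clinical importance.
        score += CLINICAL_WEIGHTS[region] * max(0.0, 1.0 - rel_err / 0.2)
    return score

def build_preference_pair(cand_a: dict, cand_b: dict, true_label: str):
    """Return (chosen, rejected) for DPO training, picked by verifier score."""
    s_a = verifier_score(cand_a, true_label)
    s_b = verifier_score(cand_b, true_label)
    return (cand_a, cand_b) if s_a >= s_b else (cand_b, cand_a)
```

Because the preference signal comes entirely from automatically computed volume references, no human has to rank candidate reports, which is the point of the "without a single human annotation" claim.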