🤖 AI Summary
To address the challenges of fusing heterogeneous multimodal medical data (MRI/PET imaging together with structured and unstructured clinical text) and of insufficient cross-modal semantic alignment, this paper proposes a Transformer-based dual-granularity alignment and fusion framework. It introduces vision-language tokenization with intramodal and cross-modal attention to unify representations across disparate modalities, and jointly optimizes self-supervised image/text restoration and cross-modal contrastive learning, aligning the modalities at both low-level (e.g., anatomical details) and high-level (e.g., diagnostic concepts) semantic granularities. Evaluated on five large-scale public Alzheimer’s disease datasets, the framework consistently outperforms eight state-of-the-art baselines on classification and prognosis-prediction tasks. Ablation studies confirm the contribution of each design component, and cross-dataset generalization tests demonstrate robustness across diverse clinical sites and acquisition protocols, supporting the framework’s effectiveness and broad applicability for computer-aided diagnosis of neurodegenerative disorders.
📝 Abstract
Medical data collected for diagnostic decisions are typically multimodal, providing comprehensive information on a subject. While computer-aided diagnosis systems can benefit from multimodal inputs, effectively fusing such data remains challenging and is a key focus of medical research. In this paper, we propose a transformer-based framework, called Alifuse, for aligning and fusing multimodal medical data. Specifically, we convert medical images and both unstructured and structured clinical records into vision and language tokens, and employ intramodal and intermodal attention mechanisms to learn unified representations of all imaging and non-imaging data for classification. Additionally, we combine restoration modeling with contrastive learning, jointly learning high-level semantic alignment between images and texts and low-level understanding of each modality with the help of the other. We apply Alifuse to classify Alzheimer’s disease, achieving state-of-the-art performance on five public datasets and outperforming eight baselines. The source code is available at https://github.com/Qybc/Alifuse.
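The mechanism described above can be illustrated with a small sketch: vision and language token streams each pass through intramodal self-attention, an intermodal (cross-)attention layer fuses them, and an InfoNCE-style contrastive loss aligns pooled image and text embeddings. This is a minimal PyTorch illustration under assumed dimensions, not the authors' Alifuse implementation; the names `CrossModalFusion` and `contrastive_loss` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch of intramodal + intermodal attention (hypothetical, not Alifuse itself)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.intra_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Intramodal self-attention within each token stream.
        v, _ = self.intra_img(img_tokens, img_tokens, img_tokens)
        t, _ = self.intra_txt(txt_tokens, txt_tokens, txt_tokens)
        # Intermodal attention: language queries attend to vision keys/values.
        fused, _ = self.inter(t, v, v)
        return v, t, fused

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning pooled image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))         # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```

In a full training loop, this contrastive term would be combined with the restoration (reconstruction) objective, so that cross-modal alignment and within-modality understanding are learned jointly.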