Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two limitations of existing ischemic stroke outcome prediction methods: they are largely confined to bimodal fusion, so they do not jointly integrate medical imaging, structured clinical data, and unstructured text, and they lack deep inter-modal interaction. To overcome these limitations, the work proposes a trimodal deep fusion framework. It uses a large language model (LLM) to automatically generate semi-structured diagnostic reports from brain MRI scans and introduces a Vision-Conditioned Dual Alignment Fusion Module (VDAFM) that treats visual features as a conditional prior to enable fine-grained, dynamic interaction between the imaging and textual modalities. By combining contrastive learning with a dual semantic alignment loss, the framework mitigates modality heterogeneity, and the authors report state-of-the-art prediction performance on a real-world clinical dataset.

📝 Abstract

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic, deep fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.
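
The abstract does not spell out the form of the dual semantic alignment loss. As a rough illustration only, one common way to align image and text embeddings is a symmetric, CLIP-style contrastive (InfoNCE) objective, sketched below in NumPy. The function names, the temperature value, and the choice of a symmetric InfoNCE loss are all assumptions made for illustration, not the authors' VDAFM implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Hypothetical alignment loss: symmetric InfoNCE over a batch.

    Row i of image_emb and row i of text_emb are assumed to come from the
    same patient (a positive pair); every other pairing in the batch is
    treated as a negative. The temperature of 0.07 is an assumed default.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(img.shape[0])             # positives lie on the diagonal
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, the loss is small when each image embedding is most similar to its own report embedding and grows when pairs are mismatched, which is the behavior an alignment term is meant to encourage.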

Problem

Research questions and friction points this paper is trying to address.

multi-modal fusion

ischemic stroke prognosis

medical image

clinical data

unstructured text

Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-modal fusion

vision-conditioned alignment

large language model