🤖 AI Summary
This study addresses three key challenges in clinical multimodal prediction: insufficient modality fusion, performance degradation as more modalities are added, and lack of fairness. We propose the first systematic contrastive learning framework covering all five data modalities in MIMIC-IV—clinical text, medical images, time-series physiological signals, structured clinical variables, and demographic information. Methodologically, we introduce a Modality-Gated LSTM to mitigate cross-modal interference, incorporate importance-scoring–based contrastive learning to enhance interpretability, and employ subgroup-wise generalization evaluation to ensure fairness. On in-hospital mortality and phenotype prediction tasks, the five-modality model with gating achieves significant gains: AUROC of 76.93% (+3.74 points) and AUPRC of 62.26% (+10.99 points), approaching supervised baselines. Ablations across all 26 modality combinations provide reproducible guidance for clinical multimodal model selection and training-strategy design.
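The gating idea can be sketched as follows. This is a minimal illustration assuming scalar per-modality importance scores normalized by a softmax; the function names (`gate_modalities`, `lstm_step`) and the scalar-gate formulation are ours for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gate_modalities(embeddings, importance):
    """Scale each modality embedding by a gate derived from its
    (contrastively learned) importance score, returning the gated
    sequence, one vector per modality, ready to feed into an LSTM."""
    gates = softmax(np.asarray(importance, dtype=float))
    gated = [g * np.asarray(z, dtype=float) for g, z in zip(gates, embeddings)]
    return gated, gates

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell; gate pre-activations are
    stacked in z as (input, forget, output, candidate)."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(z[:d]), sig(z[d:2 * d]), sig(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Running the gated modality vectors through `lstm_step` in sequence yields a fused patient representation; the intuition is that a weak or noisy modality receives a small gate and so contributes less, which is what lets the gated model recover the five-modality performance drop.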
📝 Abstract
Multimodal deep learning holds promise for improving clinical prediction by integrating diverse patient data, including text, imaging, time-series, and structured demographics. Contrastive learning facilitates this integration by producing a unified representation that can be reused across tasks, reducing the need for separate models or encoders. Although contrastive learning has seen success in vision-language domains, its use in clinical settings remains largely limited to image and text pairs. We propose the Pipeline for Contrastive Modality Evaluation and Encoding (PiCME), which systematically assesses five clinical data types from MIMIC: discharge summaries, radiology reports, chest X-rays, demographics, and time-series. We pre-train contrastive models on all 26 combinations of two to five modalities and evaluate their utility on in-hospital mortality and phenotype prediction. To address performance plateaus with more modalities, we introduce a Modality-Gated LSTM that weights each modality according to its contrastively learned importance. Our results show that contrastive models remain competitive with supervised baselines, particularly in three-modality settings. Performance declines beyond three modalities, a drop that supervised models also fail to recover. The Modality-Gated LSTM mitigates this drop, improving AUROC from 73.19% to 76.93% and AUPRC from 51.27% to 62.26% in the five-modality setting. We also compare contrastively learned modality importance scores with attribution scores and evaluate generalization across demographic subgroups, highlighting strengths in interpretability and fairness. PiCME is the first to scale contrastive learning across all modality combinations in MIMIC, offering guidance for modality selection, training strategies, and equitable clinical prediction.
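As a concrete illustration of the contrastive pre-training objective, the sketch below implements a symmetric InfoNCE loss between the outputs of two modality encoders for a batch of paired patients. This is our minimal reading of the setup, assuming L2-normalized embeddings and a fixed temperature; it is not the paper's code, and for combinations of more than two modalities one would average the loss over all modality pairs.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two modality batches: row i of z_a
    and row i of z_b come from the same patient (the positive pair),
    all other rows in the batch serve as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature   # (N, N) scaled cosine similarities
    targets = np.arange(len(z_a))        # positives sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[targets, targets].mean()

    # average the A->B and B->A retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

For a k-modality combination, averaging `info_nce` over all k(k-1)/2 encoder pairs yields a single pre-training objective, which is one natural way to scale a pairwise contrastive loss to the 26 modality combinations studied here.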