🤖 AI Summary
Existing medical vision-language pretraining models struggle to capture the complex one-to-many and many-to-many semantic associations between medical images and clinical text. To address this, we propose the first probabilistic contrastive learning framework for medical multimodal pretraining, wherein embeddings are modeled as Gaussian distributions parameterized by mean and variance. Our method introduces an improved InfoNCE loss based on the Hellinger distance and a probabilistic compositional sampling strategy, enabling unified alignment of X-ray, electrocardiogram, echocardiogram, and clinical text modalities within a shared probabilistic embedding space. Evaluated on 13 benchmark datasets, our approach achieves significant improvements in cross-modal retrieval, zero-shot and few-shot classification, and prognostic prediction. Results demonstrate that probabilistic embedding representations enhance both the effectiveness and robustness of multimodal analysis in clinical settings.
📄 Abstract
Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multimodal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than deterministic point estimates. ProbMED aligns four distinct modalities (chest X-rays, electrocardiograms, echocardiograms, and clinical text) into a unified probabilistic embedding space. We use an InfoNCE loss with the Hellinger distance to integrate inter-modality distributions, and we introduce a probabilistic synthetic sampling loss that captures modality-specific means and variances to improve intra-modality binding. Extensive experiments across 13 medical datasets demonstrate that our model outperforms current Med-VLPMs in cross-modal retrieval and in zero-shot and few-shot classification. We also demonstrate the robust integration of multiple modalities for prognostication, showing improved intra- and inter-medical modality binding.
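The abstract's core mechanism, an InfoNCE loss whose similarity is derived from the Hellinger distance between Gaussian embedding distributions, can be sketched in a minimal form. This is an illustrative reconstruction, not the paper's implementation: it assumes diagonal covariances, and the function names and temperature value are our own choices.

```python
import numpy as np

def hellinger_sq(mu1, var1, mu2, var2):
    """Squared Hellinger distance between two diagonal Gaussians.

    mu*, var* have shape (d,). Returns a scalar in [0, 1];
    0 iff the two distributions are identical."""
    # Bhattacharyya coefficient for diagonal Gaussians (product over dims)
    coef = np.sqrt(2.0 * np.sqrt(var1 * var2) / (var1 + var2))
    bc = np.prod(coef * np.exp(-((mu1 - mu2) ** 2) / (4.0 * (var1 + var2))))
    return 1.0 - bc

def prob_infonce(mu_a, var_a, mu_b, var_b, tau=0.07):
    """Symmetric InfoNCE over a batch of probabilistic embeddings,
    using similarity = -H^2 / tau (tau is an assumed temperature).

    mu_*, var_*: arrays of shape (n, d); matched pairs share an index."""
    n = mu_a.shape[0]
    sim = np.array([[-hellinger_sq(mu_a[i], var_a[i], mu_b[j], var_b[j]) / tau
                     for j in range(n)] for i in range(n)])
    # cross-entropy with the matched pair on the diagonal, both directions
    def xent(logits):
        log_z = np.log(np.exp(logits).sum(axis=1))
        return np.mean(log_z - np.diag(logits))
    return 0.5 * (xent(sim) + xent(sim.T))
```

In this sketch, lower Hellinger distance between a matched pair (e.g. an X-ray's distribution and its report's distribution) yields a higher logit, so minimizing the loss pulls matched distributions together while pushing mismatched ones apart, while the variances let one embedding remain close to several plausible partners.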