🤖 AI Summary
This study addresses the challenge of medical image representation learning under clinical annotation scarcity. We propose a tabular-data-guided contrastive learning framework that leverages structured clinical data, such as demographics and physiological metrics, to inform patient-level positive/negative pair selection for short-axis cardiac MRI, without requiring a joint embedding space or additional annotations. Our key contribution is the first use of tabular data as an external semantic signal that implicitly guides a unimodal image encoder toward clinically interpretable representations. Integrating multi-source clinical data from the UK Biobank with MRI-specific augmentation strategies, our method significantly outperforms both pure image-augmentation and joint-embedding baselines on cardiovascular disease classification and cardiac phenotyping tasks. Notably, the image encoder spontaneously captures critical clinical attributes, including age and sex, enhancing generalization and zero-shot inference in real-world clinical settings.
📝 Abstract
Contrastive learning methods in computer vision typically rely on different augmented views of the same image to form positive pairs. In medical imaging, however, we often want to compare entire patients with different phenotypes rather than multiple augmentations of a single scan. We propose harnessing clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework. Our method uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. We demonstrate its strength on short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps to distinguish patient subgroups more effectively. Evaluation on downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular diseases and cardiac phenotypes, shows that incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Furthermore, we show that image encoders trained with tabular guidance embed demographic information in their representations, allowing them to draw on insights from tabular data for unimodal predictions. This makes them well suited to real-world medical settings where extensive clinical annotations may not be routinely available at inference time. The code will be available on GitHub.
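The pairing idea described above can be sketched as follows: treat patients whose tabular attributes are similar as positives for one another, then train the image encoder alone with a supervised-contrastive-style loss over those tabular-derived positives. Below is a minimal NumPy sketch under assumed details: the distance-threshold rule on standardized attributes, the function names, and the temperature are illustrative, not the paper's exact formulation.

```python
import numpy as np

def tabular_positive_mask(tabular, threshold=1.0):
    """Mark patient pairs as positives when their standardized tabular
    attributes are close (Euclidean distance below `threshold`).
    This thresholding rule is an illustrative assumption."""
    z = (tabular - tabular.mean(axis=0)) / (tabular.std(axis=0) + 1e-8)
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    mask = (dist < threshold).astype(float)
    np.fill_diagonal(mask, 0.0)  # a patient is never its own positive
    return mask

def tabular_guided_contrastive_loss(embeddings, pos_mask, temperature=0.1):
    """Supervised-contrastive-style loss: for each anchor patient, pull its
    tabular-defined positives together and push other patients apart.
    Only image embeddings are involved; tabular data enters solely via `pos_mask`."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -1e9)  # exclude self-similarity from the softmax
    sim_max = sim.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
    log_prob = sim - (sim_max + np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True)))
    pos_counts = pos_mask.sum(axis=1)
    has_pos = pos_counts > 0  # anchors with no tabular positives are skipped
    per_anchor = -(pos_mask * log_prob).sum(axis=1)[has_pos] / pos_counts[has_pos]
    return per_anchor.mean()
```

In the full method, `embeddings` would come from the image encoder over a batch of patients and the loss would be minimized by gradient descent; because the tabular data only shapes the pairing mask during training, inference needs the image encoder alone.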