🤖 AI Summary
This work addresses unified multimodal representation learning in medical imaging, where existing foundation models typically handle 2D data (e.g., X-ray) and 3D data (e.g., CT) with separate, dimensionality-specific architectures. The authors propose MultiMedVision, a unified framework built on a Sparse Vision Transformer that processes mixed 2D/3D batches natively within a shared latent space, using 3D Rotary Positional Embeddings and variable-length sequence packing. This enables joint 2D/3D representation learning without modality-specific adapters and without treating 3D volumes as sequences of 2D slices. Trained with a self-supervised objective and a single shared encoder on 5x less data, the model reaches competitive Macro AUROC on MIMIC-CXR (0.82), CheXpert (0.84), and CT-RATE (0.85). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces.
📝 Abstract
Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g., X-ray) and 3D (e.g., CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters and without treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), MultiMedVision uses a single shared encoder and 5x less data, yet achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC-CXR, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.
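To make the idea of variable-length sequence packing with 3D positions concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give patch sizes, input sizes, or a coordinate convention, so everything below is assumed for illustration. In particular, the choice to place a 2D image at depth index z = 0, the function names `patch_grid` and `pack`, and the example shapes are all hypothetical. The sketch only shows how a 2D image and a 3D volume with different token counts could share one packed sequence, with per-token (z, y, x) coordinates that a 3D rotary embedding could consume and cumulative boundaries that keep the samples separate.

```python
from itertools import product


def patch_grid(shape, patch):
    """Return (z, y, x) patch coordinates for a 2D (H, W) or 3D (D, H, W) input."""
    if len(shape) == 2:
        # Assumption: a 2D image is treated as a single-slice volume at z = 0.
        shape = (1,) + tuple(shape)
        patch = (1,) + tuple(patch)
    # Number of non-overlapping patches along each axis.
    counts = [s // p for s, p in zip(shape, patch)]
    return list(product(*(range(c) for c in counts)))


def pack(samples):
    """Concatenate per-sample coordinates and record cumulative sequence boundaries."""
    coords, cu_seqlens = [], [0]
    for shape, patch in samples:
        grid = patch_grid(shape, patch)
        coords.extend(grid)
        cu_seqlens.append(cu_seqlens[-1] + len(grid))
    return coords, cu_seqlens


# Hypothetical mixed batch: one X-ray-like image and one CT-like volume.
batch = [
    ((224, 224), (16, 16)),         # 14 x 14 = 196 tokens
    ((32, 128, 128), (8, 16, 16)),  # 4 x 8 x 8 = 256 tokens
]
coords, cu_seqlens = pack(batch)
print(cu_seqlens)  # [0, 196, 452]
```

The boundary list lets an attention layer restrict each token to its own sample, so no padding is needed when a batch mixes short 2D sequences with longer 3D ones. How the actual model masks attention or sparsifies tokens is not stated in the abstract.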