MultiMedVision: Multi-Modal Medical Vision Framework

📅 2026-05-09

🤖 AI Summary
This work addresses unified multimodal representation learning in medical imaging, where existing foundation models typically handle 2D (e.g., X-ray) and 3D (e.g., CT) data with separate, dimensionality-specific architectures. The authors propose MultiMedVision, a unified framework built on a Sparse Vision Transformer that processes mixed-dimensional image batches directly within a shared latent space, using 3D Rotary Positional Embeddings and variable-length sequence packing. This enables joint 2D/3D representation learning without modality-specific adapters and without decomposing 3D volumes into 2D slice sequences. Trained with a self-supervised objective, a single shared encoder, and 5x less data, the model achieves competitive Macro AUROC on MIMIC-CXR (0.82), CheXpert (0.84), and CT-RATE (0.85). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces.
📝 Abstract
Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals coexisting modality-specific and shared feature subspaces, demonstrating that unified cross-dimensional representation learning is feasible without sacrificing modality-specific performance.
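To make "variable-length sequence packing" of mixed 2D/3D batches concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the helper names `patch_coords` and `pack_samples`, the toy grid sizes, and the choice to treat a 2D image as a volume of depth 1 are all assumptions made for illustration. The idea shown is that samples with different token counts are concatenated into one sequence, and a block-diagonal mask keeps attention inside each sample.

```python
import numpy as np

def patch_coords(depth, height, width):
    """Return (N, 3) integer (z, y, x) coordinates for a patch grid.

    Assumption: a 2D image is represented as a volume with depth 1.
    """
    z, y, x = np.meshgrid(
        np.arange(depth), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([z.ravel(), y.ravel(), x.ravel()], axis=1)

def pack_samples(token_lists, coord_lists):
    """Concatenate variable-length token sequences into one packed sequence."""
    tokens = np.concatenate(token_lists, axis=0)
    coords = np.concatenate(coord_lists, axis=0)
    # Segment id records which sample each packed token came from.
    seg_ids = np.concatenate(
        [np.full(len(t), i) for i, t in enumerate(token_lists)]
    )
    # Block-diagonal mask: a token may attend only to tokens of its own sample.
    mask = seg_ids[:, None] == seg_ids[None, :]
    return tokens, coords, seg_ids, mask

rng = np.random.default_rng(0)
dim = 8                                  # hypothetical embedding width
xray_coords = patch_coords(1, 4, 4)      # toy "X-ray": 16 patch tokens
ct_coords = patch_coords(2, 4, 4)        # toy "CT": 32 patch tokens
xray_tokens = rng.normal(size=(len(xray_coords), dim))
ct_tokens = rng.normal(size=(len(ct_coords), dim))

tokens, coords, seg_ids, mask = pack_samples(
    [xray_tokens, ct_tokens], [xray_coords, ct_coords]
)
print(tokens.shape, mask.shape)
```

Packing avoids padding the 2D sample up to the length of the 3D one, and the shared (z, y, x) coordinate array is what a 3D positional scheme would consume for both modalities.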
Problem

Research questions and friction points this paper is trying to address.

multi-modal medical imaging
2D/3D representation learning
foundation models
medical vision
cross-dimensional learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Vision Transformer
3D Rotary Positional Embeddings
multi-modal medical imaging
unified representation learning
variable-length sequence packing
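
The "3D Rotary Positional Embeddings" item can be illustrated with one common axis-wise formulation of rotary embeddings; whether the paper uses exactly this variant is not stated here, so treat it as an assumption. In this sketch the channel dimension is split into three equal chunks, and each chunk is rotated by angles proportional to the z, y, or x patch coordinate. The function names `rope_1d` and `rope_3d` and the frequency base of 10000 are illustrative choices.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (N, d) by angles set by pos (N,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair frequencies
    angles = pos[:, None] * freqs[None, :]        # (N, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords, base=10000.0):
    """Axis-wise rotary embedding: one channel chunk each for z, y, x.

    Assumption: the channel count is divisible by 6 (three chunks of pairs).
    """
    assert x.shape[-1] % 6 == 0
    chunks = np.split(x, 3, axis=-1)
    rotated = [
        rope_1d(chunk, coords[:, axis].astype(float), base)
        for axis, chunk in enumerate(chunks)
    ]
    return np.concatenate(rotated, axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 12))
k = rng.normal(size=(1, 12))
pos_q = np.array([[0, 1, 2]])
pos_k = np.array([[1, 3, 2]])
shift = np.array([[2, 5, 7]])

# Attention scores depend only on the relative offset between positions.
score = (rope_3d(q, pos_q) @ rope_3d(k, pos_k).T).item()
score_shifted = (rope_3d(q, pos_q + shift) @ rope_3d(k, pos_k + shift).T).item()
print(np.isclose(score, score_shifted))
```

Under the depth-1 assumption for 2D images, every token has z = 0, so the z chunk is left unrotated and the same function serves X-ray and CT tokens alike.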