🤖 AI Summary
In real-world clinical MRI, variable contrast combinations cause substantial performance degradation in pretrained models under modality mismatch. To address this, we propose a modality-agnostic adaptive Vision Transformer architecture. Its core innovations include: (1) a dynamic tokenizer that adapts to input contrast subsets; (2) a variable-length cross-modality attention mechanism; (3) modality-aware positional embeddings; and (4) a dual-path pretraining framework integrating supervised and self-supervised learning. The method accepts arbitrary subsets of MR contrasts (e.g., T1, T2, FLAIR) per patient without requiring modality alignment or interpolation. Evaluated on ischemic stroke and brain tumor segmentation, it significantly improves zero-shot inference, few-shot fine-tuning, and reverse transfer capabilities. Even with 50% contrast missing, it retains over 92% of baseline performance, effectively overcoming the multimodal inconsistency bottleneck.
📝 Abstract
Pretraining techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects or cases. This creates challenges for deep learning models that assume consistent input modalities across all cases and between pretraining and finetuning. Existing methods struggle when the input modality/contrast set does not match that of the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case. We use a dynamic tokenizer to encode the different input image modalities into tokens, and exploit the transformer's ability to compute attention across a variable number of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transfer in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretraining, the proposed method can make use of all available pretraining data regardless of which contrasts each case contains, and facilitates transfer to diverse downstream tasks with variable sets of input modalities.
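
The following toy sketch illustrates the general idea of tokenizing whichever contrasts a case provides and running attention over the resulting variable-length token sequence. It is not the AdaViT implementation: the 1-D signals, the per-contrast random projections, the identity query/key/value projections, and all names (`EMBEDDERS`, `tokenize`, `self_attention`, `EMBED_DIM`, `PATCH_LEN`, the contrast list) are assumptions made only for illustration.

```python
import math
import random

EMBED_DIM = 8  # hypothetical token width
PATCH_LEN = 4  # hypothetical patch size (1-D "images" keep the toy small)


def make_embedder(seed):
    """Return a random PATCH_LEN x EMBED_DIM projection for one contrast."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
            for _ in range(PATCH_LEN)]


# One patch embedder per known contrast; the contrast names are illustrative.
EMBEDDERS = {name: make_embedder(i)
             for i, name in enumerate(["T1", "T2", "FLAIR", "DWI"])}


def tokenize(case):
    """Map {contrast_name: 1-D signal} to a list of tokens.

    The number of tokens depends on which contrasts the case contains.
    """
    tokens = []
    for name, signal in case.items():
        weights = EMBEDDERS[name]
        for start in range(0, len(signal) - PATCH_LEN + 1, PATCH_LEN):
            patch = signal[start:start + PATCH_LEN]
            tokens.append([
                sum(p * weights[i][d] for i, p in enumerate(patch))
                for d in range(EMBED_DIM)
            ])
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V.

    Works for any number of tokens, so no padding or imputation of
    missing contrasts is needed in this toy setting.
    """
    scale = 1.0 / math.sqrt(EMBED_DIM)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * v[d] for e, v in zip(exps, tokens))
            for d in range(EMBED_DIM)
        ])
    return out


signal = [0.1 * i for i in range(16)]  # 16 samples give 4 patches per contrast
full = self_attention(tokenize({"T1": signal, "T2": signal, "FLAIR": signal}))
partial = self_attention(tokenize({"T1": signal}))
print(len(full), len(partial))  # 12 tokens vs. 4 tokens through the same code
```

The same two functions handle a three-contrast case and a single-contrast case, which is the property the abstract relies on: the sequence length changes with the available contrasts, while the model code and weights do not.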