AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

📅 2025-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In real-world clinical MRI, variable contrast combinations cause substantial performance degradation in pretrained models under modality mismatch. To address this, we propose a modality-agnostic adaptive Vision Transformer architecture. Its core innovations include: (1) a dynamic tokenizer that adapts to input contrast subsets; (2) a variable-length cross-modality attention mechanism; (3) modality-aware positional embeddings; and (4) a dual-path pretraining framework integrating supervised and self-supervised learning. The method accepts arbitrary subsets of MR contrasts (e.g., T1, T2, FLAIR) per patient without requiring modality alignment or interpolation. Evaluated on ischemic stroke and brain tumor segmentation, it significantly improves zero-shot inference, few-shot fine-tuning, and reverse transfer capabilities. Even with 50% contrast missing, it retains over 92% of baseline performance, effectively overcoming the multimodal inconsistency bottleneck.

Technology Category

Application Category

📝 Abstract
Pretrain techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models assuming consistent input modalities among all the cases and between pretrain and finetune. Existing methods struggle to maintain performance when there is an input modality/contrast set mismatch with the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling variable set of input modalities for each case. We utilize a dynamic tokenizer to encode different input image modalities to tokens and take advantage of the characteristics of the transformer to build attention mechanism across variable length of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretrain, the proposed method is able to maximize the pretrain data and facilitate transferring to diverse downstream tasks with variable sets of input modalities.
Problem

Research questions and friction points this paper is trying to address.

Handles variable input modalities in 3D medical imaging
Addresses pretrain-finetune modality mismatch for better accuracy
Enables flexible transfer learning across diverse medical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Vision Transformer for variable input modalities
Dynamic tokenizer encodes different image modalities
Attention mechanism handles variable token lengths
🔎 Similar Papers
No similar papers found.
B
B. K. Das
Siemens Healthineers AG
G
Gengyan Zhao
Siemens Medical Solutions USA, Inc.
H
Han Liu
Siemens Medical Solutions USA, Inc.
T
Thomas J. Re
Siemens Medical Solutions USA, Inc.
D
D. Comaniciu
Siemens Medical Solutions USA, Inc.
E
E. Gibson
Siemens Medical Solutions USA, Inc.
Andreas K. Maier
Andreas K. Maier
Friedrich-Alexander-Universität Erlangen-Nürnberg
pattern recognitionmachine learningspeech processingmedical speech processingimage reconstruction