AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

📅 2025-04-04

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

In real-world clinical MRI, variable contrast combinations cause substantial performance degradation in pretrained models under modality mismatch. To address this, we propose a modality-agnostic adaptive Vision Transformer architecture. Its core innovations include: (1) a dynamic tokenizer that adapts to input contrast subsets; (2) a variable-length cross-modality attention mechanism; (3) modality-aware positional embeddings; and (4) a dual-path pretraining framework integrating supervised and self-supervised learning. The method accepts arbitrary subsets of MR contrasts (e.g., T1, T2, FLAIR) per patient without requiring modality alignment or interpolation. Evaluated on ischemic stroke and brain tumor segmentation, it significantly improves zero-shot inference, few-shot fine-tuning, and reverse transfer capabilities. Even with 50% contrast missing, it retains over 92% of baseline performance, effectively overcoming the multimodal inconsistency bottleneck.

Technology Category

Application Category

📝 Abstract

Pretrain techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models assuming consistent input modalities among all the cases and between pretrain and finetune. Existing methods struggle to maintain performance when there is an input modality/contrast set mismatch with the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling variable set of input modalities for each case. We utilize a dynamic tokenizer to encode different input image modalities to tokens and take advantage of the characteristics of the transformer to build attention mechanism across variable length of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretrain, the proposed method is able to maximize the pretrain data and facilitate transferring to diverse downstream tasks with variable sets of input modalities.

Problem

Research questions and friction points this paper is trying to address.

Handles variable input modalities in 3D medical imaging

Addresses pretrain-finetune modality mismatch for better accuracy

Enables flexible transfer learning across diverse medical tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Vision Transformer for variable input modalities

Dynamic tokenizer encodes different image modalities

Attention mechanism handles variable token lengths

🔎 Similar Papers

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation