🤖 AI Summary
This study addresses the performance degradation that occurs when CT-pretrained Transformers are transferred to MRI segmentation. The authors trace the degradation to two interacting failure modes: inefficient token usage caused by zero-padding inputs to match the pretrained input geometry, and ineffective feature adaptation across modalities. To analyze these failure modes, they introduce the Attention Dilution Index (ADI), an entropy-based metric that quantifies attention diverted toward padding tokens, and use centered kernel alignment (CKA) to measure feature reuse. To mitigate them, they propose tumor-aware data augmentation and anisotropic cropping. Fine-tuned on rectal MRI data, the approach improves tumor detection rates to 90.7% and 88.7% with the SMIT and Swin UNETR backbones, respectively.
📝 Abstract
Pretraining on large-scale datasets has been shown to improve transformer generalizability, even for out-of-domain (OOD) modalities and tasks. However, two common assumptions often fail under OOD transfer: that downstream datasets can be adapted to the fixed input geometry of pretrained models and that pretrained representations transfer effectively across imaging modalities. We show that these assumptions break down through two interacting failure modes in CT-to-MRI transfer: inefficient token usage caused by zero-padding to match pretrained input dimensions and ineffective feature adaptation. These failures led to accuracy degradation despite extensive fine-tuning. We investigated these failure modes using two CT-pretrained hierarchical shifted-window transformer backbones, SMIT and Swin UNETR, pretrained with different objectives and datasets. For mechanistic analysis, we introduced an attention dilution index (ADI), an entropy-based metric quantifying attention diverted toward uninformative padding tokens, and used centered kernel alignment (CKA) to measure feature reuse in MRI tasks. ADI increased with zero-padding, while high feature reuse did not necessarily correspond to improved accuracy. To mitigate these issues, we introduced two interventions: a tumor-aware augmentation strategy to improve tumor appearance heterogeneity coverage and an anisotropic cropping strategy to restore token efficiency. Fine-tuning on identical rectal MRI datasets improved detection rates to 224/247 (90.7%) for SMIT and 219/247 (88.7%) for Swin UNETR, demonstrating improved robustness under CT-to-MRI transfer. This study is among the first to examine when pretrained transformers fail to transfer effectively across imaging modalities and how simple mitigation strategies, motivated by mechanistic analysis of datasets, can reduce transfer limitations while improving robustness and MRI detection.
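To make the two diagnostics in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It shows (1) how many patch tokens end up containing only zero-padding when an anisotropic volume is padded to a fixed input size, and (2) the standard linear form of centered kernel alignment (CKA) between two feature matrices. The function names, the assumption that the real volume sits in one corner of the padded volume, and the example shapes are all hypothetical. ADI itself is not reproduced here because the abstract does not give its formula, and the paper may use a different CKA variant.

```python
import math


def padded_token_fraction(content_shape, padded_shape, patch_size):
    """Fraction of patch tokens that contain only zero-padding.

    Assumption (not from the paper): the real volume occupies one corner
    of the padded volume, and padded_shape is divisible by patch_size.
    """
    total = 1
    with_content = 1
    for c, p, s in zip(content_shape, padded_shape, patch_size):
        if p % s != 0 or c > p:
            raise ValueError("padded_shape must be divisible by patch_size "
                             "and at least as large as content_shape")
        total *= p // s                    # tokens along this axis
        with_content *= math.ceil(c / s)   # tokens touching real voxels
    return 1.0 - with_content / total


def _center_columns(m):
    """Subtract the per-feature (column) mean from a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p), b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between feature matrices with the same number of rows.

    Rows are samples and columns are features. Undefined (division by
    zero) if either matrix has no variance after centering.
    """
    xc, yc = _center_columns(x), _center_columns(y)
    numerator = _cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(_cross_frobenius_sq(xc, xc)
                            * _cross_frobenius_sq(yc, yc))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical shapes: a 96x96x32 volume padded to 96x96x96, patch size 2.
    print(padded_token_fraction((96, 96, 32), (96, 96, 96), (2, 2, 2)))
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
    print(linear_cka(feats, feats))
```

In this hypothetical example, two thirds of the tokens carry no image content, which illustrates why cropping to the data's native anisotropic shape can restore token efficiency. Linear CKA ranges from 0 to 1 and is invariant to orthogonal transforms and isotropic scaling of the features, so a high value indicates that representations are similar, not that they are useful for the downstream task.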