Shaken or Stirred? An Analysis of MetaFormer's Token Mixing for Medical Imaging

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Token mixers in MetaFormer architectures lack systematic comparative evaluation for medical imaging tasks. Method: This paper conducts the first comprehensive benchmark of three token mixer families—pooling, convolution (including grouped convolution), and attention—across eight cross-modality medical image datasets, assessing performance on both classification and segmentation tasks, and analyzing the efficacy of pretrained weight transfer. Results: Low-complexity mixers (e.g., pooling) achieve superior classification accuracy; convolutional mixers—particularly grouped convolutions—deliver optimal trade-offs between segmentation accuracy and computational efficiency; and pretrained weights transfer effectively across diverse mixer types, substantially enhancing generalization. This study provides empirical evidence and practical guidelines for modular MetaFormer design in medical imaging.

📝 Abstract
The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (a global prediction task) and semantic segmentation (a dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g., grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel-MLPs already provide the necessary cross-channel interactions. Our code is available on GitHub.
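The abstract's core idea, a MetaFormer block whose token mixer is a pluggable component while the channel-MLP handles cross-channel interactions, can be sketched in a few lines. This is a minimal NumPy illustration with a PoolFormer-style pooling mixer, not the authors' implementation; layer sizes and the kernel width `k` are arbitrary assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the channel dimension (last axis).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pooling_mixer(x, k=3):
    # PoolFormer-style token mixer: average pooling along the token axis,
    # minus identity (the block's residual connection adds x back).
    n, _ = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    pooled = np.stack([xp[i:i + n] for i in range(k)]).mean(axis=0)
    return pooled - x

def channel_mlp(x, w1, w2):
    # Two-layer MLP applied independently to each token:
    # this is where cross-channel interactions happen.
    return np.maximum(x @ w1, 0) @ w2

def metaformer_block(x, mixer, w1, w2):
    # Generic MetaFormer block: token mixing, then channel MLP,
    # each wrapped in pre-norm and a residual connection.
    x = x + mixer(layer_norm(x))
    x = x + channel_mlp(layer_norm(x), w1, w2)
    return x
```

Swapping `pooling_mixer` for a convolutional or attention-based mixer changes only the first sub-block, which is what makes the paper's systematic comparison possible.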
Problem

Research questions and friction points this paper is trying to address.

Analyzing token mixer effectiveness in medical imaging tasks
Comparing pooling, convolution, and attention mechanisms in MetaFormer
Evaluating pretrained weight transfer across different token mixers
Innovation

Methods, ideas, or system contributions that make the work stand out.

MetaFormer architecture with diverse token mixers
Pooling, convolution, attention token mixers for medical imaging
Grouped convolutions reduce parameters while maintaining performance
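The parameter savings from grouped convolutions follow directly from the weight-count formula: a k×k convolution with `groups` groups needs `c_out * (c_in / groups) * k * k` weights. A quick sketch (channel counts chosen for illustration):

```python
def conv_params(c_in, c_out, k, groups=1):
    # Weight count of a k×k 2-D convolution with `groups` groups (bias omitted).
    # Each output channel only sees c_in / groups input channels.
    assert c_in % groups == 0
    return c_out * (c_in // groups) * k * k

standard  = conv_params(64, 64, 3)            # 36,864 weights
grouped   = conv_params(64, 64, 3, groups=8)  #  4,608 weights (8x fewer)
depthwise = conv_params(64, 64, 3, groups=64) #    576 weights
```

The lost cross-channel mixing is not a problem in a MetaFormer block, since the channel-MLP already mixes channels.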
Ron Keuth
Medical Informatics, University of Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany.
Paul Kaftan
Medical Systems Biology, Ulm University, Albert-Einstein-Allee 11, 89081, Ulm, Germany.
Mattias P. Heinrich
University of Luebeck
Medical Image Analysis · Deep Machine Learning