🤖 AI Summary
This work addresses the limitations of conventional Vision Transformers (ViTs) in medical imaging, where positional encodings and the [CLS] token impair generalization due to weak or inconsistent spatial priors—particularly under resource-constrained conditions. The authors propose ZACH-ViT, a compact ViT architecture that, for the first time in small-scale ViTs, entirely eliminates positional encodings and the [CLS] token. Instead, it employs global average pooling to achieve permutation invariance and introduces adaptive residual projections to ensure training stability. By aligning its inductive bias with the structural characteristics of medical images, ZACH-ViT demonstrates strong performance across seven MedMNIST datasets: with only 0.25M parameters and trained from scratch, it achieves its strongest results on BloodMNIST, remains competitive with TransMIL on PathMNIST, and maintains inference times under one second, making it suitable for edge-based clinical deployment.
📝 Abstract
Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at https://github.com/Bluesman79/ZACH-ViT.
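The "zero-token" readout described in the abstract — dropping the [CLS] token and positional embeddings, then averaging over patch representations — can be sketched with a toy example. The snippet below is an illustrative sketch in NumPy, not the authors' implementation; the `gap_readout` name and tensor shapes are assumptions. It demonstrates the key property the abstract claims: global average pooling over patch tokens is invariant to the order of the patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy patch embeddings of shape (num_patches, embed_dim), standing in for
# the output of a ViT encoder that uses no [CLS] token and no positional
# embeddings (shapes are illustrative, not from the paper).
patches = rng.normal(size=(16, 32))

def gap_readout(tokens: np.ndarray) -> np.ndarray:
    """Zero-token readout: global average pooling over patch tokens."""
    return tokens.mean(axis=0)

# Shuffling the patch order leaves the pooled representation unchanged,
# i.e. the readout is permutation invariant.
perm = rng.permutation(len(patches))
pooled_original = gap_readout(patches)
pooled_shuffled = gap_readout(patches[perm])

assert np.allclose(pooled_original, pooled_shuffled)
```

With a [CLS] token and learned positional embeddings, the same permutation would generally change the output, since each patch would carry a position-dependent offset; removing both is what makes the pooled feature order-agnostic.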