🤖 AI Summary
This work addresses a limitation of conventional Vision Transformers in data-scarce medical imaging, where high computational complexity and large parameter counts hinder performance. The authors propose an efficient Vision Transformer that tokenizes images by projecting them onto spectral basis functions, a choice of basis that provides spatial invariance and optimal signal-to-noise ratio. The spectral projection reduces model complexity and parameter count relative to spatial Vision Transformers while matching or improving performance. Across simulated, public, and clinical medical image datasets, the model performs comparably to or better than a range of baselines, including CNNs with attention, compact, standard, and shifted-window Vision Transformers, MLPs, and logistic regression, which suggests it is well suited to limited-data, resource-constrained medical applications.
📝 Abstract
We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis, including spatial invariance and optimal signal-to-noise ratio. We show that the spectral projection reduces complexity compared to spatial vision transformers. We show comparable or superior performance with fewer parameters than a variety of models, including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at github.com/agr78/spectralViT.
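The abstract does not name the spectral basis or describe the tokenizer, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a 2D discrete Fourier basis, keeps a hypothetical `k x k` block of low-frequency magnitudes, and treats each retained coefficient as one token. The function name `spectral_tokens` and all sizes are invented for illustration. It shows two things the abstract alludes to: Fourier magnitudes do not change under a circular shift of the image, and keeping few coefficients gives fewer tokens than a patch grid, which shrinks the quadratic cost of self-attention.

```python
import numpy as np


def spectral_tokens(image, k=4):
    """Return the k x k lowest-frequency DFT magnitudes as a flat token vector.

    Illustrative assumption only: the source does not specify this basis.
    """
    spectrum = np.fft.fft2(image, norm="ortho")
    # Move the zero frequency to the centre so low frequencies are contiguous.
    shifted = np.fft.fftshift(spectrum)
    cy, cx = image.shape[0] // 2, image.shape[1] // 2
    half = k // 2
    block = shifted[cy - half:cy - half + k, cx - half:cx - half + k]
    # Magnitudes discard phase, which is where a circular shift is encoded.
    return np.abs(block).reshape(-1)


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
tokens = spectral_tokens(image, k=4)

# A circularly shifted copy of the image yields the same magnitude tokens.
shifted_image = np.roll(image, shift=(3, 5), axis=(0, 1))
shift_invariant = np.allclose(tokens, spectral_tokens(shifted_image, k=4))

# Token counts: 4 x 4 pixel patches on a 32 x 32 image versus 16 coefficients.
patch_token_count = (32 // 4) * (32 // 4)
spectral_token_count = tokens.size
# Self-attention compares every token with every other token.
attention_pairs_patch = patch_token_count ** 2
attention_pairs_spectral = spectral_token_count ** 2

print(shift_invariant, patch_token_count, spectral_token_count)
print(attention_pairs_patch, attention_pairs_spectral)
```

In this toy setting the patch grid gives 64 tokens and the spectral projection gives 16, so the number of pairwise attention comparisons drops from 4096 to 256. The actual basis, token dimensions, and savings in the released code may differ.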