VariViT: A Vision Transformer for Variable Image Sizes

📅 2026-02-16
🏛️ International Conference on Medical Imaging with Deep Learning
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional Vision Transformers in medical imaging: a fixed input size forces resizing, padding, or cropping, which can discard information and copes poorly with irregularly shaped structures such as tumors. The authors propose VariViT, which accepts variable-sized inputs while keeping the patch size fixed. It relies on a positional embedding resizing scheme for a variable number of patches and on a batching strategy that lowers computational cost. On two 3D brain MRI datasets, VariViT achieves F1-scores of 75.5% for glioma genotype prediction and 76.3% for brain tumor classification, outperforming vanilla ViTs and ResNet, and the batching strategy cuts computation time by up to 30% compared to conventional architectures.
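The summary does not spell out how the batching strategy works. One plausible reading, sketched below purely as an assumption, is that samples are grouped by image shape so that every batch is uniform and needs no padding. The function name `batches_by_shape` and the example shapes are hypothetical and are not taken from the VariViT code base.

```python
from collections import defaultdict


def batches_by_shape(shapes, batch_size):
    """Group sample indices so that each batch contains a single image shape.

    Assumption: uniform-shape batches stand in for the paper's batching
    strategy; the actual VariViT implementation may differ.
    """
    buckets = defaultdict(list)
    for index, shape in enumerate(shapes):
        buckets[tuple(shape)].append(index)

    batches = []
    for indices in buckets.values():
        # Split each same-shape bucket into chunks of at most batch_size.
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches


# Hypothetical 3D crop sizes (depth, height, width) for five scans.
example_shapes = [(64, 64, 64), (96, 96, 64), (64, 64, 64),
                  (64, 64, 64), (96, 96, 64)]
example_batches = batches_by_shape(example_shapes, batch_size=2)
```

With uniform batches, each tensor can be stacked directly, so no compute is spent on padded background voxels; the trade-off is that rare shapes produce small, partially filled batches.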

📝 Abstract
Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
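The abstract does not describe the positional embedding resizing scheme in detail. As a hedged illustration only, the sketch below assumes that embeddings are stored for the largest supported patch grid and that a smaller input reuses the centered sub-grid of that table. The helper names `patch_grid` and `centered_position_indices`, the patch size of 16, and the grid sizes are all hypothetical; the published method may instead interpolate or select positions differently.

```python
from itertools import product


def patch_grid(image_shape, patch_size):
    """Number of patches along each axis when the patch size is fixed."""
    if any(size % patch_size for size in image_shape):
        raise ValueError("each image dimension must be a multiple of patch_size")
    return tuple(size // patch_size for size in image_shape)


def centered_position_indices(max_grid, grid):
    """Flat row indices into a positional table laid out for max_grid.

    Assumption: the table has one row per patch of the largest 3D grid,
    flattened in (depth, height, width) order, and a smaller grid reuses
    the rows of the centered sub-grid.
    """
    if any(g > m for g, m in zip(grid, max_grid)):
        raise ValueError("grid exceeds the largest supported grid")
    starts = [(m - g) // 2 for g, m in zip(grid, max_grid)]
    ranges = [range(s, s + g) for s, g in zip(starts, grid)]
    _, max_h, max_w = max_grid
    return [(d * max_h + h) * max_w + w for d, h, w in product(*ranges)]


# A 64 x 96 x 80 crop with 16-voxel patches yields a 4 x 6 x 5 patch grid.
example_grid = patch_grid((64, 96, 80), patch_size=16)
# Rows of a 4 x 4 x 4 table that a centered 2 x 2 x 2 grid would reuse.
example_indices = centered_position_indices((4, 4, 4), (2, 2, 2))
```

Because the patch size stays constant, only the number of tokens changes with the input size, so the selected rows can be added to the patch tokens without resizing the image itself.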
Problem

Research questions and friction points this paper is trying to address.

Vision Transformer
variable image sizes
medical imaging
patch-based representation
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
variable image size
positional embedding resizing
efficient batching
medical image analysis
Aswathi Varma
Department of Neuroradiology, Technical University of Munich
Suprosanna Shit
University of Zurich | ETH AI Center
Machine Learning, Medical Imaging, Computer Vision, Signal Processing
Chinmay Prabhakar
Department of Quantitative Biomedicine, University of Zurich
Daniel Scholz
Department of Neuroradiology, Technical University of Munich
Hongwei Bran Li
Martinos Center, MGH, Harvard Medical School
Medical Image Analysis, ML
Bjoern Menze
Universität Zürich
Biomedical Image Analysis, Medical Image Analysis, Medical Image Computing, Machine Learning
Daniel Rueckert
Technical University of Munich and Imperial College London
Machine Learning, Medical Image Computing, Biomedical Image Analysis, Computer Vision
Benedikt Wiestler
Department of Neuroradiology, Technical University of Munich