Evaluation of Vision Transformers for Multimodal Image Classification: A Case Study on Brain, Lung, and Kidney Tumors

📅 2025-02-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Assessing the generalization capability of Vision Transformers for multi-modal (MRI/CT) medical image classification across brain, lung, and kidney tumors remains challenging due to inter-organ anatomical variability and modality-specific artifacts. Method: We propose a cross-organ, cross-modal multi-label classification framework integrating transfer learning-based fine-tuning, dual-modality feature fusion, and systematic ablation studies. For the first time, we comparatively evaluate Swin Transformer and MaxViT under joint multi-modal training. Contribution/Results: Swin Transformer demonstrates superior modality robustness and cross-organ generalization, achieving 99.9% accuracy on kidney tumor classification and 99.3% on the combined multi-modal dataset. In contrast, MaxViT—while excelling in single-modality tasks—suffers significant performance degradation in modality-coupled settings. Our findings highlight the critical role of local inductive bias in modeling heterogeneous medical imaging sources, providing empirical evidence and methodological guidance for adapting Transformer architectures to clinical multi-modal diagnostic applications.

Technology Category

Application Category

📝 Abstract
Neural networks have become the standard technique for medical diagnostics, especially in cancer detection and classification. This work evaluates the performance of Vision Transformers architectures, including Swin Transformer and MaxViT, in several datasets of magnetic resonance imaging (MRI) and computed tomography (CT) scans. We used three training sets of images with brain, lung, and kidney tumors. Each dataset includes different classification labels, from brain gliomas and meningiomas to benign and malignant lung conditions and kidney anomalies such as cysts and cancers. This work aims to analyze the behavior of the neural networks in each dataset and the benefits of combining different image modalities and tumor classes. We designed several experiments by fine-tuning the models on combined and individual image modalities. The results revealed that the Swin Transformer provided high accuracy, achieving up to 99.9% for kidney tumor classification and 99.3% accuracy in a combined dataset. MaxViT also provided excellent results in individual datasets but performed poorly when data is combined. This research highlights the adaptability of Transformer-based models to various image modalities and features. However, challenges persist, including limited annotated data and interpretability issues. Future works will expand this study by incorporating other image modalities and enhancing diagnostic capabilities. Integrating these models across diverse datasets could mark a pivotal advance in precision medicine, paving the way for more efficient and comprehensive healthcare solutions.
Problem

Research questions and friction points this paper is trying to address.

Evaluate Vision Transformers for tumor classification.
Compare Swin Transformer and MaxViT on MRI/CT scans.
Analyze multimodal image benefits in medical diagnostics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates Vision Transformers for tumor classification
Combines MRI and CT scans for improved accuracy
Fine-tunes Swin Transformer and MaxViT models
🔎 Similar Papers
No similar papers found.
Ó
Óscar A. Martín
Centro de Tecnologías de la Imagen (CTIM), Instituto Universitario de Cibernética, Empresas y Sociedad (IUCES), University of Las Palmas de Gran Canaria, 35017 Las Palmas de Gran Canaria, Spain
Javier Sánchez
Javier Sánchez
University of Las Palmas de Gran Canaria
Computer VisionMachine LearningOptimization