Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Lightweight backbone networks lack systematic evaluation in few-shot cross-domain vision tasks. Method: Under a unified training protocol, this work conducts the first cross-domain, cross-data-scale comparative study—spanning natural (ImageNet), medical (CheXpert), astronomical (SDSS), and remote sensing (EuroSAT) domains—evaluating ConvNeXt, RegNet, EfficientNet, and ViT architectures under low-data fine-tuning regimes. Results: CNN-based backbones—particularly ConvNeXt and RegNet—consistently outperform ViT variants under limited data, demonstrating superior generalization and cross-domain robustness. Ablation analysis uncovers principled links between architectural design choices (e.g., locality bias, inductive priors) and domain-specific data characteristics (e.g., spatial coherence, label scarcity). The study releases an open-source benchmarking framework and standardized results, providing empirically grounded guidance for backbone selection in resource-constrained settings.

📝 Abstract
In contemporary computer vision applications, particularly image classification, architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors. Despite the widespread use of these pre-trained convolutional neural networks (CNNs), there remains a gap in understanding the performance of various resource-efficient backbones across diverse domains and dataset sizes. Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings across a variety of datasets, including natural images, medical images, galaxy images, and remote sensing images. This comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem, especially in scenarios involving small datasets where fine-tuning a pre-trained network is crucial. Even though attention-based architectures are gaining popularity, we observed that they tend to perform poorly in low-data fine-tuning tasks compared to CNNs. We also observed that some CNN architectures, such as ConvNeXt, RegNet, and EfficientNet, consistently perform well across a diverse set of domains compared to others. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones, facilitating informed decision-making in model selection for a broad spectrum of computer vision domains. Our code is available here: https://github.com/pranavphoenix/Backbones
Problem

Research questions and friction points this paper is trying to address.

Which resource-efficient pre-trained backbones generalize best across diverse visual domains?
How do lightweight backbones compare when fine-tuned on small datasets?
Which backbones are best suited to specific computer vision tasks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic comparison of lightweight pre-trained backbones under a unified training protocol
Evaluation spanning natural, medical, astronomical, and remote sensing imagery
Open-source benchmarking framework with standardized results for backbone selection