Beyond ImageNet: Understanding Cross-Dataset Robustness of Lightweight Vision Models

📅 2025-10-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
The high ImageNet accuracy of lightweight vision models (e.g., MobileNet, ShuffleNet, EfficientNet) has not been systematically validated for cross-domain robustness. Method: We train and evaluate 11 state-of-the-art lightweight models under a unified protocol across seven heterogeneous datasets to assess generalization beyond ImageNet. Because ImageNet accuracy correlates only weakly with cross-domain performance, we propose xScore, a lightweight, interpretable cross-dataset metric that needs results on only four source datasets to predict a model's generalization capability. We further analyze the architectural factors that influence transferability. Contribution/Results: We establish the first reproducible benchmark for evaluating the cross-domain robustness of lightweight models. Empirical analysis reveals that isotropic convolutions and channel-wise attention enhance transferability, whereas Transformer modules yield limited gains under resource constraints. These findings offer empirical guidance for designing efficient mobile vision architectures.
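The summary rests on the observation that ImageNet accuracy is a weak predictor of cross-domain accuracy. One standard way to quantify whether two benchmarks order the same set of models consistently is Spearman's rank correlation. The sketch below is a generic, dependency-free implementation (tied values receive average ranks); it is not the authors' analysis code, the function names are illustrative, and the accuracies in the usage lines are hypothetical placeholders rather than results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical top-1 accuracies (%) of four models; placeholders, not paper results.
imagenet_acc = [72.0, 75.3, 77.1, 70.6]
medical_acc = [81.2, 79.8, 80.5, 82.0]
print(round(spearman_rho(imagenet_acc, medical_acc), 2))  # prints -0.8
```

A value near +1 would mean the two benchmarks order the models almost identically; values near 0 or below indicate that the ImageNet ranking says little about the other domain.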

πŸ“ Abstract
Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been predominantly benchmarked on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components, such as isotropic convolutions with higher spatial resolution and channel-wise attention, promote broader generalization, while Transformer-based blocks yield little additional benefit despite incurring higher parameter overhead. This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.
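The abstract describes xScore only at a high level (one cross-dataset score that can be estimated from four datasets) and does not state its formula. As a purely illustrative stand-in, the sketch below standardizes accuracy within each dataset, averages those z-scores per model, and then checks how closely a score computed from a four-dataset subset tracks the score computed from all datasets. The aggregation rule, the function names (`aggregate_score`, `zscores`, `pearson`), the dataset keys, and the numbers are all hypothetical assumptions, not the authors' definition or data.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit population std (all 0 if constant)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def aggregate_score(acc, datasets):
    """HYPOTHETICAL cross-dataset aggregate, not the paper's xScore formula.

    acc maps a dataset name to a list of accuracies, one entry per model,
    with models in the same order for every dataset. Each dataset is
    standardized across models, then each model's z-scores are averaged.
    """
    per_dataset = [zscores(acc[d]) for d in datasets]
    n_models = len(per_dataset[0])
    return [mean(col[m] for col in per_dataset) for m in range(n_models)]


def pearson(a, b):
    """Pearson correlation as the mean product of population z-scores.

    Returns 0.0 if either input is constant (its z-scores are all zero).
    """
    return mean(x * y for x, y in zip(zscores(a), zscores(b)))


# Hypothetical accuracies (%) for three models on five placeholder datasets.
acc = {
    "ds_a": [70.1, 72.4, 75.0],
    "ds_b": [88.0, 86.5, 90.2],
    "ds_c": [60.3, 65.1, 63.9],
    "ds_d": [91.2, 92.0, 93.5],
    "ds_e": [55.0, 58.7, 57.1],
}
full = aggregate_score(acc, list(acc))
subset = aggregate_score(acc, ["ds_a", "ds_b", "ds_c", "ds_d"])  # arbitrary 4
print(f"subset vs. full agreement (Pearson r): {pearson(full, subset):.3f}")
```

Standardizing within each dataset keeps an easy dataset, where every model scores highly, from dominating the average. In a faithful reproduction, the paper's own xScore definition and its choice of four datasets would replace both placeholders here.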
Problem

Research questions and friction points this paper is trying to address.

Evaluating lightweight vision models' generalization beyond ImageNet benchmarks
Quantifying cross-dataset robustness with a unified metric, xScore
Identifying architectural components that drive generalization under resource constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the Cross-Dataset Score (xScore) metric for cross-dataset robustness
Evaluated lightweight models across seven diverse datasets
Identified architectural components promoting cross-domain generalization
Weidong Zhang
Samsung Research America
Computer Vision, Image Processing
Pak Lun Kevin Ding
Department of Computer Science, Arizona State University
Huan Liu
Department of Computer Science, Arizona State University