AI Summary
Lightweight vision models (e.g., MobileNet, ShuffleNet, EfficientNet) achieve high ImageNet accuracy, but their cross-domain robustness lacks systematic validation.
Method: We conduct unified training and evaluation of 11 state-of-the-art lightweight models across seven heterogeneous datasets to assess generalization beyond ImageNet. To address the weak correlation between ImageNet accuracy and cross-domain performance, we propose xScore, a lightweight, interpretable cross-dataset metric requiring only four source datasets to predict model generalization capability. We further analyze architectural factors influencing transferability.
Contribution/Results: We establish the first reproducible benchmark for evaluating cross-domain robustness of lightweight models. Empirical analysis reveals that isotropic convolutions and channel-wise attention enhance transferability, whereas Transformer modules yield limited gains under resource constraints. Our findings provide empirical guidance for designing efficient mobile vision architectures.
Abstract
Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been predominantly benchmarked on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across seven diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components, such as isotropic convolutions with higher spatial resolution and channel-wise attention, promote broader generalization, while Transformer-based blocks yield little additional benefit despite incurring higher parameter overhead. This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.
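The abstract does not give the xScore formula, so the following is only an illustrative sketch of the general idea: a cross-dataset score should reward high mean accuracy while penalizing inconsistency across domains. The dispersion-penalized mean below, and the accuracy values in the usage example, are assumptions for illustration, not the paper's actual definition.

```python
import statistics

def cross_dataset_score(accuracies):
    """Illustrative cross-dataset score (NOT the paper's xScore definition):
    mean accuracy across datasets, penalized by its population standard
    deviation so that erratic performance lowers the score."""
    mean_acc = statistics.mean(accuracies)
    spread = statistics.pstdev(accuracies)
    return mean_acc - spread

# Hypothetical per-dataset accuracies for one model on four source datasets.
accs = [0.78, 0.85, 0.71, 0.80]
score = cross_dataset_score(accs)

# A model that is consistent across domains outscores one with the same
# mean accuracy but large cross-dataset variance.
consistent = cross_dataset_score([0.80, 0.80, 0.80, 0.80])
erratic = cross_dataset_score([0.95, 0.65, 0.95, 0.65])
```

Any monotone penalty on dispersion (variance, range, worst-case drop) would express the same intuition; the key property is that two models with equal mean accuracy are separated by how evenly that accuracy is distributed across domains.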