Beyond Accuracy: What Matters in Designing Well-Behaved Models?

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning models for image classification often fall short on critical quality dimensions, including robustness, calibration, fairness, and domain generalization. Method: We conduct the first systematic evaluation of 326 models across nine quality dimensions, analyzing how training paradigms, architectures, and data scale influence “model well-behavedness.” We propose QUBA (Quality Understanding Beyond Accuracy), a composite metric that ranks models across dimensions, and establish the first benchmark comprehensively covering all nine. Contribution/Results: Our analysis shows that vision-language models significantly outperform conventional models in fairness and robustness to domain shift; that self-supervised learning improves nearly all of the considered quality dimensions; and that training-data scale is the primary driver of quality improvement. We further introduce a user-centric, multi-objective model selection framework that lets practitioners identify the models best matching their specific quality requirements. This work provides foundational insights and practical tools for holistic model evaluation and deployment in real-world applications.

📝 Abstract
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, including that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm for improving almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
Problem

Research questions and friction points this paper is trying to address.

Exploring general well-behavedness of DNNs beyond accuracy
Studying nine quality dimensions in image classification models
Introducing QUBA score for multi-dimensional model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneously studies nine quality dimensions
Analyzes 326 backbone models comprehensively
Introduces QUBA score for multi-dimensional ranking
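The paper does not spell out the QUBA aggregation in this summary, but the idea of ranking models across several quality dimensions with user-specific priorities can be sketched as follows. This is a hypothetical illustration, assuming min-max normalization per dimension followed by a user-weighted mean; all model names, dimension scores, and weights below are made up for demonstration.

```python
# Hypothetical sketch of a QUBA-style multi-dimensional ranking.
# Assumption: each dimension is min-max normalized across models,
# then combined as a weighted mean reflecting user priorities.

def normalize(scores):
    """Min-max normalize raw scores to [0, 1] (higher is better)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def quba_like_score(models, weights):
    """Rank models by a weighted mean of normalized quality dimensions.

    models:  {name: {dimension: raw_score}}, higher raw_score is better
    weights: {dimension: user-chosen importance weight}
    Returns a list of (score, name) tuples, best model first.
    """
    dims = list(weights)
    names = list(models)
    # Normalize each dimension across all models.
    norm = {}
    for d in dims:
        col = normalize([models[n][d] for n in names])
        for n, v in zip(names, col):
            norm.setdefault(n, {})[d] = v
    total = sum(weights.values())
    return sorted(
        ((sum(weights[d] * norm[n][d] for d in dims) / total, n)
         for n in names),
        reverse=True,
    )

# Illustrative usage with invented scores for three models; a user who
# cares most about fairness doubles that dimension's weight.
models = {
    "vit_clip": {"accuracy": 0.80, "fairness": 0.90, "robustness": 0.85},
    "resnet50": {"accuracy": 0.76, "fairness": 0.60, "robustness": 0.55},
    "ssl_dino": {"accuracy": 0.78, "fairness": 0.75, "robustness": 0.80},
}
weights = {"accuracy": 1.0, "fairness": 2.0, "robustness": 1.0}
for score, name in quba_like_score(models, weights):
    print(f"{name}: {score:.3f}")
```

The weighting step is what makes the ranking user-centric: changing the weights re-orders the models without recomputing any per-dimension measurements.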