🤖 AI Summary
Evaluating model robustness under distribution shift remains challenging due to uncontrolled group structure in existing benchmarks. Method: We propose a controllable group-structure modeling paradigm, introducing the Stylized Meta-Album (SMA) meta-dataset—comprising 24 image classification subsets (12 content-based + 12 stylized)—with cross-domain style injection via AdaIN/StyleGAN variants to explicitly induce tunable group biases. Contribution/Results: We pioneer configurable group-structure design enabling large-scale, fine-grained group diversity modeling; introduce Top-M worst-group accuracy as a hyperparameter optimization metric, substantially improving fairness and robustness evaluation under high-complexity group settings. Experiments on OOD fairness benchmarks reveal dramatic shifts in algorithm rankings with varying group diversity. Uncertainty quantification shows UDA error bars reduced by 73% (closed-set) and 28% (UniDA), demonstrating that controllability of group structure fundamentally affects evaluation validity and conclusions.
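The summary above mentions cross-domain style injection via AdaIN. As a point of reference, a minimal NumPy sketch of the AdaIN operation (Huang & Belongie's formulation: re-normalize content features to match the per-channel statistics of style features) is shown below; this is an illustrative sketch of the general technique, not the paper's actual stylization pipeline, and the function name and shapes are assumptions.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization (illustrative sketch).

    Re-normalizes `content` features so their per-channel mean/std
    match those of `style`. Both inputs have shape (C, H, W).
    """
    # Per-channel statistics over the spatial dimensions.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    # Whiten content, then re-color with the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean
```

In practice AdaIN is applied to intermediate encoder features rather than raw pixels, with a decoder mapping the re-normalized features back to an image.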
📝 Abstract
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets and 12 stylized datasets), designed to advance studies of out-of-distribution (OOD) generalization and related topics. Created by applying style transfer to 12 subject classification datasets, SMA provides a diverse and extensive set of 4,800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. While ideal data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large, configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity but also opens new methodological directions for evaluating model performance across diverse group and domain configurations, including scenarios with many minority groups, varying group imbalance, and complex domain shifts, and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to re-examine conclusions from existing benchmarks. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive, as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms.
We also propose *Top-M worst-group accuracy* as a new hyperparameter tuning metric, demonstrating broader fairness during optimization and delivering better final worst-group accuracy under larger group diversity. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73% in the closed-set setting and 28% in the UniDA setting) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
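A natural reading of the Top-M worst-group accuracy metric is the mean accuracy over the M worst-performing groups (generalizing the usual single worst-group accuracy, M = 1). The sketch below follows that reading; the function name and dict-based interface are illustrative assumptions, not the paper's exact definition.

```python
def top_m_worst_group_accuracy(group_acc, m):
    """Mean accuracy over the M worst-performing groups.

    group_acc: mapping from group identifier to that group's accuracy.
    m: number of worst groups to average over (m=1 recovers the
       standard worst-group accuracy).
    """
    if not 1 <= m <= len(group_acc):
        raise ValueError("m must be between 1 and the number of groups")
    # Take the m smallest per-group accuracies and average them.
    worst = sorted(group_acc.values())[:m]
    return sum(worst) / m
```

Averaging over several weak groups rather than only the single worst one makes the tuning signal less sensitive to noise in any one group, which matters when the number of groups is large, as in SMA's 4,800-group setting.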