🤖 AI Summary
This paper investigates how the robustness of machine learning models degrades under concurrent distribution shifts, specifically the co-occurrence of unseen domain shift and spurious correlations. To this end, we establish a benchmark spanning eight datasets, 168 source–target domain pairs, and 26 algorithms, involving over 100,000 trained and evaluated models. We propose a multi-source–multi-target shift construction framework and a statistical attribution analysis methodology. Our large-scale empirical study quantifies the compounding effect of concurrent shifts, which typically worsen performance relative to a single shift, with certain exceptions; shows that a model that improves generalization under one distribution shift tends to be effective under others; and finds that heuristic data augmentations achieve the best overall performance among the evaluated methods, which range up to zero-shot inference with foundation models, on both synthetic and real-world datasets. These findings offer practical guidance for robust modeling in complex, realistic deployment scenarios.
📝 Abstract
Machine learning models, meticulously optimized for source data, often fail to generalize to target data when faced with distribution shifts (DSs). Previous benchmarking studies, though extensive, have mainly focused on simple DSs. Recognizing that DSs often occur in more complex forms in real-world scenarios, we broadened our study to include multiple concurrent shifts, such as unseen domain shifts combined with spurious correlations. We evaluated 26 algorithms, ranging from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets.