🤖 AI Summary
Current out-of-distribution (OOD) generalization evaluation is vulnerable to test-domain contamination, which inflates performance estimates for large vision-language models. Method: We introduce LAION-Natural and LAION-Rendition, the first large-scale benchmark pair that strictly isolates style, enabling web-scale evaluation that cleanly separates natural photographs from artistic renditions. Using CLIP models, we conduct attribution analysis and systematic ablation studies with controlled domain mixing. Results: We expose significant overestimation of OOD generalization by conventional ImageNet-style benchmarks; show that web-scale pretraining amplifies this illusion; confirm that models remain heavily reliant on in-distribution samples; and identify a 1:1 natural-to-rendition mixing ratio that consistently improves cross-domain accuracy by 3.2%. This work delivers a reproducible benchmark, reveals fundamental bottlenecks in the real-world OOD robustness of foundation models, and provides empirically grounded guidance for data-mixing strategies.
📝 Abstract
Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model's OOD performance were designed to be strictly OOD with respect to style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test domain contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION -- LAION-Natural and LAION-Rendition -- that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from the ImageNet era still prevail and that training on web-scale data merely creates the illusion of OOD generalization. Furthermore, through a systematic exploration of combining natural and rendition datasets in varying proportions, we identify optimal mixing ratios for model generalization across these domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale -- a crucial prerequisite for improving model robustness.
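The mixing-ratio exploration described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name, signature, and sampling scheme are hypothetical, showing only the idea of subsampling two domain pools at a fixed natural-to-rendition ratio before training.

```python
import random

def mix_domains(natural, rendition, ratio=1.0, total=None, seed=0):
    """Combine samples from two domain pools at a natural:rendition ratio.

    ratio is natural-to-rendition, e.g. 1.0 for the 1:1 mix that the
    experiments identify as working well. Hypothetical helper for
    illustration; a real pipeline would mix image-text pairs at
    web scale rather than in-memory lists.
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    if total is None:
        total = len(natural) + len(rendition)
    # Split the budget according to the requested ratio.
    n_nat = round(total * ratio / (1 + ratio))
    n_ren = total - n_nat
    mixed = (rng.sample(natural, min(n_nat, len(natural)))
             + rng.sample(rendition, min(n_ren, len(rendition))))
    rng.shuffle(mixed)  # interleave domains for training
    return mixed
```

Sweeping `ratio` over a grid (e.g. 4:1, 2:1, 1:1, 1:2, 1:4) while holding `total` fixed is one way to run the kind of controlled domain-mixing ablation the abstract describes.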