🤖 AI Summary
Existing out-of-distribution (OOD) benchmarks such as ImageNet-C suffer from data leakage: their synthetic corruptions frequently appear in web-scale training data, so they no longer reliably assess true OOD robustness. Method: We introduce LAION-C, an OOD benchmark designed explicitly for web-scale vision models. It comprises six novel distortion types constructed to remain OOD even for web-scale datasets such as LAION, and a psychophysical experiment with human observers provides lab-quality data for calibrating their difficulty. We systematically evaluate a diverse suite of state-of-the-art models, including MLLMs such as Gemini and GPT-4o. Results: Contemporary models degrade substantially on LAION-C, revealing genuine OOD bottlenecks. Notably, the best models now match or exceed the best human observers on several corruptions, marking a paradigm shift in OOD generalization: from humans outperforming models to models matching or surpassing humans.
📝 Abstract
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, making it unclear whether models trained on web-scale datasets truly generalize better out of distribution or have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as an alternative to ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that LAION-C poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models against lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
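The evaluation protocol described above follows the usual ImageNet-C-style recipe: score a model separately on each corruption type, then aggregate. A minimal sketch of that scoring logic is below; the corruption names and predictions are illustrative placeholders, not the actual LAION-C distortions or results.

```python
# Sketch of ImageNet-C-style benchmark scoring: top-1 accuracy is computed
# per corruption type, then averaged across types. Corruption names and the
# toy predictions here are hypothetical, for illustration only.

def top1_accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def benchmark_score(results):
    """Per-corruption accuracies and their mean across corruption types.

    `results` maps corruption name -> (predictions, labels).
    """
    per_type = {c: top1_accuracy(p, y) for c, (p, y) in results.items()}
    mean_acc = sum(per_type.values()) / len(per_type)
    return mean_acc, per_type

# Toy example with two made-up corruption types and four images each.
results = {
    "corruption_a": ([0, 1, 2, 2], [0, 1, 2, 3]),  # 3/4 correct
    "corruption_b": ([0, 0, 2, 3], [0, 1, 2, 3]),  # 3/4 correct
}
mean_acc, per_type = benchmark_score(results)  # mean_acc == 0.75
```

The same per-corruption breakdown is what enables the human-vs-model comparison in the abstract: human observers and models are scored on identical images, so accuracies are directly comparable per distortion type.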