🤖 AI Summary
Existing out-of-distribution (OOD) benchmarks such as ImageNet-C suffer from data leakage: their synthetic corruptions frequently appear in web-scale training data, so they no longer reliably assess true OOD robustness. Method: We introduce LAION-C, an OOD benchmark designed explicitly for web-scale vision models. It comprises six novel distortion types constructed to remain OOD even for web-scale datasets such as LAION, and a psychophysical experiment with human observers provides lab-quality data for calibrating their difficulty. We systematically evaluate a diverse suite of state-of-the-art models, including MLLMs such as Gemini and GPT-4o. Results: Contemporary models degrade substantially on LAION-C, revealing genuine OOD bottlenecks. Notably, the best models now match or exceed the best human observers on several corruptions, marking a paradigm shift in OOD generalization: from humans outperforming models to models matching or surpassing humans.
📝 Abstract
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, making it unclear whether models trained on web-scale datasets truly generalize better out of distribution or have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as an alternative to ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that LAION-C poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models against lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
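The evaluation protocol described above follows the usual ImageNet-C-style recipe: score a model separately on each corruption type, then aggregate. A minimal sketch of that scoring logic is below; the corruption names and predictions are illustrative placeholders, not the actual LAION-C distortions or results.

```python
# Sketch of ImageNet-C-style benchmark scoring: top-1 accuracy is computed
# per corruption type, then averaged across types. Corruption names and the
# toy predictions here are hypothetical, for illustration only.

def top1_accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def benchmark_score(results):
    """Per-corruption accuracies and their mean across corruption types.

    `results` maps corruption name -> (predictions, labels).
    """
    per_type = {c: top1_accuracy(p, y) for c, (p, y) in results.items()}
    mean_acc = sum(per_type.values()) / len(per_type)
    return mean_acc, per_type

# Toy example with two made-up corruption types and four images each.
results = {
    "corruption_a": ([0, 1, 2, 2], [0, 1, 2, 3]),  # 3/4 correct
    "corruption_b": ([0, 0, 2, 3], [0, 1, 2, 3]),  # 3/4 correct
}
mean_acc, per_type = benchmark_score(results)  # mean_acc == 0.75
```

The same per-corruption breakdown is what enables the human-vs-model comparison in the abstract: human observers and models are scored on identical images, so accuracies are directly comparable per distortion type.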