🤖 AI Summary
This work addresses the limited real-world coverage of existing out-of-distribution (OOD) detection evaluation protocols. We propose a comprehensive evaluation framework that jointly accounts for semantic shift and covariate shift, and introduce three novel, practically motivated benchmarks: ImageNet-X (challenging semantic shift), ImageNet-FS-X (full-spectrum OOD detection under covariate shift), and Wilds-FS-X (full-spectrum OOD detection on real-world datasets). To our knowledge, this is the first systematic assessment of vision-language models, particularly CLIP-based methods, under such diverse distributional shifts. Extensive experiments reveal that state-of-the-art CLIP-based OOD detectors degrade to varying degrees across all three benchmarks and that no single method is robust across every shift type, confirming the discriminative power and difficulty of the benchmarks. Our framework establishes a more realistic, rigorous, and reproducible evaluation paradigm for OOD detection research.
📝 Abstract
Out-of-distribution (OOD) detection aims to identify OOD samples at inference time to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, which assesses robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and that no single method consistently outperforms the others. We hope the community moves beyond specific benchmarks and adopts more challenging conditions that reflect real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Banchmarks.
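To make the evaluated setting concrete, the following is a minimal sketch of how a typical CLIP-based zero-shot OOD detector scores a sample: cosine similarities between an image embedding and the class-prompt text embeddings are softmax-normalized, and the maximum probability serves as an in-distribution score (in the style of maximum-concept-matching scoring). This is an illustrative sketch on synthetic unit-normalized embeddings, not the paper's implementation; the function name `mcm_score`, the dimensions, and the temperature value are assumptions for the demo.

```python
import numpy as np

def mcm_score(image_feat, text_feats, temperature=1.0):
    """MCM-style OOD score: maximum softmax probability over cosine
    similarities between one image embedding and K class text embeddings.
    Higher score -> more likely in-distribution."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per class
    logits = sims / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return float(probs.max())

# Toy demo with synthetic embeddings (not real CLIP features):
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 512))                   # 5 class prompts
id_image = text_feats[2] + 0.1 * rng.normal(size=512)    # near class 2
ood_image = rng.normal(size=512)                         # unrelated sample
s_id = mcm_score(id_image, text_feats)
s_ood = mcm_score(ood_image, text_feats)
assert s_id > s_ood  # ID sample scores higher than the OOD sample
```

A sample is then flagged as OOD when its score falls below a threshold chosen on held-out in-distribution data; the benchmarks above vary whether the "OOD" samples differ semantically, covariately, or both.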