🤖 AI Summary
This paper addresses the out-of-distribution (OOD) detection challenge in vision-language models (e.g., CLIP). To this end, it proposes the first taxonomy specifically designed for multimodal OOD detection—orthogonally categorizing methods along two dimensions: OOD images (seen vs. unseen) and OOD texts (known vs. unknown), further distinguishing training-free from training-dependent strategies. Leveraging vision-language pretrained models, the work systematically reviews over 100 approaches through the lenses of cross-modal alignment analysis, zero-/few-shot learning theory, and standardized OOD evaluation protocols. Key limitations are identified: weak cross-domain generalization, low deployment efficiency, and insufficient theoretical interpretability. The study delivers a structured research roadmap and a unified analytical benchmark for multimodal OOD detection, enabling principled comparison and guiding future advances in robust, scalable, and interpretable multimodal OOD methods.
📝 Abstract
Out-of-distribution (OOD) detection is a pivotal task for real-world applications: models must identify test samples that are distributionally different from the in-distribution (ID) data seen during training. Recent advances in AI, particularly Vision-Language Models (VLMs) like CLIP, have revolutionized OOD detection by shifting from traditional unimodal image detectors to multimodal image-text detectors. This shift has inspired extensive research; however, existing categorization schemes (e.g., few- or zero-shot types) still rely solely on the availability of ID images, adhering to a unimodal paradigm. To better align with CLIP's cross-modal nature, we propose a new categorization framework rooted in both image and text modalities. Specifically, we categorize existing methods based on how visual and textual information of OOD data is utilized within the image and text modalities, dividing them into four groups: OOD Images (i.e., outliers) Seen or Unseen, and OOD Texts (i.e., learnable vectors or class names) Known or Unknown, across two training strategies (i.e., training-free or training-required). More importantly, we discuss open problems in CLIP-like OOD detection and highlight promising directions for future research, including cross-domain integration, practical applications, and theoretical understanding.
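To make the image-text detection paradigm concrete, below is a minimal sketch of a training-free, zero-shot OOD score in the spirit of maximum concept matching (MCM): the image embedding is compared against text embeddings of the ID class names, and a low maximum softmax similarity flags the input as OOD. Random placeholder vectors stand in for a real CLIP image/text encoder, and the function name and temperature value are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def mcm_score(image_emb, text_embs, temperature=1.0):
    """Zero-shot OOD score: max softmax over image-text cosine similarities.

    A high score suggests the image matches some ID class prompt (likely ID);
    a low score suggests no ID class fits well (likely OOD).
    """
    # L2-normalize so dot products become cosine similarities (as in CLIP).
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # one similarity per ID class prompt
    logits = sims / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs.max()

# Placeholder embeddings standing in for CLIP's encoders.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(5, 512))                 # prompts for 5 ID classes
id_image = text_embs[2] + 0.1 * rng.normal(size=512)  # aligned with class 2
ood_image = rng.normal(size=512)                      # unrelated direction

print(mcm_score(id_image, text_embs) > mcm_score(ood_image, text_embs))
```

Thresholding this score yields the ID/OOD decision; because only ID class names are needed, this sits in the survey's "OOD Images Unseen, OOD Texts Unknown, training-free" group.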