Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a key limitation of existing vision-language model (VLM)-based out-of-distribution (OOD) detection methods: negative samples are selected using intra-modal distances, which are misaligned with the inter-modal distance that CLIP-like models are optimized for, constraining detection performance. To resolve this, the authors propose InterNeg, a framework that, for the first time, builds inter-modal distance consistency into negative sample construction. On the textual side, negative texts are selected according to an inter-modal criterion; on the visual side, high-confidence OOD images are inverted into text embeddings to generate additional negative texts. Aligning the detection mechanism with the VLM's optimization objective yields state-of-the-art results: a 3.47% reduction in FPR95 on ImageNet and a 5.50% improvement in AUROC on Near-OOD benchmarks.
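The inter-modal selection criterion lends itself to a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering, not the paper's released code: candidate negative texts are ranked by their cosine similarity to ID image embeddings in CLIP's shared space, and the texts farthest from the ID images are kept as negatives, so selection uses the same inter-modal distance the model was trained on. All function and variable names are assumptions.

```python
# Hypothetical sketch of inter-modal negative text selection (not the
# authors' implementation). Candidates are scored against ID *images*
# (inter-modal), rather than against ID label texts (intra-modal).
import torch
import torch.nn.functional as F

def select_negative_texts(cand_text_emb: torch.Tensor,  # (C, d) candidate text embeddings
                          id_image_emb: torch.Tensor,   # (N, d) ID image embeddings
                          k: int) -> torch.Tensor:
    """Return indices of the k candidates farthest from the ID images."""
    cand = F.normalize(cand_text_emb, dim=-1)
    imgs = F.normalize(id_image_emb, dim=-1)
    sim = cand @ imgs.T                # (C, N) inter-modal cosine similarities
    # Score each candidate by its highest similarity to any ID image;
    # a low score means the text lies far from the ID region.
    score = sim.max(dim=1).values      # (C,)
    return torch.topk(score, k, largest=False).indices
```

Scoring by the maximum similarity to any ID image (rather than the mean) is one plausible choice: it penalizes a candidate that sits close to even a single ID class.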

πŸ“ Abstract
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically enforces consistent inter-modal distance from both textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
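Two pieces of the pipeline described above can be illustrated with short sketches. First, a common way to turn ID labels plus negative texts into an OOD score in negative-label methods (e.g., NegLabel) is to softmax the image-text similarities over both groups and threshold the ID probability mass. The abstract does not give InterNeg's exact scoring rule, so the snippet below is an assumed baseline formulation with hypothetical names.

```python
# Assumed negative-label OOD score over CLIP embeddings; InterNeg's exact
# scoring rule is not specified in the abstract.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_score(image_emb: torch.Tensor,     # (d,) test image embedding
              id_text_emb: torch.Tensor,   # (K, d) ID label text embeddings
              neg_text_emb: torch.Tensor,  # (M, d) negative text embeddings
              temperature: float = 0.01) -> torch.Tensor:
    """Higher score -> more likely in-distribution."""
    img = F.normalize(image_emb, dim=-1)
    texts = F.normalize(torch.cat([id_text_emb, neg_text_emb], dim=0), dim=-1)
    probs = ((img @ texts.T) / temperature).softmax(dim=-1)  # (K + M,)
    # Probability mass assigned to ID labels; a low value flags the image as OOD.
    return probs[: id_text_emb.shape[0]].sum()
```

Second, the visual-side step inverts high-confidence OOD images into the textual space. A faithful implementation would likely optimize token embeddings through the text encoder; the simplified sketch below (reusing the imports above) instead optimizes a free embedding directly in the shared space to match the image under the inter-modal cosine distance, which conveys the idea under that assumption.

```python
def invert_image_to_text(image_emb: torch.Tensor,  # (d,) high-confidence OOD image
                         steps: int = 100, lr: float = 0.1) -> torch.Tensor:
    """Fit a pseudo text embedding to an OOD image by cosine similarity."""
    target = F.normalize(image_emb.detach(), dim=-1)
    pseudo = torch.randn_like(target, requires_grad=True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Minimize the inter-modal (cosine) distance to the OOD image.
        loss = 1 - F.cosine_similarity(pseudo, target, dim=0)
        loss.backward()
        opt.step()
    # The result can be appended to neg_text_emb as an extra negative.
    return F.normalize(pseudo.detach(), dim=-1)
```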
Problem

Research questions and friction points this paper is trying to address.

Out-of-distribution detection
Vision-Language Models
Inter-modal distance
Negative text selection
Distance consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inter-modal distance consistency
Negative text selection
Vision-language models
Out-of-distribution detection
Text-to-image inversion
🔎 Similar Papers
No similar papers found.
Zhikang Xu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Qianqian Xu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
Zitai Wang
Institute of Computing Technology, Chinese Academy of Sciences
Machine learning, Data mining, AUC optimization
Cong Hua
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning
Sicong Li
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Zhiyong Yang
Qingming Huang
University of the Chinese Academy of Sciences
Multimedia Analysis and Retrieval, Image and Video Processing, Pattern Recognition, Computer Vision, Video Coding