Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

📅 2026-03-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a key limitation of existing vision-language model (VLM)-based out-of-distribution (OOD) detection methods: negative samples are selected using intra-modal distances, which are misaligned with the inter-modal distance that CLIP-like models are optimized for, constraining detection performance. To resolve this, the authors propose InterNeg, a framework that, for the first time, builds inter-modal distance consistency into negative sample construction. On the textual side, negative texts are selected according to an inter-modal criterion; on the visual side, high-confidence OOD images are inverted into text embeddings to generate additional negative texts. Aligning the detection mechanism with the VLM's optimization objective yields state-of-the-art results: a 3.47% reduction in FPR95 on ImageNet and a 5.50% improvement in AUROC on Near-OOD benchmarks.
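The inter-modal selection criterion lends itself to a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering, not the paper's released code: candidate negative texts are ranked by their cosine similarity to ID image embeddings in CLIP's shared space, and the texts farthest from the ID images are kept as negatives, so selection uses the same inter-modal distance the model was trained on. All function and variable names are assumptions.

```python
# Hypothetical sketch of inter-modal negative text selection (not the
# authors' implementation). Candidates are scored against ID *images*
# (inter-modal), rather than against ID label texts (intra-modal).
import torch
import torch.nn.functional as F

def select_negative_texts(cand_text_emb: torch.Tensor,  # (C, d) candidate text embeddings
                          id_image_emb: torch.Tensor,   # (N, d) ID image embeddings
                          k: int) -> torch.Tensor:
    """Return indices of the k candidates farthest from the ID images."""
    cand = F.normalize(cand_text_emb, dim=-1)
    imgs = F.normalize(id_image_emb, dim=-1)
    sim = cand @ imgs.T                # (C, N) inter-modal cosine similarities
    # Score each candidate by its highest similarity to any ID image;
    # a low score means the text lies far from the ID region.
    score = sim.max(dim=1).values      # (C,)
    return torch.topk(score, k, largest=False).indices
```

Scoring by the maximum similarity to any ID image (rather than the mean) is one plausible choice: it penalizes a candidate that sits close to even a single ID class.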

πŸ“ Abstract
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically enforces consistent inter-modal distance from both textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
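Two pieces of the pipeline described above can be illustrated with short sketches. First, a common way to turn ID labels plus negative texts into an OOD score in negative-label methods (e.g., NegLabel) is to softmax the image-text similarities over both groups and threshold the ID probability mass. The abstract does not give InterNeg's exact scoring rule, so the snippet below is an assumed baseline formulation with hypothetical names.

```python
# Assumed negative-label OOD score over CLIP embeddings; InterNeg's exact
# scoring rule is not specified in the abstract.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_score(image_emb: torch.Tensor,     # (d,) test image embedding
              id_text_emb: torch.Tensor,   # (K, d) ID label text embeddings
              neg_text_emb: torch.Tensor,  # (M, d) negative text embeddings
              temperature: float = 0.01) -> torch.Tensor:
    """Higher score -> more likely in-distribution."""
    img = F.normalize(image_emb, dim=-1)
    texts = F.normalize(torch.cat([id_text_emb, neg_text_emb], dim=0), dim=-1)
    probs = ((img @ texts.T) / temperature).softmax(dim=-1)  # (K + M,)
    # Probability mass assigned to ID labels; a low value flags the image as OOD.
    return probs[: id_text_emb.shape[0]].sum()
```

Second, the visual-side step inverts high-confidence OOD images into the textual space. A faithful implementation would likely optimize token embeddings through the text encoder; the simplified sketch below (reusing the imports above) instead optimizes a free embedding directly in the shared space to match the image under the inter-modal cosine distance, which conveys the idea under that assumption.

```python
def invert_image_to_text(image_emb: torch.Tensor,  # (d,) high-confidence OOD image
                         steps: int = 100, lr: float = 0.1) -> torch.Tensor:
    """Fit a pseudo text embedding to an OOD image by cosine similarity."""
    target = F.normalize(image_emb.detach(), dim=-1)
    pseudo = torch.randn_like(target, requires_grad=True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Minimize the inter-modal (cosine) distance to the OOD image.
        loss = 1 - F.cosine_similarity(pseudo, target, dim=0)
        loss.backward()
        opt.step()
    # The result can be appended to neg_text_emb as an extra negative.
    return F.normalize(pseudo.detach(), dim=-1)
```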
Problem

Research questions and friction points this paper is trying to address.

Out-of-distribution detection
Vision-Language Models
Inter-modal distance
Negative text selection
Distance consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inter-modal distance consistency
Negative text selection
Vision-language models
Out-of-distribution detection
Text-to-image inversion
🔎 Similar Papers
No similar papers found.
Zhikang Xu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Qianqian Xu
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
Zitai Wang
Institute of Computing Technology, Chinese Academy of Sciences
Machine learning, Data mining, AUC optimization
Cong Hua
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning
Sicong Li
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Zhiyong Yang
Qingming Huang
University of the Chinese Academy of Sciences
Multimedia Analysis and Retrieval, Image and Video Processing, Pattern Recognition, Computer Vision, Video Coding