One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

📅 2026-04-30

🤖 AI Summary
This work examines a vulnerability in cross-modal embedding spaces: the hubness phenomenon, in which certain text embeddings show spuriously high similarity to a large number of unrelated images, undermining the reliability of models like CLIP in image–text retrieval and evaluation tasks. The authors propose a method for identifying the hub embedding and its corresponding "hub" text, and validate it on image captioning evaluation (MSCOCO, nocaps) and image-to-text retrieval (MSCOCO, Flickr30k). Their findings show that a single hub text can achieve similarity scores that are comparable to or higher than those of human-written reference captions for many images. These results expose a structural weakness in current cross-modal encoders and call into question the trustworthiness of their similarity-based assessments.
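As a rough illustration of the comparison described above, the sketch below computes the fraction of images on which a candidate text scores at least as high as the human-written reference caption. The function name `hub_win_rate` and the score values are hypothetical toy inputs, not results from the paper, and the paper's own evaluation protocol may differ.

```python
def hub_win_rate(hub_scores, reference_scores):
    """Fraction of images where the candidate text scores >= the reference caption.

    Both arguments are per-image similarity scores (one entry per image),
    assumed to come from the same cross-modal encoder.
    """
    if len(hub_scores) != len(reference_scores):
        raise ValueError("score lists must have one entry per image")
    if not hub_scores:
        raise ValueError("at least one image is required")
    wins = sum(1 for h, r in zip(hub_scores, reference_scores) if h >= r)
    return wins / len(hub_scores)


# Toy values for illustration only (hypothetical, not taken from the paper).
hub_scores = [0.31, 0.29, 0.33, 0.27]
reference_scores = [0.30, 0.32, 0.28, 0.27]
rate = hub_win_rate(hub_scores, reference_scores)
print(rate)
```

A rate close to 1 for a text unrelated to the images would be a symptom of the behavior the summary describes.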
📝 Abstract
The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. Cross-modal similarity between text and images cannot be calculated by direct comparisons such as string matching, so many cross-modal applications rely on encoders that project different modalities into a shared space; the existence of hubs in that space may therefore pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps, along with image-to-text retrieval tasks in MSCOCO and Flickr30k, showed that our method can identify a single hub text that unreasonably achieves comparable or higher similarity scores than human-written reference captions for many images, thereby revealing the vulnerabilities in cross-modal encoders.
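To make the notion of a hub concrete, here is a minimal sketch of the standard k-occurrence count: for each text, the number of images that rank it among their k most similar texts. This is a generic hubness diagnostic, not the identification method proposed in the paper, and the 2-D embeddings are toy values chosen for illustration; `cosine` and `k_occurrence` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def k_occurrence(text_embs, image_embs, k=1):
    """For each text, count how many images place it in their top-k most similar texts."""
    counts = [0] * len(text_embs)
    for img in image_embs:
        sims = [cosine(img, txt) for txt in text_embs]
        # Indices of the k texts most similar to this image.
        top_k = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        for i in top_k:
            counts[i] += 1
    return counts


# Toy 2-D embeddings (hypothetical): text 0 lies "between" the other two texts.
text_embs = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.6], [0.6, 0.9], [0.7, 0.7], [1.0, 0.1]]
counts = k_occurrence(text_embs, image_embs, k=1)
print(counts)
```

In this toy setup, text 0 is the nearest text for three of the four images, so its count is far above the others. A strongly skewed count distribution of this kind is the usual signature of hubness.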
Problem

Research questions and friction points this paper is trying to address.

hubness
cross-modal encoders
embedding vulnerability
image-text similarity
multimodal representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hubness
cross-modal encoder
CLIP vulnerability
embedding space
image-text retrieval