One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines a vulnerability in cross-modal embedding spaces: hubness, in which certain text embeddings show spuriously high similarity to many unrelated images, undermining the reliability of models such as CLIP in image-text retrieval and automatic evaluation. The authors propose a method for identifying a hub embedding and its corresponding hub text, and test it on image captioning evaluation (MSCOCO, nocaps) and image-to-text retrieval (MSCOCO, Flickr30k). They find that a single hub text can achieve similarity scores on many images that match or exceed those of human-written reference captions. These results expose a weakness in current cross-modal encoders and call into question the reliability of similarity-based assessments built on them.

📝 Abstract

The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat to applications such as information retrieval and automatic evaluation metrics. Because cross-modal similarity between text and images cannot be calculated by direct comparison, such as string matching, cross-modal encoders that project different modalities into a shared space underpin many cross-modal applications, so the existence of hubs in that space is a practical concern. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps, along with image-to-text retrieval tasks in MSCOCO and Flickr30k, showed that our method can identify a single hub text that unreasonably achieves comparable or higher similarity scores than human-written reference captions for many images, thereby revealing the vulnerabilities in cross-modal encoders.
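To make the notion of a hub concrete, here is a minimal toy sketch in Python with NumPy. It is not the paper's method: the data are synthetic stand-ins for encoder outputs, and the helper names (`normalize`, `hub_candidate`, `k_occurrence`) and the normalized-mean construction are assumptions chosen for illustration. It only shows how a single embedding can score well against many images at once, and it stops at the embedding level rather than recovering a hub text.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def hub_candidate(image_embs):
    """Unit vector with the highest mean cosine similarity to the unit-norm rows.

    Illustrative assumption: the normalized mean is used as a hub candidate.
    """
    return normalize(image_embs.mean(axis=0))

def k_occurrence(query_embs, candidate_embs, k=1):
    """Count how many queries rank each candidate among their top-k neighbours."""
    sims = query_embs @ candidate_embs.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=candidate_embs.shape[0])

# Synthetic stand-ins for encoder outputs (hypothetical data, not real CLIP features).
rng = np.random.default_rng(0)
n, d = 500, 64
shared = rng.normal(size=d)  # common direction shared by all toy image embeddings
images = normalize(rng.normal(size=(n, d)) + 1.5 * shared)
# Each toy "reference caption" is a noisy copy of its paired image embedding.
references = normalize(images + 0.1 * rng.normal(size=(n, d)))

hub = hub_candidate(images)
hub_scores = images @ hub                          # cosine similarity of hub to every image
ref_scores = np.sum(images * references, axis=1)   # cosine similarity of each paired reference
win_rate = float(np.mean(hub_scores >= ref_scores))

# Treat the hub as one more retrievable text and count its top-1 appearances.
candidates = np.vstack([references, hub])
counts = k_occurrence(images, candidates, k=1)

print(f"hub matches or beats the paired reference on {win_rate:.1%} of toy images")
print(f"hub is the top-1 text for {counts[-1]} of {n} toy images")
```

The count returned by `k_occurrence` is a common way to quantify hubness: a candidate that appears in far more top-k lists than its peers behaves as a hub. With real encoders, `images` and `references` would be replaced by actual image and caption embeddings.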

Problem

Research questions and friction points this paper is trying to address.

hubness

cross-modal encoders

embedding vulnerability

image-text similarity

multimodal representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

hubness

cross-modal encoder

CLIP vulnerability