🤖 AI Summary
This study identifies a critical vulnerability of the CLIP text encoder to non-semantic textual perturbations in cross-modal retrieval. Method: Using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection, we systematically evaluate the impact of lexical, syntactic, and semantic perturbations, including punctuation, capitalization, word order, and synonym substitution, on cross-modal ranking stability. Contribution/Results: Superficial perturbations, particularly syntactic ones, induce severe performance degradation across CLIP variants, with Top-K accuracy dropping by over 20%, revealing extreme sensitivity to minor textual alterations. Crucially, this work introduces ranking stability as a core robustness metric for vision-language models and advocates its integration into standard evaluation protocols. It further establishes an empirical foundation and benchmarking framework for designing perturbation-resilient text encoders in multimodal retrieval.
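The perturbation classes named above can be illustrated with simple string transformations. A minimal sketch, assuming hypothetical helper names; the paper's exact perturbation set and implementation may differ:

```python
import random
import string

def perturb_punctuation(query: str) -> str:
    # Lexical perturbation: strip all punctuation characters from the query.
    return query.translate(str.maketrans("", "", string.punctuation))

def perturb_case(query: str) -> str:
    # Lexical perturbation: lowercase the entire query.
    return query.lower()

def perturb_word_order(query: str, seed: int = 0) -> str:
    # Syntactic perturbation: shuffle word order while keeping the word set.
    words = query.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

query = "A person riding a bicycle, at night."
print(perturb_punctuation(query))  # "A person riding a bicycle at night"
print(perturb_case(query))
print(perturb_word_order(query))
```

Each perturbed variant preserves (or, for word-order shuffles, only superficially alters) the query's meaning for a human reader, which is what makes the resulting ranking changes a robustness signal rather than a semantic one.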
📝 Abstract
Multimodal co-embedding models, most prominently CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, models trained with a contrastive alignment objective can be unstable under small input perturbations. This is especially problematic for manually formulated queries, where minor variations in phrasing can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while even trivial surface edits such as punctuation and case changes induce measurable brittleness. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
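Ranking stability of this kind can be operationalized as the overlap between the top-K results retrieved for the original and the perturbed query. A minimal sketch with synthetic vectors standing in for CLIP text and video embeddings; the helper names and this particular overlap formulation are illustrative, not the paper's exact metric:

```python
import numpy as np

def rank_items(query_emb: np.ndarray, item_embs: np.ndarray) -> np.ndarray:
    # Rank item indices by descending cosine similarity to the query embedding.
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    return np.argsort(-(items @ q))

def top_k_overlap(rank_a: np.ndarray, rank_b: np.ndarray, k: int = 10) -> float:
    # Fraction of items shared between the two top-k lists (1.0 = fully stable).
    return len(set(rank_a[:k].tolist()) & set(rank_b[:k].tolist())) / k

# Toy demo: a small perturbation of the query vector shifts the ranking.
rng = np.random.default_rng(0)
items = rng.normal(size=(100, 8))          # stand-in for video-clip embeddings
q_orig = rng.normal(size=8)                # stand-in for the original query
q_pert = q_orig + 0.05 * rng.normal(size=8)  # stand-in for the perturbed query
stability = top_k_overlap(rank_items(q_orig, items),
                          rank_items(q_pert, items), k=10)
print(f"top-10 overlap: {stability:.2f}")
```

In a real evaluation the two query embeddings would come from a CLIP text encoder applied to the original and perturbed query strings, and the metric would be averaged over queries and perturbation classes.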