Closing the Modality Gap for Mixed Modality Search

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In mixed modality search, a pronounced modality gap between image and text embeddings in CLIP's embedding space causes intra-modal ranking bias and inter-modal fusion failure. To address this, the paper proposes GR-CLIP, a lightweight, plug-and-play post-hoc calibration method that removes the modality gap without altering the model architecture or requiring retraining; it combines geometric remapping with contrastive consistency constraints to align the two modalities in the shared latent space. Evaluated on the newly constructed mixed modality benchmark MixBench, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute. The result is an efficient, deployable approach to cross-modal alignment with minimal inference cost and no model modification.

📝 Abstract
Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
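The abstract reports gains in NDCG@10, the standard ranking metric for retrieval evaluation. As a quick reference, here is a minimal sketch of NDCG@k; the function name and inputs are illustrative, not from the paper, and the graded-relevance list is assumed to be given in the system's ranked order.

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    relevances: graded relevance of the retrieved items, listed in the
    ranking order produced by the system (index 0 = top-ranked item).
    """
    rel = np.asarray(relevances, dtype=float)[:k]
    # Logarithmic position discounts: 1/log2(rank+1) for ranks 1..k.
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    # Ideal DCG: the same relevances sorted into the best possible order.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

A "26 percentage point" improvement means this score, averaged over queries and expressed in percent, rises by 26 points (e.g. from 0.40 to 0.66).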
Problem

Research questions and friction points this paper is trying to address.

Addressing modality gap in mixed modality search
Improving cross-modal retrieval performance in CLIP
Reducing intra-modal bias and inter-modal fusion failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive vision-language models for search
Post-hoc calibration to remove modality gap
Lightweight method improves ranking significantly
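The page does not spell out GR-CLIP's calibration procedure, but the simplest post-hoc baseline for closing a modality gap is per-modality mean-centering followed by re-normalization to the unit sphere. The sketch below illustrates that baseline on synthetic data; the function name and the synthetic offset are assumptions for illustration, not the paper's method.

```python
import numpy as np

def center_modalities(img_emb, txt_emb):
    # Subtract each modality's centroid so both clusters share an origin,
    # then re-normalize rows to the unit sphere (as CLIP embeddings are).
    img_c = img_emb - img_emb.mean(axis=0, keepdims=True)
    txt_c = txt_emb - txt_emb.mean(axis=0, keepdims=True)
    img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
    txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
    return img_c, txt_c

# Synthetic unit embeddings with an artificial modality offset.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 64))
img = base + 5.0   # image cluster shifted one way
txt = base - 5.0   # text cluster shifted the other way
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

gap_before = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
img_c, txt_c = center_modalities(img, txt)
gap_after = np.linalg.norm(img_c.mean(axis=0) - txt_c.mean(axis=0))
```

Because the correction only shifts and rescales existing embeddings, it needs no retraining and adds negligible cost at query time, which is consistent with the "lightweight, post-hoc" framing above.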