🤖 AI Summary
Cross-modal retrieval models (e.g., CLIP) suffer significant performance degradation when handling fused multimodal keys—such as Wikipedia pages combining text and images—and existing approaches rely on manually constructed image–text triplets, limiting generalizability and scalability. This paper proposes a universal multimodal retrieval framework trained end-to-end on standard image–text paired data via a generalized contrastive learning (GCL) loss, eliminating the need for data reconstruction or custom triplet design. GCL unifies the modeling of arbitrary modality combinations within each mini-batch, enabling joint optimization toward a shared multimodal embedding space and substantially improving zero-shot generalization to unseen modality compositions. Applied to VISTA, CLIP, and TinyCLIP models, the method consistently improves retrieval performance across benchmarks—including M-BEIR, MMEB, and CoVR—demonstrating its universality and effectiveness.
📝 Abstract
Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of fused image–text modalities (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has recently been explored to develop a single unified retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image–text triplets (e.g., retrieving an image–text pair given a query image). However, such an approach requires careful curation to ensure dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.
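The core idea from the abstract—contrasting every modality combination within a mini-batch using only image–caption pairs—can be sketched roughly as follows. This is an illustrative approximation in NumPy, not the paper's exact formulation: the fused embedding here is taken as the normalized mean of the image and text embeddings, and the uniform averaging over view pairs is an assumption; the paper's actual fusion and weighting may differ.

```python
import numpy as np

def l2norm(x):
    """Normalize each row to unit length, as is standard in contrastive retrieval."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(q, k, tau=0.07):
    """Standard InfoNCE loss: the i-th query should match the i-th key
    against all other keys in the mini-batch."""
    logits = (q @ k.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def gcl_loss(img_emb, txt_emb, tau=0.07):
    """Hypothetical GCL sketch: build image, text, and fused views from an
    image-caption batch, then apply contrastive loss across all ordered
    pairs of distinct views (image<->text, image<->fused, text<->fused)."""
    img = l2norm(img_emb)
    txt = l2norm(txt_emb)
    fused = l2norm(img + txt)  # assumed fusion: normalized mean of the two modalities
    views = [img, txt, fused]
    total, count = 0.0, 0
    for i, q in enumerate(views):
        for j, k in enumerate(views):
            if i != j:
                total += info_nce(q, k, tau)
                count += 1
    return total / count  # uniform average over view pairs (an assumption)
```

Because all views are contrasted within one mini-batch of ordinary image-caption pairs, no curated triplets are needed, and the fused view pulls mixed-modality keys into the same embedding space as unimodal ones.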