Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

📅 2025-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

CLIP's effectiveness is limited by text token truncation, separately encoded images and text, and weak compositionality caused by bag-of-words behavior. To address these limitations, we propose UniME (Universal Multimodal Embedding), a two-stage framework that uses multimodal large language models (MLLMs) to learn discriminative, transferable embeddings. In Stage I, textual discriminative knowledge distillation from a powerful LLM-based teacher improves the embedding capability of the MLLM's language component. In Stage II, hard negative enhanced instruction tuning first filters out likely false negatives and then samples multiple hard negatives per instance within each batch, which strengthens both discriminative power and instruction-following ability. Evaluated on the MMEB benchmark and on short caption, long caption, and compositional retrieval tasks, UniME improves performance consistently across all tasks and shows stronger discriminative and compositional capabilities.

📝 Abstract

The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we first mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
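The first stage can be illustrated with a minimal sketch. The abstract does not state the distillation objective, so this example assumes one common choice: for a batch of texts, match the student's in-batch similarity distribution to the teacher's with a KL divergence. The function names, the temperature value, and the use of NumPy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def _log_softmax(x):
    # Row-wise log-softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def similarity_distillation_loss(student, teacher, temperature=0.02):
    """KL(teacher || student) over in-batch text-to-text similarity rows.

    student: (B, Ds) text embeddings from the model being trained.
    teacher: (B, Dt) text embeddings from the teacher for the same texts.
    Ds and Dt may differ because only (B, B) similarity matrices are compared.
    The objective and the temperature are assumptions made for illustration.
    """
    # L2-normalise so that dot products are cosine similarities.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    log_p_student = _log_softmax(s @ s.T / temperature)
    log_p_teacher = _log_softmax(t @ t.T / temperature)
    p_teacher = np.exp(log_p_teacher)
    # Per-row KL divergence, averaged over the batch.
    kl_per_row = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=1)
    return kl_per_row.mean()
```

Comparing similarity matrices instead of raw vectors is a design choice of this sketch: it lets the student and teacher use different embedding widths without an extra projection layer.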

Problem

Research questions and friction points this paper is trying to address.

Overcoming text truncation and isolated encoding in CLIP

Enhancing multimodal representation learning with MLLMs

Improving discriminative and compositional capabilities in retrieval tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage MLLM framework for multimodal embedding

Textual discriminative knowledge distillation from LLM

Hard negative enhanced instruction tuning
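The hard negative step named in the last item can be sketched as follows. The abstract only says that false negatives are mitigated first and that multiple hard negatives are then sampled per instance within each batch. The specific filtering rule (drop candidates scoring above the positive plus a margin), the top-k selection, the InfoNCE-style loss, and all parameter values below are assumptions made for illustration.

```python
import numpy as np

def hard_negative_contrastive_loss(query, cand, k=2, margin=0.1, temperature=0.05):
    """InfoNCE-style loss over the positive and the k hardest in-batch negatives.

    query, cand: (B, D) arrays; cand[i] is the positive candidate for query[i].
    k, margin, and temperature are hypothetical values, not from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    sim = q @ c.T                       # (B, B) query-to-candidate similarities
    pos = np.diag(sim)                  # similarity of each query to its positive
    is_positive = np.eye(sim.shape[0], dtype=bool)

    # Step 1 (assumed rule): a candidate that scores above the positive by more
    # than `margin` is treated as a suspected false negative and excluded.
    suspected_false = sim > (pos[:, None] + margin)
    neg_scores = np.where(is_positive | suspected_false, -np.inf, sim)

    # Step 2: keep the k highest-scoring remaining negatives per query.
    # Excluded entries are -inf, so they sort last and add nothing to the loss.
    hard = -np.sort(-neg_scores, axis=1)[:, :k]

    # Cross-entropy with the positive in column 0, computed stably.
    logits = np.concatenate([pos[:, None], hard], axis=1) / temperature
    row_max = logits.max(axis=1, keepdims=True)
    log_z = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return (log_z - logits[:, 0]).mean()
```

For example, calling `hard_negative_contrastive_loss(q, c, k=2)` on a batch of 4 pairs scores each query against its positive and at most 2 surviving negatives, instead of against all 3 other candidates in the batch.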

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs