LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LMM-based embedding models trained with the standard InfoNCE loss show heavily overlapping similarity distributions between positive and negative pairs, so hard negatives are poorly discriminated. To address this, the authors propose a hardness-weighted contrastive learning framework that dynamically scales each negative pair's contribution to the loss according to its discriminative difficulty, focusing training on hard negatives. Models trained under this framework, named LLaVE, are evaluated on the MMEB benchmark (4 meta-tasks, 36 datasets) and achieve state-of-the-art performance: LLaVE-2B surpasses the previous SOTA 7B models, and LLaVE-7B improves on them by a further 6.2 points. Although trained only on image-text data, LLaVE also transfers zero-shot to text-video retrieval with strong results, suggesting broad applicability to other embedding tasks.

📝 Abstract
Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in similarity distribution between positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To deal with this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
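The abstract describes dynamically strengthening representation learning for negative pairs based on their discriminative difficulty. The exact weighting function used by LLaVE is not given here; the following is an illustrative stdlib-only sketch of a hardness-weighted InfoNCE loss in which each negative's term is scaled by a weight that grows with its similarity to the query (harder negatives count more). The exponential weighting and the `alpha` hyperparameter are assumptions for illustration, not the paper's formula; `alpha=0` recovers the standard InfoNCE loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hardness_weighted_infonce(anchor, positive, negatives, tau=0.07, alpha=1.0):
    """InfoNCE where each negative is re-weighted by its hardness.

    A negative whose similarity to the anchor is high (a "hard" negative)
    receives a weight above 1, so it contributes more to the denominator;
    easy negatives are down-weighted. Weights are normalized to mean 1,
    so alpha=0 reduces to the standard InfoNCE loss.
    """
    s_pos = cosine(anchor, positive)
    s_negs = [cosine(anchor, n) for n in negatives]
    # Hardness weights: exponential in similarity, normalized to mean 1.
    raw = [math.exp(alpha * s) for s in s_negs]
    mean = sum(raw) / len(raw)
    weights = [r / mean for r in raw]
    num = math.exp(s_pos / tau)
    den = num + sum(w * math.exp(s / tau) for w, s in zip(weights, s_negs))
    return -math.log(num / den)
```

With a mixed batch of one hard and one easy negative, raising `alpha` shifts mass onto the hard negative's term, so the loss (and hence the gradient pressure to separate it from the positive) increases relative to plain InfoNCE.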
Problem

Research questions and friction points this paper is trying to address.

Universal multimodal embeddings are needed for interleaved image-text retrieval, multimodal RAG, and clustering.
InfoNCE-trained LMM embedding models struggle to distinguish hard negative pairs because positive and negative similarity distributions overlap heavily.
Existing embedding models lack scalability and efficiency across universal multimodal tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardness-weighted contrastive learning for multimodal embeddings
Dynamic weighting of negative pairs by their discriminative difficulty
Zero-shot generalization from image-text training to text-video retrieval