From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from a mismatch between their generative training and discriminative representation learning; conventional contrastive pre-training is computationally expensive and neglects MLLMs' instruction-following capabilities. Method: a contrastive-pre-training-free embedding paradigm comprising (1) a hierarchical embedding prompt template that repurposes generative MLLMs as discriminative embedding encoders, and (2) a self-aware hard negative sampling and filtering mechanism that leverages the model's own understanding to mine high-quality negatives during fine-tuning. Contribution/Results: on the MMEB benchmark, the zero-shot embeddings are competitive with contrastively pre-trained baselines, and the hierarchical prompt lifts a simple in-batch-negative fine-tuning baseline by 4.8 points; with self-aware hard negative sampling, the method reaches state-of-the-art performance without contrastive pre-training, offering an efficient and generalizable route to cross-modal representation learning.

📝 Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process, lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.
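The two-level prompt idea described in the abstract can be illustrated with a minimal sketch. The instruction wording and function name below are hypothetical, not the paper's exact template: a task-level instruction frames the MLLM as an embedding encoder, an instance-level instruction scopes the concrete input, and the embedding would then be read from the model's hidden state for the final token.

```python
def build_embedding_prompt(task_instruction: str, instance_text: str) -> str:
    """Compose a hierarchical two-level embedding prompt.

    The task-level instruction frames the generative MLLM as a
    discriminative encoder; the instance-level instruction scopes the
    specific input. Wording here is illustrative only.
    """
    task_level = (
        "You are an embedding model. Compress the following input into a "
        f"single representation for the task: {task_instruction}."
    )
    instance_level = f"Input: {instance_text}\nRepresent the input above:"
    return task_level + "\n" + instance_level


# In practice, the embedding would be taken from the hidden state of the
# last token the MLLM produces for this prompt -- no contrastive
# pre-training required for the zero-shot setting.
prompt = build_embedding_prompt(
    "image-text retrieval", "A dog catching a frisbee in the park"
)
```

The key design point is that both instruction levels are plain text, so the same frozen generative model serves every embedding task by swapping the task-level instruction.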
Problem

Research questions and friction points this paper is trying to address.

Adapting generative MLLMs for discriminative representation learning
Overcoming inefficiencies in large-scale contrastive pre-training
Enhancing zero-shot performance without contrastive pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical embedding prompt template for discriminative representations
Self-aware hard negative sampling for efficient fine-tuning
Zero-shot performance competitive with contrastive training baselines
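Self-aware hard negative sampling with false-negative filtering can be sketched roughly as follows. The threshold value, function name, and interface are assumptions for illustration, not the paper's exact procedure: candidates are ranked by the model's own similarity scores, candidates too similar to the query are discarded as likely false negatives, and the top-k of the remainder are kept as hard negatives.

```python
import numpy as np


def sample_hard_negatives(query_emb, candidate_embs, positive_idx,
                          k=4, false_neg_threshold=0.95):
    """Select the k hardest negatives for a query embedding.

    Candidates whose cosine similarity to the query exceeds
    `false_neg_threshold` are filtered as likely false negatives.
    Threshold and interface are illustrative assumptions.
    """
    # Cosine similarity between the query and every candidate.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ q

    order = np.argsort(-sims)  # most similar (hardest) first
    hard_negatives = []
    for idx in order:
        if idx == positive_idx:
            continue  # skip the true positive
        if sims[idx] >= false_neg_threshold:
            continue  # near-duplicate of the positive: likely false negative
        hard_negatives.append(int(idx))
        if len(hard_negatives) == k:
            break
    return hard_negatives
```

Because the ranking reuses the model's own zero-shot similarity scores, no separate teacher or contrastively pre-trained retriever is needed to mine the negatives.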
Yeong-Joon Ju
Korea University
Computer Vision · Natural Language Processing · XAI
Seong-Whan Lee
Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea