🤖 AI Summary
Existing discriminative multimodal embedding models struggle to leverage the reasoning capabilities of large language models. To address this, we propose a novel generative multimodal embedding paradigm and introduce the UME-R1 framework: (1) a supervised fine-tuning stage that aligns multimodal representations, followed by (2) a reinforcement learning stage augmented with inference-time repeated sampling, jointly optimizing discriminative and generative objectives. We are the first to uncover the complementary mechanism between generative and discriminative embeddings, enabling reasoning-driven embedding learning. Evaluated on the MMEB-V2 benchmark across 78 diverse tasks, UME-R1 comprehensively outperforms discriminative baselines, achieving substantial gains in downstream task coverage, cross-task generalization, and embedding interpretability. Our approach establishes a unified solution for multimodal embedding that seamlessly integrates strong reasoning capabilities with generative flexibility.
📝 Abstract
The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework built on a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
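The abstract's fourth insight, that repeated sampling at inference boosts downstream task coverage (pass@k), can be illustrated with a minimal sketch. This is not the paper's implementation; `pass_at_k` and all tensor shapes are illustrative assumptions: a retrieval query counts as covered if any of the k independently sampled generative embeddings ranks the gold candidate first under cosine similarity.

```python
import numpy as np

def pass_at_k(query_samples: np.ndarray, candidates: np.ndarray, gold_idx: int) -> bool:
    """Illustrative pass@k check for one retrieval query (not the paper's code).

    query_samples: (k, d) embeddings from k sampled generations for the same query.
    candidates:    (n, d) candidate-pool embeddings.
    gold_idx:      index of the correct candidate.
    Returns True if ANY of the k samples ranks the gold candidate top-1
    under cosine similarity, i.e. the query counts toward pass@k coverage.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_samples / np.linalg.norm(query_samples, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T                      # (k, n) similarity matrix
    top1 = sims.argmax(axis=1)          # top-ranked candidate per sample
    return bool((top1 == gold_idx).any())
```

Averaging this indicator over all queries gives the pass@k coverage curve; as k grows, coverage can only increase, which is the inference-time scalability the abstract refers to.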