U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

📅 2025-07-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM)-based universal multimodal retrieval (UMR) methods predominantly rely on contrastive learning, yet their embedding learning mechanisms remain poorly understood, resulting in limited generalization and suboptimal performance. This paper identifies several underexplored training factors—progressive transition, hard negative mining, and re-ranker distillation—as critical determinants of embedding quality. The authors propose a unified MLLM-driven UMR framework that integrates contrastive learning with progressive transition training, hard negative mining, and re-ranker distillation. On the M-BEIR benchmark, the method significantly outperforms state-of-the-art approaches in supervised settings. Moreover, it demonstrates strong zero-shot transfer on challenging tasks—including composed image retrieval and text-to-video retrieval—validating its robustness and generalizability across diverse multimodal retrieval scenarios.
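Of the three factors above, re-ranker distillation is the least self-explanatory: a cross-encoder re-ranker scores query–candidate pairs, and the embedding model is trained so that its similarity distribution matches the re-ranker's score distribution. The paper does not spell out a loss in this summary, so the following is only a minimal sketch of one common formulation (softmax-KL distillation with a temperature `tau`); the function and tensor names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def rerank_distill_loss(student_sims, teacher_scores, tau=1.0):
    """KL divergence pushing the embedding model's (student) per-query
    similarity distribution toward the re-ranker's (teacher) score
    distribution over a shared list of candidates.

    student_sims:   (B, K) embedding similarities for K candidates per query
    teacher_scores: (B, K) re-ranker relevance scores for the same candidates
    """
    p_teacher = F.softmax(teacher_scores / tau, dim=-1)
    log_p_student = F.log_softmax(student_sims / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# toy usage with random stand-in scores
torch.manual_seed(0)
student = torch.randn(4, 10)   # similarities to 10 candidates
teacher = torch.randn(4, 10)   # re-ranker scores for the same 10
loss = rerank_distill_loss(student, teacher)
```

When the student's distribution exactly matches the teacher's, the loss is zero; otherwise it is positive, so minimizing it transfers the re-ranker's finer-grained ranking signal into the embedding space.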

📝 Abstract
Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore key details of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (**U**niversal **M**ultimod**A**l **R**etrie**V**al via **E**mbedding **L**earning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exhibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
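The "general MLLM-based embedding learning pipeline" in the abstract is built on contrastive learning: query and candidate embeddings from the MLLM are pulled together for matched pairs and pushed apart otherwise, typically with an in-batch InfoNCE loss. As a minimal sketch (assuming L2-normalized embeddings and a temperature of 0.05; the random tensors stand in for actual MLLM outputs, and the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, c, temperature=0.05):
    """In-batch InfoNCE contrastive loss.

    q: (B, D) query embeddings, c: (B, D) candidate embeddings.
    Matched pairs sit on the diagonal of the similarity matrix;
    every other candidate in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    c = F.normalize(c, dim=-1)
    logits = q @ c.T / temperature              # (B, B) cosine similarities
    labels = torch.arange(q.size(0))            # positive = diagonal entry
    return F.cross_entropy(logits, labels)

# toy usage: random "embeddings" standing in for MLLM outputs
torch.manual_seed(0)
q = torch.randn(8, 32)
loss = info_nce_loss(q, q.clone())  # identical pairs -> loss near zero
```

With identical query/candidate pairs the diagonal dominates and the loss collapses toward zero, which is exactly the behavior the training objective rewards.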
Problem

Research questions and friction points this paper is trying to address.

Identify key factors for effective multimodal retrieval embedding learning
Analyze overlooked aspects in embedding generation and training strategies
Propose a unified framework to enhance retrieval performance and generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive transition in embedding generation
Hard negative mining for training
Re-ranker distillation strategy