MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 3 (Influential: 2)
🤖 AI Summary
This work addresses two limitations of existing retrieval models: support for only a single modality and rigidity to a fixed retrieval task. It proposes a general-purpose multimodal retrieval framework that handles text-only, image-only, and interleaved text-image queries while unifying diverse retrieval tasks. Methodologically: (1) modality-aware hard negative mining is introduced to mitigate the modality bias exhibited by multimodal large language models (MLLMs); (2) a continual fine-tuning strategy is designed to strengthen pure-text retrieval while preserving multimodal retrieval capability; and (3) prompt-based zero-shot reranking is explored to improve robustness on complex queries. The system fine-tunes an MLLM as a bi-encoder retriever. Evaluated on the M-BEIR multimodal retrieval benchmark, it achieves state-of-the-art performance; it also surpasses NV-Embed-v1 on the MTEB text retrieval benchmark, with substantial accuracy gains on interleaved text-image queries.
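The first two components above can be illustrated together. The sketch below is not the paper's implementation: it shows the general idea of (a) keeping only hard negatives whose modality matches the ground-truth target, so a biased retriever is penalized for ranking same-modality distractors above the answer, and (b) scoring them with a standard InfoNCE contrastive loss. The tuple layout `(embedding, modality, score)` and the hyperparameters (`k`, `tau`) are illustrative assumptions.

```python
import numpy as np

def modality_aware_hard_negatives(target_modality, candidates, k=2):
    """Sketch of modality-aware hard negative mining (simplified from the
    paper): from retriever candidates, keep the k highest-scoring ones whose
    modality matches the modality of the ground-truth target.
    candidates: list of (embedding, modality, retrieval_score) tuples."""
    same = [c for c in candidates if c[1] == target_modality]
    same.sort(key=lambda c: -c[2])  # highest-scoring distractors are hardest
    return [c[0] for c in same[:k]]

def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one query with the positive at index 0.
    All inputs are embedding vectors; tau is an assumed temperature."""
    q = query / np.linalg.norm(query)
    cands = np.vstack([positive] + list(negatives))
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    logits = cands @ q / tau
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

For a text-to-image task, `target_modality` would be `"image"`, so the mined negatives are visually similar images rather than easy text distractors.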

📝 Abstract
State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but it underperforms compared to a smaller CLIP retriever in cross-modal retrieval tasks due to the modality bias exhibited by MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continuously fine-tuning the universal multimodal retriever to enhance its text retrieval capability while preserving multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. We also explore prompting the off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that, through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for advancing universal multimodal retrieval in the future.
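The zero-shot reranking step described in the abstract can be sketched as pointwise prompting: each retrieved candidate is judged for relevance by the MLLM, and candidates are re-ordered by that judgment. The prompt template and the `score_fn` stub below are illustrative assumptions, not the paper's exact prompt or model call (in practice the score might be the model's probability of answering "Yes").

```python
def build_rerank_prompt(query_text, candidate_text):
    """Hypothetical pointwise reranking prompt; the paper's wording differs."""
    return (
        "Given a query and a retrieved candidate, answer Yes if the "
        "candidate satisfies the query, otherwise No.\n"
        f"Query: {query_text}\n"
        f"Candidate: {candidate_text}\n"
        "Answer:"
    )

def rerank(candidates, score_fn):
    """Re-order retriever candidates by descending MLLM relevance score.
    score_fn stands in for calling the model on build_rerank_prompt(...)."""
    return sorted(candidates, key=lambda c: -score_fn(c))
```
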
Problem

Research questions and friction points this paper is trying to address.

Advancing multimodal retrieval with MLLMs
Mitigating modality bias in retrieval tasks
Enhancing text and multimodal retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM fine-tuning for multimodal retrieval
Modality-aware hard negative mining
Continuous fine-tuning for universal retrieval
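One common way to realize continual fine-tuning that preserves both capabilities is to mix text-only and multimodal training examples within each batch. The generator below is a sketch under assumed details (the sampling ratio and batch construction are not specified by this summary):

```python
import random

def mixed_batches(text_examples, multimodal_examples,
                  batch_size, text_ratio=0.5, seed=0):
    """Sketch of continual fine-tuning data mixing: sample each batch from
    a text-only pool and a multimodal pool so the retriever keeps its text
    retrieval skill while learning multimodal retrieval. text_ratio is an
    assumed hyperparameter, not a value from the paper."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = (text_examples if rng.random() < text_ratio
                    else multimodal_examples)
            batch.append(rng.choice(pool))
        yield batch
```
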