Think Then Embed: Generative Context Improves Multimodal Embedding

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing universal multimodal embedding (UME) models treat multimodal large language models (MLLMs) solely as encoders, neglecting their generative capabilities; as a result, they struggle with complex instructions and compositional reasoning. Method: The Think-Then-Embed framework introduces a "reason-then-embed" paradigm that explicitly decouples reasoning from embedding: a reasoner MLLM first generates chain-of-thought (CoT) traces that explain the query, and an embedder then produces a representation conditioned on both the original query and these traces, enriching the semantic representation. The authors design lightweight yet capable reasoning and embedding modules, explore integrating them into a unified architecture, and adopt efficient fine-tuning of smaller models on high-quality, embedding-centric reasoning traces. Results: On the MMEB-V2 benchmark, the approach surpasses proprietary models and achieves a 7-percentage-point absolute gain over prior open-source methods, demonstrating state-of-the-art performance with strong potential for efficient inference.

📝 Abstract
There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
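The two-stage pipeline described in the abstract (a reasoner that generates an explanatory trace, followed by an embedder conditioned on both the query and the trace) can be illustrated with a minimal sketch. This is not the paper's implementation: `reason` and `embed` are hypothetical stand-ins (a template expansion and a hash-based vector) that only show the data flow, i.e. that the embedding is a function of the query plus the generated reasoning rather than the query alone.

```python
import hashlib
import math

def reason(query: str) -> str:
    # Stand-in for the reasoner MLLM: in TTE this would generate a
    # chain-of-thought trace explaining the (possibly multimodal) query.
    return f"The query asks about: {query}. Key aspects: " + ", ".join(query.lower().split())

def embed(query: str, trace: str, dim: int = 8) -> list:
    # Stand-in for the embedder MLLM: produce a representation conditioned
    # on BOTH the original query and the intermediate reasoning trace.
    digest = hashlib.sha256((query + "\n" + trace).encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # L2-normalized embedding

def think_then_embed(query: str) -> list:
    # Think-Then-Embed: reason first, then embed conditioned on the trace.
    return embed(query, reason(query))

vec = think_then_embed("a red cube to the left of a blue sphere")
```

The point of the structure is that swapping in a stronger reasoner changes the trace, and hence the embedding, without retraining the embedder interface; conditioning on an empty trace reduces to the plain encoder-only paradigm the paper argues against.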
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal embeddings for complex compositional reasoning tasks
Addressing limitations of encoder-only approaches in multimodal models
Enhancing understanding of complex multimodal instructions through explicit reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative reasoning traces enhance multimodal embedding
Fine-tuned smaller reasoner achieves top open-source performance
Integrated reasoner-embedder model improves efficiency without loss