Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

📅 2026-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a reasoning-driven framework for Universal Multimodal Embeddings (UME) that addresses a limitation of existing generative multimodal embedding methods: their text-based reasoning chains are irrelevant to retrieval, so they struggle with fine-grained cross-modal alignment. The framework uses Embedder-Guided Reinforcement Learning (EG-RL) to optimize a multimodal large language model to generate traceable, retrieval-oriented chains of thought (T-CoT), aligning the reasoning process with the embedding objective. Evaluated on the MMEB-V2 and UVRB benchmarks, the method outperforms current state-of-the-art models under limited computational resources, showing stronger cross-modal semantic consistency, improved fine-grained matching, and better generalization in complex scenarios.

📝 Abstract
Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the generated reasoning CoTs of existing generative embedding methods are limited to the textual analysis of queries and are irrelevant to the retrieval of the targets. To address these limitations, we propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework where the Embedder provides explicit supervision to the Reasoner, ensuring the generated CoT traces are aligned with embedding tasks. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and provides multimodal inputs for the Embedder. (3) With limited computational resources, our framework outperforms the pioneering embedding model on both MMEB-V2 and UVRB benchmarks. The integration of multimodal evidence in structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the fine-grained matching capability of the model as well as the generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
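The abstract states that the Embedder provides explicit supervision to the Reasoner but does not give the reward formula. A minimal sketch, assuming (this is an assumption, not the paper's published method) that the Embedder scores each generated T-CoT by how well the CoT-conditioned query embedding retrieves the correct target among in-batch negatives, InfoNCE-style:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def embedder_guided_reward(cot_embedding, target_embedding,
                           negative_embeddings, tau=0.07):
    """Hypothetical EG-RL reward (illustrative only): the softmax
    probability of retrieving the correct target given the CoT-conditioned
    query embedding, contrasted against in-batch negatives.

    A higher reward means the Reasoner's chain-of-thought steered the
    Embedder toward the retrieval target, which is the alignment signal
    the paper describes at a high level.
    """
    pos = cosine(cot_embedding, target_embedding) / tau
    negs = [cosine(cot_embedding, n) / tau for n in negative_embeddings]
    logits = np.array([pos] + negs)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])                      # reward in (0, 1]
```

Under this sketch, a CoT whose resulting embedding points at the target earns a reward near 1, while a reasoning chain that drifts toward a distractor earns a reward near 0, giving the policy-gradient update a retrieval-oriented signal.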
Problem

Research questions and friction points this paper is trying to address.

Multimodal Embeddings
Chain-of-Thought Reasoning
Cross-modal Retrieval
Reasoning Alignment
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Multimodal Embeddings
Chain-of-Thought Reasoning
Traceability CoT
Embedder-Guided RL
👥 Authors
Haonan Jiang — Tsinghua Shenzhen International Graduate School, Tsinghua University; Kling Team, Kuaishou Technology
Yuji Wang — Tsinghua University
Yongjie Zhu — Kling Team, Kuaishou Technology
Xin Lu — TikTok/Bytedance
Wenyu Qin — Harbin Institute of Technology
Meng Wang — Kling Team, Kuaishou Technology
Pengfei Wan — Head of Kling Video Generation Models, Kuaishou Technology
Yansong Tang — Tsinghua Shenzhen International Graduate School, Tsinghua University