PLUME: Latent Reasoning Based Universal Multimodal Embedding

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and information bottlenecks associated with explicit chain-of-thought (CoT) reasoning in general-purpose multimodal embeddings. To overcome these limitations, the authors propose PLUME, a framework that replaces explicit CoT with a short autoregressive expansion of continuous latent states, enabling efficient reasoning within a fixed computational budget. PLUME incorporates semantic-anchor-guided transition adapters and a progressive training curriculum that shifts from explicit to implicit reasoning, thereby establishing a structured implicit inference mechanism. Experimental results demonstrate that PLUME outperforms existing explicit CoT methods on the MMEB-v2 benchmark, compressing reasoning steps from hundreds of tokens to fewer than ten and achieving over a 30-fold speedup.
📝 Abstract
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
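To make the core idea concrete, here is a minimal, purely illustrative sketch of an autoregressive latent rollout with an anchor-conditioned transition step. Every name and dimension here (`latent_rollout`, `select_anchor`, `D`, `K`, `anchor_shifts`, the `tanh` transition) is a hypothetical stand-in, not the paper's actual architecture: the point is only that a few continuous state updates, steered by a selected semantic anchor, can replace hundreds of generated CoT tokens before pooling a final embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16         # hidden dimension (illustrative)
K = 4          # number of latent reasoning steps (paper reports fewer than 10)
N_ANCHORS = 3  # hypothetical number of semantic anchors

# Hypothetical "learned" parameters: a shared transition matrix plus
# per-anchor offsets that steer the rollout along different trajectories
# under the same fixed computation budget.
W = rng.standard_normal((D, D)) / np.sqrt(D)
anchor_shifts = rng.standard_normal((N_ANCHORS, D)) * 0.1

def select_anchor(h, anchors):
    """Pick the semantic anchor most aligned with the current latent state."""
    scores = anchors @ h
    return int(np.argmax(scores))

def latent_rollout(h0, steps=K):
    """Autoregressively expand continuous latent states instead of
    generating explicit CoT tokens; each step applies the transition
    adapter conditioned on the selected semantic anchor."""
    h = h0
    trajectory = [h]
    for _ in range(steps):
        a = select_anchor(h, anchor_shifts)
        h = np.tanh(W @ h + anchor_shifts[a])  # one adapter transition step
        trajectory.append(h)
    return h, trajectory

query_state = rng.standard_normal(D)       # stand-in for the MLLM query state
embedding, traj = latent_rollout(query_state)
# In the real system the final latent state would be pooled/projected
# into the shared retrieval space.
```

Because the rollout length `K` is fixed and small, inference cost is bounded regardless of query complexity, which is the source of the reported speedup over token-by-token CoT generation.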
Problem

Research questions and friction points this paper is trying to address.

universal multimodal embedding
chain-of-thought reasoning
multimodal retrieval
inference overhead
latent reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent reasoning
universal multimodal embedding
chain-of-thought
autoregressive rollout
semantic-anchor-guided adapter
👥 Authors
Chenwei He
Southeast University
Xiangzhao Hao
Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models, Reinforcement Learning, Multimodal Retrieval
Tianyu Yang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yuxiang Ma
Southeast University
Yuheng Jia
Southeast University
Lingxiang Wu
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Chaoyang Zhao
Institute of Automation, Chinese Academy of Sciences
computer vision
Haiyun Guo
Rice University ECE Ph.D.
optical imaging, computational photography, Metalens
Jinqiao Wang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences