Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

📅 2026-05-14

🤖 AI Summary

This work addresses the computational redundancy and potential performance degradation of existing chain-of-thought (CoT)-based multimodal embedding methods, which apply CoT reasoning uniformly to every sample regardless of need. The authors propose a unified embedding framework with adaptive reasoning that freezes the backbone network and fine-tunes two LoRA adapters, one for reasoning and one for embedding. A self-supervised routing gate activates CoT reasoning only when needed, and embedding-guided reinforcement learning further improves CoT quality. Evaluated on the 78 tasks of the MMEB-V2 benchmark, the method achieves state-of-the-art performance with only 3–5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens than the full generative mode.
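The parameter overhead of two LoRA adapters on a frozen backbone can be estimated with simple arithmetic. The sketch below assumes the standard LoRA parameterization, in which a weight matrix of shape (d_out, d_in) receives a low-rank update built from factors of shape (d_out, r) and (r, d_in). The layer shapes, ranks, and backbone size are toy values chosen for easy arithmetic, and the function names are hypothetical; none of this is the paper's actual configuration.

```python
from typing import List, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA update: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_in + d_out)


def dual_lora_overhead(
    layer_shapes: List[Tuple[int, int]],
    rank_reason: int,
    rank_embed: int,
    backbone_params: int,
) -> float:
    """Fraction of extra parameters added by a reasoning adapter plus an embedding adapter."""
    reason = sum(lora_param_count(i, o, rank_reason) for i, o in layer_shapes)
    embed = sum(lora_param_count(i, o, rank_embed) for i, o in layer_shapes)
    return (reason + embed) / backbone_params


# Toy backbone (assumption): four 1024 x 1024 projection matrices, nothing else.
toy_shapes = [(1024, 1024)] * 4
toy_backbone_params = sum(i * o for i, o in toy_shapes)

overhead = dual_lora_overhead(
    toy_shapes, rank_reason=8, rank_embed=8, backbone_params=toy_backbone_params
)
print(f"extra parameters: {overhead:.2%}")  # prints "extra parameters: 3.12%"
```

Because both adapters share one frozen backbone, the total stays close to a single model; a design with a separate reasoner and embedder would instead roughly double the parameter count.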

📝 Abstract

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference. They typically employ a separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate the gradient conflicts introduced by joint optimization while keeping the parameter count close to that of a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3–5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.
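The per-input routing described in the abstract can be sketched as a simple control flow: a gate scores the input, and CoT is generated only when the score crosses a threshold. This is a minimal illustration, not the authors' implementation. The abstract does not say how the gate is trained or thresholded, so the 0.5 threshold, the function names, and the toy stand-ins for the gate, reasoner, and embedder are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutedResult:
    used_cot: bool
    reasoning_tokens: int
    embedding: List[float]


def adaptive_embed(
    text: str,
    gate_score: Callable[[str], float],
    generate_cot: Callable[[str], List[str]],
    embed: Callable[[str, List[str]], List[float]],
    threshold: float = 0.5,  # assumed value, not from the paper
) -> RoutedResult:
    """Generate CoT only when the gate asks for it; otherwise embed directly."""
    if gate_score(text) >= threshold:
        cot = generate_cot(text)
        return RoutedResult(True, len(cot), embed(text, cot))
    return RoutedResult(False, 0, embed(text, []))


# Toy stand-ins (hypothetical): long queries count as "hard" and get reasoning.
def toy_gate(text: str) -> float:
    return 1.0 if len(text.split()) > 5 else 0.0


def toy_cot(text: str) -> List[str]:
    return ["think"] * len(text.split())


def toy_embed(text: str, cot: List[str]) -> List[float]:
    return [float(len(text.split())), float(len(cot))]


simple = adaptive_embed("a red car", toy_gate, toy_cot, toy_embed)
hard = adaptive_embed(
    "which object is left of the person holding the umbrella",
    toy_gate, toy_cot, toy_embed,
)
print(simple.used_cot, simple.reasoning_tokens)  # prints "False 0"
print(hard.used_cot, hard.reasoning_tokens)      # prints "True 10"
```

In this toy run the simple query skips reasoning entirely, which is where the token savings over an always-reason pipeline come from.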

Problem

Research questions and friction points this paper is trying to address.

multimodal embeddings

chain-of-thought reasoning

inference efficiency

redundant reasoning

parameter overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive reasoning

dual-LoRA architecture

multimodal embeddings