Your Embedding Model is SMARTer Than You Think

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations in multimodal retrieval: single-vector retrievers struggle to preserve fine-grained local information, while existing multi-vector approaches typically require dedicated training and often lack a global summary representation. To overcome these challenges, the authors propose SMART, a plug-and-play framework that activates the latent multi-vector capabilities of frozen single-vector models at inference time, without any additional training. SMART applies late interaction over the hidden states that precede the pooled embedding, so retrieval can draw on token-level local evidence in addition to the global representation. The authors report that this inference-time upgrade improves even state-of-the-art models on the MMEB-V2 benchmark, and that with lightweight post-training a single-vector model outperforms state-of-the-art multi-vector counterparts on visual document retrieval while saving time and compute.

📝 Abstract

Multimodal retrieval relies heavily on single-vector retrievers, which compress rich token sequences into a single global representation. While efficient, they discard fine-grained, local evidence critical for dense retrieval tasks. Multi-vector approaches were introduced as a solution, but they strictly require training, and many of them ignore the necessity of a globally summarizing representation. To address this, we introduce SMART, a framework that unlocks the latent multi-vector capabilities of standard single-vector models. We first demonstrate that standard contrastive training on the pooled embedding implicitly shapes the retrieval geometry of preceding hidden states via gradient flow. By applying direct late interaction over these frozen hidden states during inference, SMART acts as a plug-and-play upgrade that consistently improves performance across diverse modalities, further improving even state-of-the-art models on MMEB-V2. We further show that simple, lightweight post-training with SMART not only saves time and compute but also yields additional improvement on visual document retrieval, allowing a single-vector model to outperform SoTA multi-vector counterparts. Ultimately, SMART offers both a highly efficient inference enhancement and a powerful finetuning technique for multimodal retrieval. We open-source our code and weights at https://github.com/HanSolo9682/SMART.
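To make the idea of late interaction over hidden states concrete, the following is a minimal sketch, not the authors' implementation. It assumes a ColBERT-style MaxSim score over per-token hidden states and a simple weighted sum with the pooled-embedding cosine score. The function names, the `alpha` weight, and the fusion rule itself are hypothetical choices made for illustration; the abstract does not specify how SMART combines the two signals.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def maxsim_score(query_tokens, doc_tokens):
    """MaxSim late interaction: for each query token, take the best cosine
    match among the document tokens, then sum over query tokens.

    query_tokens: (n_q, dim) per-token hidden states of the query
    doc_tokens:   (n_d, dim) per-token hidden states of the document
    """
    q = l2_normalize(query_tokens)
    d = l2_normalize(doc_tokens)
    sim = q @ d.T                      # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())


def global_score(query_vec, doc_vec):
    """Cosine similarity between the pooled (single-vector) embeddings."""
    return float(l2_normalize(query_vec) @ l2_normalize(doc_vec))


def fused_score(query_tokens, query_vec, doc_tokens, doc_vec, alpha=0.5):
    """ASSUMPTION: a convex combination of the global and local scores.
    MaxSim is averaged over query tokens so both terms lie in [-1, 1].
    This fusion rule is illustrative only, not taken from the paper."""
    local = maxsim_score(query_tokens, doc_tokens) / query_tokens.shape[0]
    return alpha * global_score(query_vec, doc_vec) + (1.0 - alpha) * local


if __name__ == "__main__":
    # Random arrays stand in for hidden states of a frozen embedding model.
    rng = np.random.default_rng(0)
    q_tok, d_tok = rng.normal(size=(4, 8)), rng.normal(size=(10, 8))
    q_vec, d_vec = q_tok.mean(axis=0), d_tok.mean(axis=0)  # stand-in pooling
    print(fused_score(q_tok, q_vec, d_tok, d_vec))
```

In a real setting, the token arrays would be the frozen model's hidden states and the pooled vectors would be its usual output embedding. Setting `alpha=1.0` recovers plain single-vector retrieval, which makes it easy to compare the two modes on the same index.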

Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval

single-vector retrievers

multi-vector approaches

dense retrieval

embedding models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-vector retrieval

late-interaction

contrastive training