FreeRet: MLLMs as Training-Free Retrievers

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current multimodal large language models (MLLMs) typically require extensive post-training before they can support hybrid-modal retrieval. This paper proposes FreeRet, a framework that enables out-of-the-box, MLLM-based retrieval without any fine-tuning. FreeRet decomposes retrieval into two stages, semantic embedding for fast candidate search and inference-driven reranking, achieved by bypassing lexical alignment layers, conditioning representations on explicit priors, and applying neutral choice framing during reranking. The method is model-agnostic, generalizes across modalities, and works with MLLMs of varying parameter scales, while natively supporting end-to-end retrieval-augmented generation. Evaluated across the 46 datasets of MMEB and MMEB-V2, FreeRet consistently surpasses models fine-tuned on millions of pairs. It significantly lowers deployment barriers and establishes a new paradigm for efficient, flexible, training-free multimodal retrieval.

πŸ“ Abstract
Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
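The two-stage design described above (embedding-based candidate search followed by reasoning-based reranking) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `embed` is a deterministic toy stand-in for pooling an MLLM hidden state, and `overlap_judge` is a hypothetical scoring callable standing in for the MLLM's relevance reasoning.

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for stage-1 embedding; FreeRet would instead pool a
    pre-alignment hidden state of the MLLM."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)


def stage1_candidates(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Fast candidate search: rank the corpus by cosine similarity to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: float(q @ embed(doc)), reverse=True)[:k]


def stage2_rerank(query: str, candidates: list[str], judge) -> list[str]:
    """Precise reranking: `judge(query, doc)` returns a relevance score;
    in FreeRet this role is played by the same MLLM's reasoning."""
    return sorted(candidates, key=lambda d: judge(query, d), reverse=True)


def overlap_judge(query: str, doc: str) -> float:
    # Hypothetical word-overlap scorer, used only to make this sketch runnable.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


corpus = [
    "a photo of a red bicycle",
    "quarterly financial report",
    "a red bicycle leaning on a wall",
    "recipe for tomato soup",
]
cands = stage1_candidates("red bicycle", corpus, k=3)
ranked = stage2_rerank("red bicycle", cands, overlap_judge)
```

The point of the split is efficiency: the cheap embedding pass narrows the pool, and the expensive reasoning pass only touches the top-k survivors.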
Problem

Research questions and friction points this paper is trying to address.

Enabling MLLMs as retrievers without training
Deriving semantic embeddings directly from MLLMs
Unifying retrieval and reranking in single framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free MLLM framework for retrieval
Two-stage embedding and reasoning process
Model-agnostic end-to-end RAG unification
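One of the contributions above, mitigating the framing effect in reranking, amounts to posing the relevance decision with symmetric, unloaded options rather than a biased yes/no question. A minimal sketch of such a prompt builder follows; the wording and option labels are illustrative assumptions, not FreeRet's actual template.

```python
def neutral_rerank_prompt(query: str, doc_a: str, doc_b: str) -> str:
    """Build a pairwise relevance prompt with neutral choice framing:
    both candidates are presented symmetrically so that neither the
    ordering nor the option wording nudges the model toward a default
    answer. Illustrative template only."""
    return (
        f"Query: {query}\n"
        f"Candidate A: {doc_a}\n"
        f"Candidate B: {doc_b}\n"
        "Which candidate is more relevant to the query? "
        "Answer with exactly one letter: A or B."
    )


prompt = neutral_rerank_prompt(
    "red bicycle",
    "a red bicycle leaning on a wall",
    "quarterly financial report",
)
```

In practice a reranker would also average over both candidate orderings (A/B and B/A) to cancel any residual position bias, a common hedge when using LLMs as pairwise judges.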
πŸ”Ž Similar Papers
No similar papers found.