🤖 AI Summary
This work addresses the challenge of efficiently adapting multimodal large language models (MLLMs) for cross-modal retrieval without disrupting their pre-trained semantic space. The authors propose SLQ, a framework that appends a small set of Shared Latent Queries to the end of both text and image token sequences. These queries use the native causal attention of a frozen MLLM to aggregate global features into a unified embedding space, turning the frozen model into a retriever while leaving the backbone unchanged. Because no backbone parameters are updated, SLQ avoids the knowledge distortion that full fine-tuning or LoRA may cause. The study also introduces KARR-Bench, a new benchmark for knowledge-aware reasoning retrieval. Experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, achieves competitive performance on MMEB, and yields substantial gains on KARR-Bench, supporting its effectiveness and its ability to preserve pre-trained knowledge.
📝 Abstract
Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model's native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
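The aggregation mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a single causal attention layer with identity projections stands in for the frozen MLLM, token features are random vectors, and mean-pooling the query outputs is an assumption (the abstract does not say how the query outputs are combined). The names `causal_self_attention`, `embed`, `DIM`, and `NUM_QUERIES` are hypothetical.

```python
import math
import random


def causal_self_attention(seq):
    """Toy single-head causal attention with identity projections.

    Stands in for a frozen backbone layer; it has no trainable weights.
    """
    dim = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        # Causal mask: position i attends only to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, seq[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * seq[j][d] for j, w in enumerate(weights)) / total
             for d in range(dim)]
        )
    return out


def embed(tokens, latent_queries):
    """Append the shared queries, run the frozen layer, pool the query outputs."""
    hidden = causal_self_attention(tokens + latent_queries)
    query_out = hidden[len(tokens):]  # queries sit last, so they see every token
    dim = len(tokens[0])
    # Assumption: mean-pool the query outputs, then L2-normalize.
    pooled = [sum(h[d] for h in query_out) / len(query_out) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4


def rand_seq(n):
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


# In this sketch the latent queries would be the only trainable parameters.
latent_queries = rand_seq(NUM_QUERIES)
text_tokens = rand_seq(5)    # stand-in for text token features
image_tokens = rand_seq(12)  # stand-in for image token features

# The same queries are used for both modalities, giving one embedding space.
text_emb = embed(text_tokens, latent_queries)
image_emb = embed(image_tokens, latent_queries)
similarity = sum(a * b for a, b in zip(text_emb, image_emb))
print(f"cosine similarity: {similarity:.4f}")
```

Because the queries are appended at the end, the causal mask means the original token positions never attend to them, so the backbone's outputs for the input tokens are the same with or without the queries. In this toy setup only the query positions carry the new, retrieval-oriented representation.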