Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing contrastive vision-language models (e.g., CLIP) suffer from shallow language understanding, large modality gaps, and heavy reliance on massive web-scraped data—leading to high computational costs and privacy risks. This paper proposes “Vision-Free Retrieval”: a paradigm that eliminates the visual encoder entirely and instead leverages vision-language large language models (VLLMs) to generate structured textual image descriptions, enabling a pure text-to-text, single-encoder cross-modal retrieval framework. The method supports efficient fine-tuning of small-scale language models (as small as 0.3B parameters), requiring only two GPUs for a few hours. Key contributions include: (1) the first systematic challenge to the necessity of visual encoders in cross-modal retrieval; (2) the release of two highly compositional benchmarks—subFlickr and subCOCO; and (3) state-of-the-art performance on multi-task zero-shot retrieval, with significant improvements in compositional reasoning and privacy preservation.

📝 Abstract
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP
Problem

Research questions and friction points this paper is trying to address.

Addressing shallow language understanding in vision-language models
Reducing modality gap and computational costs in retrieval
Providing privacy-friendly alternative to image-based retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-free text-to-text retrieval pipeline
Uses VLLM-generated structured image descriptions
Achieves state-of-the-art zero-shot retrieval with models as small as 0.3B parameters
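The single-encoder, text-to-text pipeline described above can be sketched in a few lines. This is a minimal illustration only: it uses a bag-of-words embedding as a stand-in for the paper's fine-tuned small language model, and the descriptions and query are invented examples, not data from the paper.

```python
import math
from collections import Counter

def embed(text, vocab):
    # Bag-of-words vector over a fixed vocabulary: a toy stand-in for
    # the learned single text encoder (e.g. a 0.3B-parameter model).
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# In the vision-free paradigm, VLLM-generated textual descriptions
# replace raw images in the retrieval index.
descriptions = [
    "a brown dog running on green grass in a park",
    "a red car parked on a city street at night",
]
query = "dog playing on the grass"

vocab = sorted({w for d in descriptions + [query] for w in d.lower().split()})
index = [embed(d, vocab) for d in descriptions]
q = embed(query, vocab)

# Text-to-text retrieval: rank stored descriptions against the query
# with the same encoder, so there is no modality gap by construction.
best = max(range(len(index)), key=lambda i: cosine(q, index[i]))
print(best)  # -> 0 (the dog scene shares the most terms with the query)
```

Because both queries and indexed items live in the same text-embedding space, there is a single encoder and no cross-modal alignment step; the paper's contribution is showing that a small fine-tuned language model in this role can match or beat dual-encoder vision-language models.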