Visualized Text-to-Image Retrieval

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal embedding methods struggle to model fine-grained visual-spatial features, limiting text-to-image (T2I) retrieval performance. To address this, we propose VisRet—a novel “visualize-then-retrieve” paradigm that first synthesizes an image from the text query and then performs retrieval in the image space, thereby circumventing inherent challenges in cross-modal alignment. Our key contributions are: (1) introducing the first visualize-then-retrieve paradigm for T2I retrieval; (2) constructing the first benchmark tailored to multi-entity, knowledge-intensive T2I retrieval scenarios; and (3) enabling plug-and-play integration of T2I generation, image embedding, cross-modal knowledge enhancement, and retrieval-augmented generation (RAG) modules across models. Evaluated on three knowledge-intensive benchmarks, VisRet achieves 24.5–32.7% gains in NDCG@10 and significantly improves visual question answering accuracy. Code and benchmark are publicly released.

📝 Abstract
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
Problem

Research questions and friction points this paper is trying to address.

Improves Text-to-Image retrieval via visual projection
Bypasses cross-modal limitations in visual-spatial recognition
Enhances retrieval-augmented visual question answering accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projects text to image via T2I generation
Retrieves within image modality for accuracy
Plug-and-play with off-the-shelf retrievers
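The control flow of the paradigm above can be sketched in a few lines. This is a hedged illustration only: `generate_image` and `embed_image` are hypothetical stubs standing in for a real T2I generator and an off-the-shelf image encoder (the paper's actual models are not reproduced here); only the visualize-then-retrieve structure mirrors VisRet.

```python
import math
import random

def generate_image(text_query):
    """Stub for T2I generation: project the text query into the image modality.
    A real system would call a diffusion or other T2I model here."""
    rng = random.Random(text_query)            # deterministic placeholder
    return [rng.random() for _ in range(16)]   # toy "image" feature vector

def embed_image(image):
    """Stub image encoder: returns a unit-norm embedding of the 'image'."""
    norm = math.sqrt(sum(x * x for x in image)) or 1.0
    return [x / norm for x in image]

def visret_retrieve(text_query, corpus_images, top_k=10):
    """Visualize-then-Retrieve: render the query as an image, then
    rank the corpus by image-to-image (not cross-modal) similarity."""
    query_img = generate_image(text_query)                  # 1. visualize
    q = embed_image(query_img)                              # 2. embed query image
    scores = [
        sum(a * b for a, b in zip(q, embed_image(img)))     # cosine similarity
        for img in corpus_images
    ]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:top_k]                                    # 3. ranked image ids
```

Because retrieval happens entirely inside the image modality, any off-the-shelf image embedding model can be dropped into `embed_image`, which is what makes the approach plug-and-play.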
Di Wu
University of California, Los Angeles
Yixin Wan
PhD student in Computer Science, University of California, Los Angeles
Multimodal LLM · Natural Language Processing · Fairness · Trustworthiness
Kai-Wei Chang
University of California, Los Angeles