Visualized Text-to-Image Retrieval

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal embedding methods struggle to model fine-grained visual-spatial features, limiting text-to-image (T2I) retrieval performance. To address this, we propose VisRet—a novel “visualize-then-retrieve” paradigm that first synthesizes an image from the text query and then performs retrieval in the image space, thereby circumventing inherent challenges in cross-modal alignment. Our key contributions are: (1) introducing the first visualize-then-retrieve paradigm for T2I retrieval; (2) constructing the first benchmark tailored to multi-entity, knowledge-intensive T2I retrieval scenarios; and (3) enabling plug-and-play integration of T2I generation, image embedding, cross-modal knowledge enhancement, and retrieval-augmented generation (RAG) modules across models. Evaluated on three knowledge-intensive benchmarks, VisRet achieves 24.5–32.7% gains in NDCG@10 and significantly improves visual question answering accuracy. Code and benchmark are publicly released.

📝 Abstract
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
Problem

Research questions and friction points this paper is trying to address.

Improves Text-to-Image retrieval via visual projection
Bypasses cross-modal limitations in visual-spatial recognition
Enhances retrieval-augmented visual question answering accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projects text to image via T2I generation
Retrieves within image modality for accuracy
Plug-and-play with off-the-shelf retrievers
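The control flow of the paradigm above can be sketched in a few lines. This is a hedged illustration only: `generate_image` and `embed_image` are hypothetical stubs standing in for a real T2I generator and an off-the-shelf image encoder (the paper's actual models are not reproduced here); only the visualize-then-retrieve structure mirrors VisRet.

```python
import math
import random

def generate_image(text_query):
    """Stub for T2I generation: project the text query into the image modality.
    A real system would call a diffusion or other T2I model here."""
    rng = random.Random(text_query)            # deterministic placeholder
    return [rng.random() for _ in range(16)]   # toy "image" feature vector

def embed_image(image):
    """Stub image encoder: returns a unit-norm embedding of the 'image'."""
    norm = math.sqrt(sum(x * x for x in image)) or 1.0
    return [x / norm for x in image]

def visret_retrieve(text_query, corpus_images, top_k=10):
    """Visualize-then-Retrieve: render the query as an image, then
    rank the corpus by image-to-image (not cross-modal) similarity."""
    query_img = generate_image(text_query)                  # 1. visualize
    q = embed_image(query_img)                              # 2. embed query image
    scores = [
        sum(a * b for a, b in zip(q, embed_image(img)))     # cosine similarity
        for img in corpus_images
    ]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:top_k]                                    # 3. ranked image ids
```

Because retrieval happens entirely inside the image modality, any off-the-shelf image embedding model can be dropped into `embed_image`, which is what makes the approach plug-and-play.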
Di Wu
University of California, Los Angeles
Yixin Wan
PhD student in Computer Science, University of California, Los Angeles
Multimodal LLM · Natural Language Processing · Fairness · Trustworthiness
Kai-Wei Chang
University of California, Los Angeles