Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the modality gap that limits cross-modal matching in vision-language models (e.g., CLIP) for category-level text-to-image retrieval, this paper proposes a two-stage generative retrieval framework. First, a diffusion model synthesizes multiple semantically consistent visual query images from the category-level textual description. Second, a vision encoder extracts features from these generated images, enabling fine-grained similarity matching with target image features. A multimodal aggregation network then jointly optimizes dual-path similarities: between generated images and target images, and between the original text and target images. The core innovation lies in explicitly bridging the distributional shift between the text and real-image embedding spaces by using generated images as semantic intermediaries. Extensive experiments on open-vocabulary benchmarks, including CUB-200 and Oxford-102, demonstrate significant improvements over purely text-based retrieval, with higher retrieval accuracy and stronger generalization to unseen categories.
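The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the diffusion model and vision encoder are stood in for by random feature vectors, the generated images are aggregated by simple mean pooling rather than the learned aggregation network, and the fusion weight `alpha` is an assumed hyperparameter (the paper fuses the two similarity paths with a trained network).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize features so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(text_feat, gen_img_feats, gallery_feats, alpha=0.5):
    """Rank gallery images by fusing text-to-image and image-to-image similarity.

    text_feat:     (d,)   text embedding of the category query (e.g., from CLIP).
    gen_img_feats: (k, d) vision-encoder features of k diffusion-generated images.
    gallery_feats: (n, d) vision-encoder features of the target image gallery.
    alpha:         fusion weight (assumption; the paper learns the fusion).
    """
    text_feat = l2_normalize(text_feat)
    gallery_feats = l2_normalize(gallery_feats)
    # Aggregate the k generated images into a single visual query vector
    # (mean pooling as a stand-in for the aggregation network).
    visual_query = l2_normalize(l2_normalize(gen_img_feats).mean(axis=0))
    # Dual-path similarities: image-to-image and text-to-image.
    sim_img = gallery_feats @ visual_query
    sim_txt = gallery_feats @ text_feat
    scores = alpha * sim_img + (1 - alpha) * sim_txt
    return np.argsort(-scores)  # gallery indices, best match first

# Toy run with random features standing in for real model outputs.
rng = np.random.default_rng(0)
d, k, n = 512, 4, 100
ranking = retrieve(rng.normal(size=d), rng.normal(size=(k, d)),
                   rng.normal(size=(n, d)))
```

With real models, `gen_img_feats` would come from encoding several images sampled from a text-to-image diffusion model for the same category prompt, which is what lets the final matching happen image-to-image rather than across the text-image modality gap.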

📝 Abstract
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
Problem

Research questions and friction points this paper is trying to address.

Bridging modality gap in text-to-image retrieval
Improving category-level semantic matching performance
Transforming text queries into visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models to generate visual queries
Employs vision encoders for image similarity estimation
Aggregates multiple generated images into single representation
Faizan Farooq Khan
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Vladan Stojnić
Czech Technical University in Prague
computer vision, machine learning
Zakaria Laskar
IISER TVM
computer vision, machine learning
Mohamed Elhoseiny
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Giorgos Tolias
Czech Technical University in Prague
computer vision, image retrieval