Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the modality gap that limits cross-modal matching in vision-language models (e.g., CLIP) for category-level text-to-image retrieval, this paper proposes a two-stage generative retrieval framework. First, a diffusion model synthesizes multiple semantically consistent visual query images from the category-level textual description. Second, a vision encoder extracts features from these generated images, enabling fine-grained similarity matching with target image features. A multimodal aggregation network then jointly optimizes dual-path similarities: between generated images and target images, and between the original text and target images. The core innovation lies in explicitly bridging the distributional shift between the text and real-image embedding spaces by using generated images as semantic intermediaries. Extensive experiments on open-vocabulary benchmarks, including CUB-200 and Oxford-102, demonstrate significant improvements over purely text-based retrieval, with higher retrieval accuracy and stronger generalization to unseen categories.
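The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the diffusion model and vision encoder are stood in for by random feature vectors, the generated images are aggregated by simple mean pooling rather than the learned aggregation network, and the fusion weight `alpha` is an assumed hyperparameter (the paper fuses the two similarity paths with a trained network).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize features so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(text_feat, gen_img_feats, gallery_feats, alpha=0.5):
    """Rank gallery images by fusing text-to-image and image-to-image similarity.

    text_feat:     (d,)   text embedding of the category query (e.g., from CLIP).
    gen_img_feats: (k, d) vision-encoder features of k diffusion-generated images.
    gallery_feats: (n, d) vision-encoder features of the target image gallery.
    alpha:         fusion weight (assumption; the paper learns the fusion).
    """
    text_feat = l2_normalize(text_feat)
    gallery_feats = l2_normalize(gallery_feats)
    # Aggregate the k generated images into a single visual query vector
    # (mean pooling as a stand-in for the aggregation network).
    visual_query = l2_normalize(l2_normalize(gen_img_feats).mean(axis=0))
    # Dual-path similarities: image-to-image and text-to-image.
    sim_img = gallery_feats @ visual_query
    sim_txt = gallery_feats @ text_feat
    scores = alpha * sim_img + (1 - alpha) * sim_txt
    return np.argsort(-scores)  # gallery indices, best match first

# Toy run with random features standing in for real model outputs.
rng = np.random.default_rng(0)
d, k, n = 512, 4, 100
ranking = retrieve(rng.normal(size=d), rng.normal(size=(k, d)),
                   rng.normal(size=(n, d)))
```

With real models, `gen_img_feats` would come from encoding several images sampled from a text-to-image diffusion model for the same category prompt, which is what lets the final matching happen image-to-image rather than across the text-image modality gap.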

📝 Abstract
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
Problem

Research questions and friction points this paper is trying to address.

Bridging modality gap in text-to-image retrieval
Improving category-level semantic matching performance
Transforming text queries into visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion models to generate visual queries
Employs vision encoders for image similarity estimation
Aggregates multiple generated images into single representation
Faizan Farooq Khan
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Vladan Stojnić
Czech Technical University in Prague
computer vision, machine learning
Zakaria Laskar
IISER TVM
computer vision, machine learning
Mohamed Elhoseiny
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Giorgos Tolias
Czech Technical University in Prague
computer vision, image retrieval