🤖 AI Summary
To address the modality gap that limits cross-modal matching in vision-language models (e.g., CLIP) for category-level text-to-image retrieval, this paper proposes a two-stage generative retrieval framework. In the first stage, a diffusion model synthesizes multiple semantically consistent visual query images from the category-level textual description. In the second stage, a vision encoder extracts features from these generated images for fine-grained similarity matching against target-image features, and a multimodal aggregation network jointly optimizes the dual-path similarities: between the generated images and the target images, and between the original text and the target images. The core innovation lies in explicitly bridging the distributional shift between the text and real-image embedding spaces by leveraging generated images as semantic intermediaries. Extensive experiments on open-vocabulary benchmarks, including CUB-200 and Oxford-102, demonstrate significant improvements over purely text-based retrieval methods, with higher retrieval accuracy and stronger generalization to unseen categories.
📝 Abstract
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
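The final fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the uniform mean over generated-image similarities and the fixed fusion weight `alpha` are simplifying stand-ins for the learned aggregation network, and the similarity values are toy numbers.

```python
import numpy as np

def fuse_scores(text_sim, gen_sims, alpha=0.5):
    """Fuse the two query paths into one retrieval score per gallery image.

    text_sim: (N,) similarity of the text query to each of N gallery images.
    gen_sims: (K, N) similarities of K generated query images to the gallery.
    alpha:    weight of the visual path (stand-in for the learned fusion).
    """
    # Aggregate the K generated queries into one visual score per image
    # (the paper uses an aggregation network; a mean is the simplest proxy).
    visual_sim = gen_sims.mean(axis=0)
    return alpha * visual_sim + (1 - alpha) * text_sim

# Toy example: 3 gallery images, 2 generated query images.
text_sim = np.array([0.2, 0.5, 0.3])
gen_sims = np.array([[0.1, 0.7, 0.4],
                     [0.3, 0.9, 0.2]])
scores = fuse_scores(text_sim, gen_sims, alpha=0.5)
ranking = np.argsort(-scores)  # indices of gallery images, best match first
```

With these toy values the visual path averages to `[0.2, 0.8, 0.3]`, so the fused scores favor the second gallery image, which both paths already rank highest.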