UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the fragmentation between composed image retrieval (CIR) and sketch-based image retrieval (SBIR) in zero-shot image retrieval, as well as the challenge of cross-modal semantic alignment under low supervision. To this end, the authors formalise IGROT (Image-Guided Retrieval with Optional Text), a unified retrieval setting, and propose UNION, a lightweight target representation method. UNION requires no architectural modification: it achieves optional-text-conditioned semantic alignment by fusing the image embeddings of a pretrained vision-language model with a null-text prompt. Fine-tuned on only 5,000 annotated samples, UNION achieves 38.5 mAP@50 on CIRCO and 82.7 mAP@200 on Sketchy, surpassing many fully supervised methods. To the authors' knowledge, this is the first work to efficiently support both zero-shot CIR and SBIR within a single framework, enabling joint image-plus-optional-text guidance without task-specific design.

📝 Abstract
Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples (from LlavaSCo for CIR and Training-Sketchy for SBIR), our method achieves competitive results across benchmarks, including CIRCO mAP@50 of 38.5 and Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.
Problem

Research questions and friction points this paper is trying to address.

Fragmentation between composed image retrieval (CIR) and sketch-based image retrieval (SBIR), which are typically handled with task-specific designs.
Weak cross-modal semantic alignment between image-plus-optional-text queries and fixed target features.
Heavy annotation requirements of fully supervised retrieval methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight target representation that fuses the image embedding with a null-text prompt.
Requires no architectural modifications to pretrained vision-language models.
Achieves competitive results with only 5,000 training samples.
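The core idea above — building target representations by fusing a pretrained VLM's image embedding with a null-text prompt embedding, and supporting queries with or without text — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fusion rule (a convex combination with weight `alpha`), the embedding width, and the random stand-ins for CLIP-style encoder outputs are all assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def union_target(image_emb, null_text_emb, alpha=0.5):
    # Hypothetical fusion: convex combination of the gallery image embedding
    # and a fixed null-text prompt embedding (e.g. encode_text("")), L2-normalised.
    return l2_normalize(alpha * image_emb + (1 - alpha) * null_text_emb)

def igrot_query(anchor_emb, text_emb=None, alpha=0.5):
    # Optional-text query: fall back to the anchor image alone when no text is given.
    if text_emb is None:
        return l2_normalize(anchor_emb)
    return l2_normalize(alpha * anchor_emb + (1 - alpha) * text_emb)

rng = np.random.default_rng(0)
d = 512                               # typical CLIP embedding width (assumed)
null_text = rng.normal(size=d)        # stands in for the null-text prompt embedding
gallery = rng.normal(size=(1000, d))  # stands in for encoded gallery images
targets = union_target(gallery, null_text)

# Image-plus-text query; pass text_emb=None for the image-only (SBIR-like) case.
query = igrot_query(rng.normal(size=d), text_emb=rng.normal(size=d))
scores = targets @ query              # cosine similarity (both sides unit-norm)
top5 = np.argsort(-scores)[:5]        # indices of the five best-matching targets
```

Because the null-text term is shared across the whole gallery, target representations can be precomputed once and reused for both query modes, which is what makes the representation cheap at retrieval time.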