🤖 AI Summary
This paper addresses the fragmentation between composed image retrieval (CIR) and sketch-based image retrieval (SBIR) in zero-shot image retrieval, as well as the challenge of cross-modal semantic alignment under low supervision. To this end, it formalises IGROT, a unified retrieval setting, and proposes UNION, a lightweight target representation method. UNION requires no architectural modification: it achieves optional-text-conditioned semantic alignment by fusing the image embeddings of a pretrained vision-language model with a null-text prompt. Fine-tuned on only 5,000 annotated samples, UNION achieves 38.5 mAP@50 on CIRCO and 82.7 mAP@200 on Sketchy, surpassing many heavily supervised baselines. To the authors' knowledge, this is the first work to efficiently support both zero-shot CIR and SBIR within a single framework, enabling joint image-plus-optional-text guidance without task-specific design.
📝 Abstract
Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting where a query consists of an anchor image, with or without accompanying text, aiming to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples, drawn from LlavaSCo for CIR and Training-Sketchy for SBIR, our method achieves competitive results across benchmarks, including 38.5 mAP@50 on CIRCO and 82.7 mAP@200 on Sketchy, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.
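The core idea of UNION (a target embedding obtained by fusing an image embedding with a null-text prompt embedding from a frozen vision-language model) can be sketched as follows. The abstract does not specify the fusion rule, so this illustration assumes a simple convex combination with an illustrative weight `alpha`, followed by L2 normalisation; the encoder outputs are stood in by random vectors where a real system would use, e.g., CLIP's image and text towers.

```python
import numpy as np

def l2_normalize(x):
    """Project a vector onto the unit sphere, as VLM embeddings typically are."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def union_target(image_emb, null_text_emb, alpha=0.5):
    """Hypothetical UNION-style fusion: convex combination of the target
    image embedding and the null-text prompt embedding, then re-normalised.
    The paper does not publish the exact rule; alpha is illustrative only."""
    return l2_normalize(alpha * image_emb + (1.0 - alpha) * null_text_emb)

# Toy stand-ins for frozen VLM encoders (e.g. CLIP image/text towers).
rng = np.random.default_rng(0)
d = 512
image_emb = l2_normalize(rng.normal(size=d))       # encoder(target image)
null_text_emb = l2_normalize(rng.normal(size=d))   # encoder("") null prompt

# Target representation used on the gallery side.
target = union_target(image_emb, null_text_emb)

# A multimodal query embedding is compared to targets by cosine similarity.
query = l2_normalize(image_emb + 0.1 * rng.normal(size=d))
score = float(query @ target)
```

Because both query and target live on the unit sphere, ranking reduces to a dot product, which is what makes the method drop-in for existing retrieval pipelines without architectural changes.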