🤖 AI Summary
This paper proposes the Prompt Directional Vector (PDV), a training-free method that addresses three key bottlenecks in zero-shot composed image retrieval (ZS-CIR): static composed text embeddings, underutilization of image features, and suboptimal vision-language fusion. PDV models the semantic shift induced by a user prompt as a directional vector, which enables dynamic composed text embeddings whose strength is controlled by a scaling factor; it transfers the prompt's semantics into the image feature space to form composed image embeddings; and it fuses the resulting text- and image-side similarity scores with an adaptive weighting based on their reliability. Built entirely on a frozen, pre-trained CLIP model, PDV operates solely at inference time (vector scaling, feature projection, and fusion) without any parameter updates or additional training. Across multiple ZS-CIR benchmarks, PDV consistently improves retrieval performance when integrated with state-of-the-art methods, particularly those that produce accurate compositional embeddings, and it offers plug-and-play deployment at negligible computational overhead.
📝 Abstract
Zero-shot composed image retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be publicly available.
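The three improvements described above can be sketched in a few lines of vector arithmetic. This is a minimal illustration, not the paper's implementation: the random vectors stand in for frozen CLIP embeddings, and `alpha` and `w` are illustrative hyperparameters (the paper adapts the fusion weight to embedding reliability rather than fixing it).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (CLIP would use e.g. 512 or 768)

def normalize(v):
    return v / np.linalg.norm(v)

# Stand-in embeddings (assumptions): in practice these come from a frozen CLIP model.
ref_image_emb = normalize(rng.normal(size=D))      # reference image embedding
ref_text_emb = normalize(rng.normal(size=D))       # text embedding of the reference content
composed_text_emb = normalize(rng.normal(size=D))  # text embedding with the user's prompt applied

# (1) Prompt Directional Vector: the semantic shift induced by the prompt.
pdv = composed_text_emb - ref_text_emb

# Dynamic composed text embedding, controllable via a scaling factor alpha.
alpha = 1.5
dyn_text_emb = normalize(ref_text_emb + alpha * pdv)

# (2) Composed image embedding: transfer the prompt's semantics into image space.
composed_image_emb = normalize(ref_image_emb + alpha * pdv)

# (3) Weighted fusion of text- and image-side similarities against a candidate.
candidate_emb = normalize(rng.normal(size=D))  # a gallery image embedding
w = 0.6  # illustrative fusion weight
score = w * (dyn_text_emb @ candidate_emb) + (1 - w) * (composed_image_emb @ candidate_emb)
```

Because every step is a vector addition, scaling, or dot product on precomputed embeddings, the enhancement adds essentially no cost on top of the base retrieval pipeline, which is what makes it plug-and-play.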