🤖 AI Summary
This work addresses the challenge that existing vision foundation models (e.g., CLIP) struggle to generate image embeddings conditioned on fine-grained textual attributes, such as color or artistic style, without explicit supervision. We propose DIOR, the first training-free, zero-shot conditional embedding framework. DIOR prompts a large vision-language model (LVLM) to describe an image with a single word related to a given condition and extracts the final hidden state of the last token as the conditional image embedding, requiring neither fine-tuning nor additional training. By steering the model toward condition-relevant semantics through prompting alone, DIOR achieves state-of-the-art performance on multiple conditional image similarity retrieval benchmarks, surpassing both training-free baselines (e.g., CLIP) and leading supervised methods. This is the first demonstration that conditional embeddings can be obtained purely through prompting, establishing a lightweight, interpretable paradigm for condition-aware visual representation.
📝 Abstract
Conditional image embeddings are feature representations that focus on a specific aspect of an image indicated by a given textual condition (e.g., color, genre), a task that has remained challenging. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition; the hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
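The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random arrays stand in for the LVLM's final-layer hidden states, and the function names (`extract_conditional_embedding`, `cosine_similarity`) and the condition-prompt template are illustrative assumptions. In practice the hidden states would come from an LVLM run on an image with a prompt such as "Describe this image in one word in terms of <condition>".

```python
import numpy as np

def extract_conditional_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """Take the final hidden state of the last token as the conditional
    image embedding, L2-normalized so dot products give cosine similarity.
    `hidden_states` has shape (seq_len, hidden_dim), as produced by the
    LVLM's last layer for the image-plus-prompt sequence (simulated here)."""
    emb = hidden_states[-1]               # last token's hidden state
    return emb / np.linalg.norm(emb)      # unit-normalize

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two unit-norm embeddings."""
    return float(np.dot(a, b))

# Stand-ins for final-layer hidden states of two images under the same
# condition prompt (e.g., "color"); shapes are arbitrary for illustration.
rng = np.random.default_rng(0)
h_img1 = rng.normal(size=(32, 4096))
h_img2 = rng.normal(size=(32, 4096))

e1 = extract_conditional_embedding(h_img1)
e2 = extract_conditional_embedding(h_img2)
sim = cosine_similarity(e1, e2)
```

Because only one forward pass and one vector slice are involved, conditional retrieval reduces to ranking gallery images by `cosine_similarity` against a query embedding, with no training or task-specific head.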