🤖 AI Summary
Embedding-based vision-language retrieval methods (e.g., CLIP) struggle with abstract semantics and persona-driven queries (e.g., “a gift for a mother who loves gardening”). Vision-language models (vLLMs) align text with images more flexibly, but their limited context window makes them unsuitable for direct retrieval over large catalogs. To bridge this gap, we propose a knowledge distillation framework: a vLLM generates text–image preference rankings on small batches of data, and these rankings are distilled into a lightweight embedding model. The distilled model keeps the scalability and efficiency of vector-based retrieval while improving its handling of abstract semantics. Experiments on persona-driven product recommendation show that the approach significantly outperforms existing embedding-based baselines, offering a favorable trade-off between retrieval accuracy and deployment feasibility.
📝 Abstract
Text–image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text–image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., “a gift for a mother who loves gardening”). In contrast, state-of-the-art vision–language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text–image retrieval.
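The abstract does not state which training objective is used for distillation. As a rough illustration only, the sketch below assumes a listwise ListMLE (Plackett-Luce) loss: the student embedding model scores each candidate image by cosine similarity to the query, and the loss is the negative log-likelihood of the teacher vLLM's ranking under those scores. The function names (`cosine`, `listmle_loss`, `top_k`) and the toy vectors are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def listmle_loss(scores, teacher_ranking):
    """Negative log-likelihood of the teacher's ranking (ListMLE).

    scores: student similarity score for each candidate image.
    teacher_ranking: candidate indices ordered from most to least
        preferred, as produced by the teacher vLLM (assumed format).
    """
    ordered = [scores[i] for i in teacher_ranking]
    loss = 0.0
    for pos in range(len(ordered)):
        tail = ordered[pos:]
        # Stable log-sum-exp over the candidates not yet placed.
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - ordered[pos]
    return loss


def top_k(query_emb, image_embs, k):
    """Inference-time retrieval: indices of the k most similar images."""
    scores = [cosine(query_emb, e) for e in image_embs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


# Toy example: one query embedding and three candidate image embeddings.
query = [1.0, 0.0]
images = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
student_scores = [cosine(query, e) for e in images]

# The loss is lower when the teacher's ranking agrees with the student.
agree = listmle_loss(student_scores, [1, 2, 0])
disagree = listmle_loss(student_scores, [0, 2, 1])
```

In a real training loop the loss would be minimized with respect to the embedding model's parameters using an autodiff framework. At inference time only `top_k`-style vector search is needed, which is what keeps the approach scalable to large catalogs.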