Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embedding-based vision–language retrieval methods (e.g., CLIP) struggle to model abstract semantics and personalized queries (e.g., “a gift for a mother who enjoys gardening”), while large vision-language models (vLLMs) offer fine-grained text–image alignment but are constrained by their context window, making them unsuitable for large-scale retrieval. To bridge this gap, we propose a knowledge distillation framework: a vLLM generates text–image preference rankings over small candidate batches, and these alignment priors are distilled into a lightweight embedding model. The distilled model preserves the scalability and efficiency of vector-based retrieval while substantially enhancing abstraction-aware semantic understanding. Experiments on personalized product recommendation demonstrate that our approach outperforms state-of-the-art embedding models, achieving a favorable trade-off between retrieval accuracy and deployment feasibility.
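The summary describes distilling a vLLM's preference rankings into an embedding model. The paper's exact loss is not given here; one common way to realize ranking distillation is a ListNet-style KL divergence between the teacher's preference distribution over a small candidate set and the student's cosine-similarity distribution. The sketch below assumes that formulation (the function names and temperature handling are hypothetical, not from the paper):

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Numerically stable softmax over a 1-D score vector
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_scores, query_emb, image_embs, temperature=1.0):
    """ListNet-style distillation (our assumption, not the paper's stated loss):
    KL divergence between the teacher's ranking distribution over candidates
    and the student's cosine-similarity distribution."""
    # Student scores: cosine similarity between the query and each candidate image
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    student_scores = imgs @ q
    p_teacher = softmax(teacher_scores, temperature)
    p_student = softmax(student_scores, temperature)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))
```

In training, the loss would be minimized over many (query, candidate set) batches so the student embedding space inherits the teacher's preference ordering; only small candidate sets are ever shown to the vLLM, sidestepping its context-length limit.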

📝 Abstract
Text–image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text–image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., “a gift for a mother who loves gardening”). In contrast, state-of-the-art vision–language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text–image retrieval.
Problem

Research questions and friction points this paper is trying to address.

Embedding models fail to capture abstract attributes in product recommendations
Vision-language models cannot scale to large catalog retrieval
Can distilling vLLM preferences enable scalable, personalized image retrieval?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilling vLLM preferences into embedding system
Transferring nuanced alignment abilities to embeddings
Maintaining scalable inference with embedding approach
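The last point, keeping inference scalable, follows from standard vector retrieval: once the distilled student has embedded the catalog offline, a query is answered by nearest-neighbor search over precomputed vectors rather than by calling the vLLM. A minimal sketch of that serving path (names are illustrative; a production system would use an ANN index such as FAISS instead of brute-force search):

```python
import numpy as np

def build_index(image_embs):
    # Normalize once offline so a dot product equals cosine similarity
    return image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)

def retrieve(index, query_emb, k=5):
    # Brute-force top-k by cosine similarity; O(n*d) per query
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]
```

The vLLM teacher is only needed at training time; at serving time the cost per query is a single embedding forward pass plus the similarity search.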