AI Summary
To address the limited discriminability of large-scale vision-language models (VLMs) in ultra-large-scale instance retrieval, this paper proposes a query-adaptive linear feature space transformation: for each text or image query, a lightweight, learnable, query-specific projection matrix is generated dynamically, enabling per-query personalized cross-modal feature mapping. Built on VLM embeddings, the method jointly optimizes the query encoder and a domain adaptation objective, achieving substantial retrieval accuracy gains with negligible inference overhead (only ~0.1% additional parameters). It consistently outperforms state-of-the-art methods on major large-scale instance retrieval benchmarks (e.g., INSTRE, Oxford-Paris-105K, GLDv2), reduces re-ranking latency by one to two orders of magnitude, and enables real-time retrieval over tens of millions of images. The core contribution is the first introduction of query-driven linear transformations into VLM-based cross-modal retrieval, overcoming the representational bottleneck of a fixed projection matrix.
Abstract
Massive-scale pretraining has made vision-language models increasingly popular for image-to-image and text-to-image retrieval across a broad collection of domains. However, these models do not perform well when used for challenging retrieval tasks, such as instance retrieval in very large-scale image collections. Recent work has shown that linear transformations of VLM features trained for instance retrieval can improve performance by emphasizing subspaces that relate to the domain of interest. In this paper, we explore a more extreme version of this specialization by learning to map a given query to a query-specific feature space transformation. Because this transformation is linear, it can be applied with minimal computational cost to millions of image embeddings, making it effective for large-scale retrieval or re-ranking. Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time.
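To make the core idea concrete, the following is a minimal sketch of query-specific linear re-ranking, not the paper's implementation: the generator weights `W_u`/`W_v`, the low-rank form `P = I + U Vᵀ`, and all dimensions are illustrative assumptions. It shows why a linear, query-conditioned transform can be applied to a large gallery of embeddings at minimal cost.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, N = 256, 8, 50_000  # embedding dim, low rank, gallery size (assumed)

# Hypothetical generator weights mapping a query embedding to a low-rank
# projection update. In the paper's setup such a generator would be learned
# jointly with the retrieval objective; here they are random placeholders.
W_u = rng.normal(scale=0.02, size=(d * r, d))
W_v = rng.normal(scale=0.02, size=(d * r, d))

def query_projection(q):
    """Predict a query-specific linear transform P = I + U V^T (low rank)."""
    U = (W_u @ q).reshape(d, r)
    V = (W_v @ q).reshape(d, r)
    return U, V

def rerank(q, gallery, k=10):
    """Score the gallery in the query-specific feature space.
    With the low-rank form, projecting N embeddings costs O(N d r)
    rather than O(N d^2) for a dense per-query matrix."""
    U, V = query_projection(q)
    g_proj = gallery + (gallery @ V) @ U.T  # gallery @ P^T
    q_proj = q + U @ (V.T @ q)              # P @ q
    scores = g_proj @ q_proj
    return np.argsort(-scores)[:k]

# Unit-normalized stand-ins for VLM embeddings.
q = rng.normal(size=d)
q /= np.linalg.norm(q)
gallery = rng.normal(size=(N, d))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

top = rerank(q, gallery)
```

Because the transform is linear, re-ranking reduces to a couple of matrix multiplications over the precomputed gallery embeddings, which is what makes the approach tractable for collections of millions of images.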