🤖 AI Summary
Existing composed video retrieval methods either rely on outdated architectures or require costly fine-tuning and slow caption generation, limiting their ability to harness the full potential of modern vision-language models. This work proposes PREGEN, a framework that freezes a pre-trained vision-language model and trains only a lightweight encoder to distill semantically rich, compact embeddings from the model's multi-layer hidden states, enabling efficient video retrieval without any VLM fine-tuning. The approach demonstrates strong zero-shot generalization, significantly outperforming prior methods on standard CoVR benchmarks with Recall@1 improvements of +27.23 and +69.59, while maintaining robust performance across different backbone models and under complex textual perturbations.
📝 Abstract
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
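The extraction pipeline described above (final-token hidden state from each frozen-VLM layer, pooled across layers, then projected by a small trainable encoder into a retrieval embedding) can be sketched as follows. This is an illustrative sketch only: the shapes, the mean-pooling choice, and the single linear projection are assumptions for demonstration, not the paper's actual configuration, and the VLM hidden states are faked with random tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed illustrative dimensions (not from the paper).
num_layers, seq_len, hidden_dim, embed_dim = 12, 77, 768, 256

# Stand-in for a frozen VLM's per-layer hidden states over the
# (query video + modifying text) input sequence.
hidden_states = [rng.standard_normal((seq_len, hidden_dim))
                 for _ in range(num_layers)]

# 1) Extract the final token's hidden state from every layer
#    -> shape (num_layers, hidden_dim).
last_token_states = np.stack([h[-1] for h in hidden_states])

# 2) Pool across layers (mean pooling shown as one simple option)
#    -> shape (hidden_dim,).
pooled = last_token_states.mean(axis=0)

# 3) Lightweight trainable encoder; a single linear map stands in
#    for the paper's encoder -> compact embedding of shape (embed_dim,).
W = rng.standard_normal((hidden_dim, embed_dim)) / np.sqrt(hidden_dim)
embedding = pooled @ W

# 4) L2-normalize so retrieval can score candidates by cosine similarity.
embedding /= np.linalg.norm(embedding)
```

In practice only the small encoder's parameters receive gradients; the VLM stays frozen, which is what makes the approach cheap relative to fine-tuning or caption-generation pipelines.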