🤖 AI Summary
This work addresses the challenge that existing video retrieval methods struggle to simultaneously support corpus-scale search, fine-grained temporal localization, and composed multimodal queries. To this end, we propose a unified video retrieval framework built on a multimodal large language model (MLLM), which uses a shared MLLM backbone to generate aligned vision–text embeddings. The model is trained efficiently on 700K paired visual–text samples using contrastive learning and LoRA fine-tuning, and incorporates a re-ranking mechanism to improve retrieval accuracy. Notably, the approach enables zero-shot moment retrieval without additional training and achieves state-of-the-art performance on zero-shot composed video retrieval. After re-ranking, its results rival those of large-scale specialized models, demonstrating both versatility and effectiveness.
📝 Abstract
Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they cannot process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search, but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus- and moment-level retrieval while accommodating composed multimodal queries within a single architecture. Visual and textual embeddings generated by a shared MLLM backbone are contrastively aligned, enabling efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual–text samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state-of-the-art results on zero-shot composed video retrieval. With additional training to rerank candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state-of-the-art specialized models trained on orders of magnitude more data.
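Contrastive alignment of vision and text embeddings, as described above, is commonly implemented with a symmetric InfoNCE objective: matched pairs in a batch are pulled together while all other in-batch pairings serve as negatives. The following is a minimal NumPy sketch of that generic objective, not the paper's actual implementation; the function name, temperature value, and batch shapes are illustrative assumptions.

```python
import numpy as np

def info_nce(vision_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vision_emb, text_emb: (batch, dim) arrays where row i of each
    array forms a positive pair; all other rows act as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(v))              # positives lie on the diagonal

    def xent(l):
        # Numerically stable log-softmax cross-entropy on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the vision->text and text->vision directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

The loss is low when each visual embedding is most similar to its own caption's embedding, which is what makes the resulting embedding space usable for efficient nearest-neighbor candidate search before any re-ranking step.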